Gentoo Forums
[SOLVED] mdadm RAID1 - replacing a failed drive
cami
n00b

Joined: 15 Jan 2005
Posts: 36

Posted: Tue Jul 26, 2016 4:38 pm    Post subject: [SOLVED] mdadm RAID1 - replacing a failed drive

So just one day after I got my RAID set up (see [SOLVED] How to properly boot a custom initramfs?), a disk failed permanently. I've installed an identical replacement, but I cannot figure out how to get it used by the array. The idea behind using RAID was to make this easy, but despite trying really hard I only find more and more questions instead of answers.

I initially created a full-disk RAID-1 on two identical disks using Intel Storage Manager (X58 chipset).

Code:
mdadm --examine /dev/sda
/dev/sda:
          Magic : Intel Raid ISM Cfg Sig.
        Version : 1.1.00
    Orig Family : 1d385601
         Family : 83eb12c3
     Generation : 00005e60
     Attributes : All supported
           UUID : eb43e025:cff7929e:9af766e7:f2d60015
       Checksum : d65c6715 correct
    MPB Sectors : 2
          Disks : 2
   RAID Devices : 1

  Disk01 Serial : PK1134P6JWDGUW
          State : active
             Id : 00010000
    Usable Size : 3907023112 (1863.01 GiB 2000.40 GB)

[Volume0]:
           UUID : 083b0d35:d926f293:ef50839b:4f023f76
     RAID Level : 1 <-- 1
        Members : 2 <-- 2
          Slots : [UU] <-- [__]
    Failed disk : 1
      This Slot : 1 (out-of-sync)
     Array Size : 3907022848 (1863.01 GiB 2000.40 GB)
   Per Dev Size : 3907023112 (1863.01 GiB 2000.40 GB)
  Sector Offset : 0
    Num Stripes : 15261808
     Chunk Size : 64 KiB <-- 64 KiB
       Reserved : 0
  Migrate State : rebuild
      Map State : degraded <-- degraded
     Checkpoint : 0 (512)
    Dirty State : dirty

  Disk00 Serial : 134P6JVNVHW:0:0
          State : active failed
             Id : ffffffff
    Usable Size : 3907022936 (1863.01 GiB 2000.40 GB)


The last lines represent the failed disk. It doesn't physically exist anymore. The other disk ( Disk01 Serial : PK1134P6JWDGUW) is attached as /dev/sda. The new drive is attached as /dev/sdb, but not used in any way yet.

Code:
NAME          MAJ:MIN  RM  SIZE RO TYPE  MOUNTPOINT
sda             8:0     0  1,8T  0 disk 
└─md_d127     254:8128  0  1,8T  0 raid1
  ├─md_d127p1 254:8129  0 1023M  0 md   
  ├─md_d127p2 254:8130  0   31G  0 md    [SWAP]
  └─md_d127p3 254:8131  0  1,8T  0 md    /
sdb             8:16    0  1,8T  0 disk 



Intel storage manager UI only lets me create or delete arrays, but not replace drives. So I have to do this using mdadm somehow. I already ran
Code:
mdadm --manage /dev/md127 --remove failed

It exited without saying anything. I'm not sure whether it did something.

The first thing I do not understand is why there is a separate "container" (md127) and an "array" (md_d127), what each of these is, and when to use which. Most sources on the net have just one "md0". Documentation on containers is very brief.

The second thing I do not understand is the output of /proc/mdstat, mdadm --detail and mdadm --examine. The documentation doesn't explain very well what the differences are or how to interpret the output. As far as I understand, --examine reads a metadata block from the physical drives. I couldn't figure out what --detail and /proc/mdstat do.

Code:
cat /proc/mdstat
Personalities : [raid1] [raid10] [raid6] [raid5] [raid4]
md_d127 : active raid1 sda[0]
      1953511424 blocks super external:/md127/0 [2/1] [U_]
     
md127 : inactive sda[0](S)
      3028 blocks super external:imsm
       
unused devices: <none>
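
A note on reading the status lines above, as a minimal sketch rather than anything authoritative: in /proc/mdstat, `[2/1]` means the array expects two members and has one active, `[U_]` marks which slot is up, and `super external:/md127/0` says the array's metadata is managed by container md127 (member 0). The helper name `mdstat_summary` is made up for illustration; it only parses text and changes nothing.

```shell
# Summarise arrays from /proc/mdstat-style text on stdin.
# Prints one line per array: name, expected/active members, state, container.
# (Hypothetical helper; usage on a live system: mdstat_summary < /proc/mdstat)
mdstat_summary() {
    awk '
        /^md[^ ]* :/ { name = $1; next }
        name != "" && /blocks/ {
            container = "-"
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^external:\//) {
                    split($i, parts, "/")   # external:/md127/0 -> external: md127 0
                    container = parts[2]
                }
            }
            # Containers have no [n/m] counter, so they are skipped here.
            if (match($0, /\[[0-9]+\/[0-9]+\]/)) {
                counts = substr($0, RSTART + 1, RLENGTH - 2)
                split(counts, n, "/")
                state = (n[1] == n[2]) ? "ok" : "degraded"
                print name, counts, state, container
            }
            name = ""
        }
    '
}
```

Fed the output above, it prints `md_d127 2/1 degraded md127`; the inactive container md127 produces no line.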


Code:
mdadm --detail /dev/md127
/dev/md127:
        Version : imsm
     Raid Level : container
  Total Devices : 1

Working Devices : 1


           UUID : eb43e025:cff7929e:9af766e7:f2d60015
  Member Arrays : /dev/md/Volume0_0

    Number   Major   Minor   RaidDevice

       0       8        0        -        /dev/sda


Code:
mdadm --detail /dev/md_d127
/dev/md_d127:
      Container : /dev/md/imsm0, member 0
     Raid Level : raid1
     Array Size : 1953511424 (1863.01 GiB 2000.40 GB)
  Used Dev Size : 1953511556 (1863.01 GiB 2000.40 GB)
   Raid Devices : 2
  Total Devices : 1

          State : active, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0


           UUID : 083b0d35:d926f293:ef50839b:4f023f76
    Number   Major   Minor   RaidDevice State
       0       8        0        0      active sync   /dev/sda
       2       0        0        2      removed


I can add /dev/sdb to the container /dev/md127 and remove it again, but that doesn't seem to affect the actual array.

Code:
cami ~ # cat /proc/mdstat
Personalities : [raid1] [raid10] [raid6] [raid5] [raid4]
md_d127 : active raid1 sda[0]
      1953511424 blocks super external:/md127/0 [2/1] [U_]
     
md127 : inactive sdb[1](S) sda[0](S)
      6056 blocks super external:imsm
       
unused devices: <none>
cami ~ # lsblk
NAME          MAJ:MIN  RM  SIZE RO TYPE  MOUNTPOINT
sda             8:0     0  1,8T  0 disk 
└─md_d127     254:8128  0  1,8T  0 raid1
  ├─md_d127p1 254:8129  0 1023M  0 md   
  ├─md_d127p2 254:8130  0   31G  0 md    [SWAP]
  └─md_d127p3 254:8131  0  1,8T  0 md    /
sdb             8:16    0  1,8T  0 disk 
cami ~ # mdadm --detail /dev/md127
/dev/md127:
        Version : imsm
     Raid Level : container
  Total Devices : 2

Working Devices : 2


           UUID : eb43e025:cff7929e:9af766e7:f2d60015
  Member Arrays : /dev/md/Volume0_0

    Number   Major   Minor   RaidDevice

       0       8        0        -        /dev/sda
       1       8       16        -        /dev/sdb
cami ~ # mdadm --detail /dev/md_d127
/dev/md_d127:
      Container : /dev/md/imsm0, member 0
     Raid Level : raid1
     Array Size : 1953511424 (1863.01 GiB 2000.40 GB)
  Used Dev Size : 1953511556 (1863.01 GiB 2000.40 GB)
   Raid Devices : 2
  Total Devices : 1

          State : active, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0


           UUID : 083b0d35:d926f293:ef50839b:4f023f76
    Number   Major   Minor   RaidDevice State
       0       8        0        0      active sync   /dev/sda
       2       0        0        2      removed
cami ~ # mdadm --examine /dev/sdb
/dev/sdb:
          Magic : Intel Raid ISM Cfg Sig.
        Version : 1.0.00
    Orig Family : 00000000
         Family : e3724720
     Generation : 00000001
     Attributes : All supported
           UUID : 00000000:00000000:00000000:00000000
       Checksum : 01a96b92 correct
    MPB Sectors : 1
          Disks : 1
   RAID Devices : 0

  Disk00 Serial : PK1134P6JVNVHW
          State : spare
             Id : 03000000
    Usable Size : 3907026958 (1863.02 GiB 2000.40 GB)

    Disk Serial : PK1134P6JVNVHW
          State : spare
             Id : 03000000
    Usable Size : 3907026958 (1863.02 GiB 2000.40 GB)
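
One thing the two --examine dumps in this post allow is a comparison of the container UUIDs: /dev/sda reports eb43e025:cff7929e:9af766e7:f2d60015 (the same value mdadm --detail /dev/md127 shows), while /dev/sdb reports all zeros with State: spare. The sketch below assumes the first "UUID :" line of the output is the container-level one, as it is in the dumps here. The function names and the file names in the usage note are hypothetical.

```shell
# Print the first "UUID :" value found in `mdadm --examine` output on stdin.
# Assumption: that first UUID is the container UUID, as in the dumps above.
examine_container_uuid() {
    awk '$1 == "UUID" && $2 == ":" { print $3; exit }'
}

# Succeed only if two saved --examine dumps carry the same container UUID.
same_container() {
    [ "$(examine_container_uuid < "$1")" = "$(examine_container_uuid < "$2")" ]
}
```

Possible usage: save each dump with `mdadm --examine /dev/sda > sda.txt` (likewise for sdb), then run `same_container sda.txt sdb.txt || echo "disks report different containers"`.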


The RAID contains the root filesystem, so it's not easy to stop and reassemble the array, albeit possible (using a boot CD). I hoped for an easy solution. Easy replacement was the idea behind the setup, after all. But so far I haven't found any solution at all that doesn't require recreating the array and losing the data.


Last edited by cami on Wed Jul 27, 2016 12:05 pm; edited 2 times in total

frostschutz
Advocate

Joined: 22 Feb 2005
Posts: 2970
Location: Germany

Posted: Tue Jul 26, 2016 4:49 pm

Is there a Windows installation on this machine? For Linux it's best to stick to the native mdadm format rather than using Intel Storage Manager.

You don't have to remove failed. Just ignore it.

My guess is that you need `mdadm /dev/md_d127 --add /dev/sdb` but I could be wrong because I don't use imsm format.

cami
n00b

Joined: 15 Jan 2005
Posts: 36

Posted: Tue Jul 26, 2016 7:46 pm

Well I already issued the suggested command without achieving the desired result (see OP for details).

I already noticed imsm might not have been the best choice but now I'm kind of stuck with it.

frostschutz
Advocate

Joined: 22 Feb 2005
Posts: 2970
Location: Germany

Posted: Tue Jul 26, 2016 8:20 pm

You're not really showing that in your post... and you only talk of adding to md127 not md_dangnabbit127

If that doesn't work, could you show output for file -s and parted print for each disk?

Code:

for disk in /dev/sda* /dev/sdb* /dev/md* /dev/md*/*
do
    [ -b "$disk" ] || continue    # skip unmatched globs and the /dev/md directory itself
    file -sL "$disk"
    parted "$disk" unit s print free
done

cami
n00b

Joined: 15 Jan 2005
Posts: 36

Posted: Tue Jul 26, 2016 8:56 pm

Oh sorry, I overlooked that bit. It is not possible to --add to the array directly; mdadm says I should add to the container instead.

I will post the output of the requested commands tomorrow. Note however that it's full-disk raid. I also included the outputs of mdadm --examine of the two disks in the OP, maybe that helps for the time being.

frostschutz
Advocate

Joined: 22 Feb 2005
Posts: 2970
Location: Germany

Posted: Tue Jul 26, 2016 10:05 pm

cami wrote:
Note however that it's full-disk raid.


There are currently two threads on the linux-raid mailing list by people who destroyed their RAID due to it being a full-disk raid. ( http://www.spinics.net/lists/raid/msg53033.html http://www.spinics.net/lists/raid/msg53046.html )

Their mistake: They partitioned their full disk RAID with GPT, then ran a partitioner on ... the full disk.

Partitioner sees GPT data at either start or end of the disk (GPT keeps a backup at the end), and restores/rebuilds the "missing" GPT on the other end of the disk - and there goes your RAID metadata bye-bye.
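
To make the overlap concrete, here is a rough sketch assuming 512-byte sectors and the default GPT layout (128 entries of 128 bytes, i.e. 32 sectors of entries followed by a one-sector header in the last LBA). The function name is made up, and the sector count in the example is an assumed value for a 2 TB disk, not taken from this thread; `blockdev --getsz /dev/sdX` would report the real one.

```shell
# Print the first sector of the backup GPT area at the end of a disk.
# Assumptions: 512-byte sectors, 128 partition entries of 128 bytes each.
gpt_backup_start() {
    total_sectors=$1
    entry_sectors=$(( 128 * 128 / 512 ))            # 32 sectors of partition entries
    echo $(( total_sectors - 1 - entry_sectors ))   # entries first, header in the last LBA
}

# Example with an assumed disk size of 3907029168 sectors:
gpt_backup_start 3907029168
```

Everything from the printed sector to the end of the disk is rewritten when a partitioner restores the backup GPT, so RAID metadata kept in that region would be lost.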

I never do full-disk RAID, or full-disk anything for that matter, there's just too many ways for it to go wrong in unexpected ways. Always use a partition table.

My suggestion for you is to bite the bullet and do it over. If your current RAID is still working, you can use sdb to build a new structure from scratch, this time with a traditional disk -> partitions -> md -> filesystem structure.

cami
n00b

Joined: 15 Jan 2005
Posts: 36

Posted: Wed Jul 27, 2016 8:55 am

Thanks for your advice. I already noticed the setup choices might not have been the best.

For completeness, I did not do anything fancy with the disks, I only swapped the failed drive. The RAID is still working, only degraded. So this is basically the standard situation RAID is designed for.

I strongly doubt it has anything to do with partitions, however; I believe I would have the exact same problem if it were sda1 and sdb1 instead.

So I'm still looking for a proper solution without starting over. If starting over were the only solution, RAID 1 would be pointless, and a standard backup would be more efficient. So could we assume it's unrelated to partitioning and pretend the RAID is on an individual partition?

cami
n00b

Joined: 15 Jan 2005
Posts: 36

Posted: Wed Jul 27, 2016 12:01 pm

Update. I was able to solve the problem today, although I still don't understand what happened. Here's what I did:

  1. Booted the system using a Gentoo LiveCD
  2. Noticed that the LiveCD found two containers (imsm0 and imsm1) and one volume (Volume0_0)
    Code:
    $ ls /dev/md
    Volume0_0 imsm0 imsm1

  3. Found that Volume0_0 was using container imsm1
    Code:
    mdadm --detail /dev/md/Volume0_0

  4. Checked metadata on /dev/sda and /dev/sdb (see OP for outputs)
    Code:
    $ mdadm --examine /dev/sda
    ...
    $ mdadm --examine /dev/sdb
    ...

  5. Observed that /dev/sda contained the Intel Storage Manager metadata for my RAID, with the first disk missing and the second disk being /dev/sda itself. (see also OP)
  6. Observed that /dev/sdb contained Intel Storage Manager metadata for a spare without any assigned volume (see also OP)
  7. Assumed that container imsm0 consisted of the spare /dev/sdb
  8. Stopped container /dev/md/imsm0
    Code:
    mdadm --manage /dev/md/imsm0 --stop

  9. Added /dev/sdb to container /dev/md/imsm1
    Code:
    mdadm --manage /dev/md/imsm1 --add /dev/sdb

  10. I could hear that this started a rebuild.
  11. I don't know why this didn't work while the system was running. Somehow mdadm must have added /dev/sdb to a new container instead of the one I specified. The LiveCD and my system use different versions of mdadm, so maybe it's a bug?
    Code:
    # mdadm --version # this is the potentially buggy version
    mdadm - v3.3.1 - 5th June 2014

  12. Checked what was going on
    Code:
    # cat /proc/mdstat
    Personalities : [raid1] [raid10] [raid6] [raid5] [raid4]
    md125 : active raid1 sdb[1] sda[0]
          1953511424 blocks super external:/md126/0 [2/1] [_U]
          [==>..................]  recovery = 10.8% (212502848/1953511556) finish=223.3min speed=129900K/sec

    md126 : inactive sda[1](S) sdb[0](S)
          6056 blocks super external:imsm

    unused devices: <none>

  13. Checked disk metadata
    Code:
    cami ~ # mdadm --examine /dev/sda
    /dev/sda:
              Magic : Intel Raid ISM Cfg Sig.
            Version : 1.1.00
        Orig Family : 1d385601
             Family : 5a6ea771
         Generation : 00005e95
         Attributes : All supported
               UUID : eb43e025:cff7929e:9af766e7:f2d60015
           Checksum : 60a930ae correct
        MPB Sectors : 2
              Disks : 2
       RAID Devices : 1

      Disk01 Serial : PK1134P6JWDGUW
              State : active
                 Id : 00010000
        Usable Size : 3907023112 (1863.01 GiB 2000.40 GB)

    [Volume0]:
               UUID : 083b0d35:d926f293:ef50839b:4f023f76
         RAID Level : 1 <-- 1
            Members : 2 <-- 2
              Slots : [UU] <-- [_U]
        Failed disk : 0
          This Slot : 1
         Array Size : 3907022848 (1863.01 GiB 2000.40 GB)
       Per Dev Size : 3907023112 (1863.01 GiB 2000.40 GB)
      Sector Offset : 0
        Num Stripes : 15261808
         Chunk Size : 64 KiB <-- 64 KiB
           Reserved : 0
      Migrate State : rebuild
          Map State : normal <-- degraded
         Checkpoint : 787874 (512)
        Dirty State : dirty

      Disk00 Serial : PK1134P6JVNVHW
              State : active
                 Id : 00030000
        Usable Size : 3907023112 (1863.01 GiB 2000.40 GB)
    cami ~ # mdadm --examine /dev/sdb
    /dev/sdb:
              Magic : Intel Raid ISM Cfg Sig.
            Version : 1.1.00
        Orig Family : 1d385601
             Family : 5a6ea771
         Generation : 00005e95
         Attributes : All supported
               UUID : eb43e025:cff7929e:9af766e7:f2d60015
           Checksum : 60a930ae correct
        MPB Sectors : 2
              Disks : 2
       RAID Devices : 1

      Disk00 Serial : PK1134P6JVNVHW
              State : active
                 Id : 00030000
        Usable Size : 3907023112 (1863.01 GiB 2000.40 GB)

    [Volume0]:
               UUID : 083b0d35:d926f293:ef50839b:4f023f76
         RAID Level : 1 <-- 1
            Members : 2 <-- 2
              Slots : [UU] <-- [_U]
        Failed disk : 0
          This Slot : 0 (out-of-sync)
         Array Size : 3907022848 (1863.01 GiB 2000.40 GB)
       Per Dev Size : 3907023112 (1863.01 GiB 2000.40 GB)
      Sector Offset : 0
        Num Stripes : 15261808
         Chunk Size : 64 KiB <-- 64 KiB
           Reserved : 0
      Migrate State : rebuild
          Map State : normal <-- degraded
         Checkpoint : 787874 (512)
        Dirty State : dirty

      Disk01 Serial : PK1134P6JWDGUW
              State : active
                 Id : 00010000
        Usable Size : 3907023112 (1863.01 GiB 2000.40 GB)

  14. Tested the array by mounting the partitions and accessing some files and directories.
    Code:
     # lsblk
    NAME          MAJ:MIN  RM  SIZE RO TYPE  MOUNTPOINT
    sda             8:0     0  1,8T  0 disk 
    └─md125       254:8128  0  1,8T  0 raid1
      ├─md125p1   254:8129  0 1023M  0 md   
      ├─md125p2   254:8130  0   31G  0 md    [SWAP]
      └─md125p3   254:8131  0  1,8T  0 md    /
    sdb             8:16    0  1,8T  0 disk 
    └─md125       254:8128  0  1,8T  0 raid1
      ├─md125p1   254:8129  0 1023M  0 md   
      ├─md125p2   254:8130  0   31G  0 md    [SWAP]
      └─md125p3   254:8131  0  1,8T  0 md    /
    # mount /dev/md/Volume0_0p3 /mnt/gentoo # /dev/md/Volume0_0p3 symlinks to /dev/md125p3
    # ...
    # umount /mnt/gentoo

  15. I didn't want to wait for recovery to finish, so I stopped the array, checked everything was offline, checked metadata again, and rebooted.
    Code:
    mdadm --manage /dev/md/Volume0_0 --stop
    mdadm --manage /dev/md/imsm1 --stop
    cat /proc/mdstat # this should say "no devices"
    mdadm --examine /dev/sda
    mdadm --examine /dev/sdb
    reboot

  16. During boot, Intel Storage Manager showed the RAID with both disks attached and state "Rebuild" (i.e. recovery), and the system came up normally.
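
As a recap of steps 8, 9 and 12 only, here is a dry-run sketch that prints the commands instead of running them. The function name is invented for illustration, and the device names are the ones the LiveCD happened to assign in this thread; on another boot or another machine they will likely differ, so check `ls /dev/md` and `mdadm --detail` first.

```shell
# Print (do not execute) the LiveCD commands used in the steps above.
# $1: stray container holding only the spare
# $2: container that the degraded volume belongs to
# $3: replacement disk
imsm_readd_plan() {
    stray=$1 target=$2 disk=$3
    echo "mdadm --manage $stray --stop"
    echo "mdadm --manage $target --add $disk"
    echo "cat /proc/mdstat"
}

# Names as they appeared in this thread:
imsm_readd_plan /dev/md/imsm0 /dev/md/imsm1 /dev/sdb
```

Printing first and pasting the lines by hand keeps a mistyped device name from touching the wrong disk.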