Gentoo Forums
Help please with broken RAID5 array
Goverp
l33t


Joined: 07 Mar 2007
Posts: 804

Posted: Tue Aug 06, 2019 10:19 pm    Post subject: Help please with broken RAID5 array

Hi folks, I've a broken 4-disk mdadm RAID 5 array. The message I get is:
Code:
mdadm: /dev/md127 assembled from 3 drives - not enough to start the array while not clean - consider --force.

Background is that my DVD drive stopped working; this morning I replaced its cable and it came back to life. Job done - or not :-(
Its SATA cable was the grey one; the disks used red ones. Except one disk also had a grey one, so what I'd actually done was pull that disk's cable off the motherboard and plug the DVD drive in instead. Which means the array was degraded, 3 drives out of 4, but I didn't notice. (Memo to self: write something that sticks a dirty great message on the desktop when that happens!)

So at some point this afternoon something went wrong, and so mdadm barfed as two disks now have different event counts. Specifically:
Code:
/dev/sda3: 213959
/dev/sdb3: 220678
/dev/sdc3: 220678
/dev/sdd3: 220679

The device checksums are OK. Disks b and c are consistent; disk d somehow has one more event (not sure what happened; I think the freezer turning on part way through booting may have thrown a power glitch at the PC).
The question is, how safe is it to follow the suggestion to use --force, and if it is OK, what's the best incantation?
If all else fails, I have a backup from this morning. Yea! :D but I still did some work today, and I'd rather not try to guess what it was...
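
A minimal sketch for pulling just the event counts out of the mdadm --examine output, so the members can be compared at a glance. The helper names examine_events and events_consistent are made up for illustration, and the parsing assumes the text layout shown in this thread; other mdadm versions may format things differently.

```shell
#!/bin/sh
# Hypothetical helpers, not part of mdadm.

# Reads `mdadm --examine` output on stdin and prints "<device> <events>"
# for each member superblock found.
examine_events() {
    awk '
        /^\/dev\// { dev = $1; sub(/:$/, "", dev) }      # "/dev/sda3:" header line
        $1 == "Events" && $2 == ":" { print dev, $3 }    # "Events : 213959"
    '
}

# Reads "<device> <events>" lines on stdin; succeeds only if every
# member reports the same event count.
events_consistent() {
    [ "$(awk '{ print $2 }' | sort -u | wc -l)" -eq 1 ]
}

# Usage (as root, against the real members):
#   mdadm --examine /dev/sd[abcd]3 | examine_events
```

A member whose count is far behind the others (sda3 here) is the stale one; a difference of one between otherwise current members is the situation --force is meant to get past.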

My guess is to try:
Code:
mdadm --assemble --force /dev/md127 /dev/sdb3 /dev/sdc3 /dev/sdd3 /dev/sda3
and then run an fsck.ext4. If I'm right that the power glitch interrupted something during boot, the only updates that might be in the extra event are transitory updates that would have been overwritten at the next boot.

Thanks for any advice and guidance.

For completeness, here's the mdadm --examine output, as I read NeddySeagoon advising someone in a similar situation to post it in case it's later needed for reference:
Code:
root@paul-Aspire-M3201:~# mdadm --examine /dev/sda3
/dev/sda3:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : c80d66f1:9bc95138:78a82083:d3cc09dc
           Name : acer:gentoo  (local to host acer)
  Creation Time : Sun Mar 21 10:51:13 2010
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 524281000 (250.00 GiB 268.43 GB)
     Array Size : 786421440 (749.99 GiB 805.30 GB)
  Used Dev Size : 524280960 (250.00 GiB 268.43 GB)
   Super Offset : 524281256 sectors
   Unused Space : before=0 sectors, after=296 sectors
          State : clean
    Device UUID : c73279a8:40e0c514:455cf43f:90d85e02

Internal Bitmap : 2 sectors from superblock
    Update Time : Mon Aug  5 20:38:24 2019
       Checksum : 198a9223 - correct
         Events : 213959

         Layout : left-symmetric
     Chunk Size : 64K

   Device Role : Active device 2
   Array State : AAAA ('A' == active, '.' == missing, 'R' == replacing)
root@paul-Aspire-M3201:~# mdadm --examine /dev/sdb3


/dev/sdb3:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : c80d66f1:9bc95138:78a82083:d3cc09dc
           Name : acer:gentoo  (local to host acer)
  Creation Time : Sun Mar 21 10:51:13 2010
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 524281000 (250.00 GiB 268.43 GB)
     Array Size : 786421440 (749.99 GiB 805.30 GB)
  Used Dev Size : 524280960 (250.00 GiB 268.43 GB)
   Super Offset : 524281256 sectors
   Unused Space : before=0 sectors, after=296 sectors
          State : active
    Device UUID : fc377d41:9dd67ea4:01b65109:0e56d006

Internal Bitmap : 2 sectors from superblock
    Update Time : Tue Aug  6 18:36:35 2019
       Checksum : 1018b414 - correct
         Events : 220678

         Layout : left-symmetric
     Chunk Size : 64K

   Device Role : Active device 0
   Array State : AA.A ('A' == active, '.' == missing, 'R' == replacing)


root@paul-Aspire-M3201:~# mdadm --examine /dev/sdc3
/dev/sdc3:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : c80d66f1:9bc95138:78a82083:d3cc09dc
           Name : acer:gentoo  (local to host acer)
  Creation Time : Sun Mar 21 10:51:13 2010
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 524281000 (250.00 GiB 268.43 GB)
     Array Size : 786421440 (749.99 GiB 805.30 GB)
  Used Dev Size : 524280960 (250.00 GiB 268.43 GB)
   Super Offset : 524281256 sectors
   Unused Space : before=0 sectors, after=296 sectors
          State : active
    Device UUID : 971b8827:c03d3ebf:09ae0ab2:4c3555bc

Internal Bitmap : 2 sectors from superblock
    Update Time : Tue Aug  6 18:36:35 2019
       Checksum : 6f20d61d - correct
         Events : 220678

         Layout : left-symmetric
     Chunk Size : 64K

   Device Role : Active device 3
   Array State : AA.A ('A' == active, '.' == missing, 'R' == replacing)


root@paul-Aspire-M3201:~# mdadm --examine /dev/sdd3
/dev/sdd3:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : c80d66f1:9bc95138:78a82083:d3cc09dc
           Name : acer:gentoo  (local to host acer)
  Creation Time : Sun Mar 21 10:51:13 2010
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 524281000 (250.00 GiB 268.43 GB)
     Array Size : 786421440 (749.99 GiB 805.30 GB)
  Used Dev Size : 524280960 (250.00 GiB 268.43 GB)
   Super Offset : 524281256 sectors
   Unused Space : before=0 sectors, after=296 sectors
          State : clean
    Device UUID : dec77a83:23ce45c9:dc9cebb0:0ccb9d35

Internal Bitmap : 2 sectors from superblock
    Update Time : Tue Aug  6 18:36:37 2019
       Checksum : 4d44975e - correct
         Events : 220679

         Layout : left-symmetric
     Chunk Size : 64K

   Device Role : Active device 1
   Array State : AA.A ('A' == active, '.' == missing, 'R' == replacing)

_________________
Greybeard
Goverp
l33t


Joined: 07 Mar 2007
Posts: 804

Posted: Wed Aug 07, 2019 12:26 pm

Oh well, ran out of time.

Did
Code:
mdadm --assemble --force /dev/md127 /dev/sda3 /dev/sdb3 /dev/sdc3 /dev/sdd3
# which assembled it with three disks, b,c and d, so:
mdadm --manage /dev/md127 --re-add /dev/sda3
# which worked; took about 5 mins, maybe less, to sync.
for p in 1 2 3 4; do fsck.ext4 -f /dev/md127p$p; done
# worked OK, and nothing in lost+found anywhere

Then booted in single-user mode, and used sys-archive/restore to compare my incremental backup from (it turns out) last Saturday; as far as I could see, all the changed files looked reasonable. Shame I couldn't pipe its output to "less", so the 6000 changes to firefox's cache scrolled past too fast to check individually :-)
_________________
Greybeard
NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 44921
Location: 56N 3W

Posted: Wed Aug 07, 2019 6:08 pm

Goverp,

The only thing I would have done differently is to assemble the raid read only and have a look around to assess the damage.
It's the raid set that is made read only, so the filesystem can't even replay the journal.
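
A rough sketch of what a read-only assembly could look like, using the device names from this thread. DRY_RUN and run are hypothetical helpers added so that the commands are only printed by default; --readonly and --readwrite are the mdadm options I believe apply here, but check your man page. Note that --force may still update the members' superblocks so that the array can start, so "read only" covers the array contents rather than the metadata.

```shell
#!/bin/sh
# Sketch only: DRY_RUN and run() are hypothetical, not part of mdadm.
# With DRY_RUN=1 (the default here) commands are printed, not executed.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Start the array read-only: no resync begins and no filesystem
# writes (not even a journal replay) can reach it.
assemble_readonly() {
    md=$1; shift
    run mdadm --assemble --force --readonly "$md" "$@"
}

# Once satisfied with what you see, switch the array to read-write.
make_readwrite() {
    run mdadm --readwrite "$1"
}

assemble_readonly /dev/md127 /dev/sda3 /dev/sdb3 /dev/sdc3 /dev/sdd3
make_readwrite /dev/md127
```

While the array is read-only, an ext4 filesystem with a dirty journal probably needs something like mount -o ro,noload to mount at all; that is an assumption about this particular setup rather than something tested here.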

It sounds like you were lucky.

Do you run with a write intent bitmap?
It's a very good thing and can be added at any time.

/proc/mdstat:
md127 : active raid5 sda6[0] sdd6[3] sdc6[2] sdb6[1]
      2912833152 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
      bitmap: 3/8 pages [12KB], 65536KB chunk
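
A small sketch of checking for the "bitmap:" line shown above from a script. has_bitmap is a made-up helper name, and the parsing assumes the /proc/mdstat layout in this post.

```shell
#!/bin/sh
# Hypothetical helper: reads /proc/mdstat-style text on stdin and succeeds
# if the stanza for the named array contains a "bitmap:" line.
has_bitmap() {
    awk -v md="$1" '
        $1 == md && $2 == ":"    { in_md = 1; next }   # start of our stanza
        /^[^ \t]/                { in_md = 0 }          # next unindented line ends it
        in_md && $1 == "bitmap:" { found = 1 }
        END { exit !found }
    '
}

# Usage (as root; adds an internal bitmap only if one is missing):
#   has_bitmap md127 < /proc/mdstat || mdadm --grow --bitmap=internal /dev/md127
```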

_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
Goverp
l33t


Joined: 07 Mar 2007
Posts: 804

Posted: Thu Aug 08, 2019 8:35 am

NeddySeagoon wrote:
...Do you run with a write intent bitmap?...
Yes. Bitmaps appear very useful - after the --re-add, /proc/mdstat estimated the resulting scan to take about an hour, but it actually only took 5 minutes, and I think that's down to the bitmap. Much of my home directory is gigabytes of photos and videos, and of course they don't change, so the corresponding bitmap entries will be clear, enabling the scan to skip them.

Given the performance impact of a scan on my machine, that's a big improvement. (It may be that changing the I/O scheduler to BFQ would reduce the impact, but with the default, the machine is definitely unusable until the scan is complete.)
_________________
Greybeard
Goverp
l33t


Joined: 07 Mar 2007
Posts: 804

Posted: Thu Aug 08, 2019 2:32 pm

The plot thickens!
I run /etc/init.d/mdadm. It's supposed to monitor the array; the relevant lines are
Code:
 mdadm --monitor --scan \
                --daemonise \
                --pid-file /var/run/mdadm.pid \
                ${MDADM_OPTS}

and I have MDADM_OPTS="--syslog" in /etc/conf.d/mdadm. So my intent was for problems detected by mdadm --monitor, such as running in degraded mode, to be logged. And if I set up syslog-ng correctly, they could be broadcast somewhere. But (a) my syslog is set to record, not broadcast, and (b) "mdadm --monitor --scan" exits immediately unless either --mail or --program is specified. So there's no monitor!
I can fix (a) - the right thing to do is get syslog-ng to issue wall or write on critical or above messages; KDE notifications can be configured to ring the bell and pop up a message.

(b) is, surprisingly, more complex; deleting "--scan" means mdadm will use mdadm.conf to decide what to monitor, which in my case is fine, and presumably it won't stop due to lack of program or mail. Or I could set a mail id or run a futile program. But none of those is perfect, 'cos mdadm --monitor only reports once, which may not be enough. "man mdadm" suggests running "mdadm --monitor --oneshot" in a cron job.

Probably one message per boot session is good enough, so I could either delete the --scan, or set --program (though the man page doesn't suggest what to set it to), or of course use --mail.
Oh, and reinstall sys-apps/util-linux with USE=+tty-helpers and reconfigure syslog-ng (or maybe change my logger - I'm beginning to think trailing-edge software might be better...)
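
For the cron route, a crude alternative to mdadm --monitor --oneshot is to look at /proc/mdstat directly. This is only a sketch: degraded_arrays is a made-up helper, and it assumes the usual mdstat layout where the status line ends in something like [UU_U], with an underscore for each missing member.

```shell
#!/bin/sh
# Hypothetical helper: reads /proc/mdstat-style text on stdin and prints
# the name of every array whose member map (e.g. [UU_U]) shows a gap.
degraded_arrays() {
    awk '
        $2 == ":" && $1 ~ /^md/ { name = $1; next }    # "md127 : active raid5 ..."
        name != "" && $NF ~ /^\[[U_]+\]$/ {            # "... [4/3] [UU_U]"
            if ($NF ~ /_/) print name
            name = ""
        }
    '
}

# Usage from a cron job (wall reads the message from stdin):
#   d=$(degraded_arrays < /proc/mdstat)
#   [ -z "$d" ] || echo "RAID degraded: $d" | wall
```

Unlike the monitor daemon, this nags on every cron run until the array is whole again, which may or may not be what you want.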
_________________
Greybeard
Goverp
l33t


Joined: 07 Mar 2007
Posts: 804

Posted: Fri Aug 09, 2019 7:24 am

Updated /etc/conf.d/mdadm to include
Code:
MDADM_OPTS="--syslog --mail root"
so in theory it should write a (critical) message to syslog if the array is degraded.
Reinstalled sys-apps/util-linux USE="+tty-helpers", so now I have wall and write. Added a simple script /usr/local/bin/wallWriter.sh
Code:
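#!/bin/sh

# Read syslog-ng messages line by line and broadcast each one.
# The loop ends when syslog-ng closes the pipe (read fails on EOF).
while IFS= read -r message
do
  wall -n "$message"
done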
and updated /etc/syslog-ng/syslog-ng.conf to include
Code:
destination wallDestination { program("/usr/local/bin/wallWriter.sh"); };
filter criticalFilter { "$LEVEL_NUM" < "3" };
log { source(systemSource); filter(criticalFilter); destination(wallDestination); };

Now a critical message gets broadcast, like I feel it should. It appears in a pop-up window on my desktop, and then disappears after about 5 seconds :evil: . Damned KDE knows better than me, and has no settings (in System settings->Notifications) to keep the message showing! Idiots!
_________________
Greybeard
All times are GMT