Gentoo Forums
Raid 1 sync problem after replacing failed drive
kilburna
Tux's lil' helper


Joined: 20 Sep 2004
Posts: 107

Posted: Thu Feb 11, 2016 7:27 am    Post subject: Raid 1 sync problem after replacing failed drive

Hi

I have a RAID 1 in a small server, as shown below. After replacing the failed drive I re-added its partitions to md0 and md1, and both synced fine. But when I run mdadm /dev/md2 --add /dev/sdb3 and watch the progress with cat /proc/mdstat, the sync gets up to 92% and then silently stops. Every time I restart the server the sync starts again and silently fails at 92%.

I also tried echo check >> /sys/block/md2/md/sync_action to check for bad blocks, but judging by cat /proc/mdstat it does not seem to do anything.

Any pointers on how to proceed?

Code:

EBox ~ # cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdb2[1] sda2[0]
      6143936 blocks [2/2] [UU]

md2 : active raid1 sdb3[2](S) sda3[0]
      306393344 blocks [2/1] [U_]

md0 : active raid1 sdb1[1] sda1[0]
      32704 blocks [2/2] [UU]

unused devices: <none>


Code:

EBox ~ # mdadm --detail /dev/md2
/dev/md2:
        Version : 0.90
  Creation Time : Thu Oct 18 07:51:26 2012
     Raid Level : raid1
     Array Size : 306393344 (292.20 GiB 313.75 GB)
  Used Dev Size : 306393344 (292.20 GiB 313.75 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 2
    Persistence : Superblock is persistent

    Update Time : Thu Feb 11 18:20:21 2016
          State : clean, degraded
 Active Devices : 1
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 1

           UUID : f1348aae:8e9ffe0f:cb201669:f728008a
         Events : 0.42206763

    Number   Major   Minor   RaidDevice State
       0       8        3        0      active sync   /dev/sda3
       -       0        0        1      removed

       2       8       19        -      spare   /dev/sdb3


Regards
kilburna
kilburna
Tux's lil' helper


Joined: 20 Sep 2004
Posts: 107

Posted: Thu Feb 11, 2016 8:06 am    Post subject:

I ran dmesg and noticed this:

Code:

[ 4179.275418] ata1.00: failed command: READ FPDMA QUEUED
[ 4179.275424] ata1.00: cmd 60/00:78:80:08:d6/05:00:22:00:00/40 tag 15 ncq 655360 in
                        res 41/40:00:b2:09:d6/00:05:22:00:00/00 Emask 0x409 (media error) <F>
[ 4179.275435] ata1.00: status: { DRDY ERR }
[ 4179.275436] ata1.00: error: { UNC }
[ 4179.275438] ata1.00: failed command: READ FPDMA QUEUED
[ 4179.275441] ata1.00: cmd 60/00:80:80:0d:d6/05:00:22:00:00/40 tag 16 ncq 655360 in
                        res 41/04:00:b2:09:d6/00:00:22:00:00/00 Emask 0x1 (device error)
[ 4179.275443] ata1.00: status: { DRDY ERR }
[ 4179.275444] ata1.00: error: { ABRT }
[ 4179.275445] ata1.00: failed command: READ FPDMA QUEUED
[ 4179.275448] ata1.00: cmd 60/80:88:80:12:d6/04:00:22:00:00/40 tag 17 ncq 589824 in
                        res 41/04:00:b2:09:d6/00:00:22:00:00/00 Emask 0x1 (device error)
[ 4179.275450] ata1.00: status: { DRDY ERR }
[ 4179.275451] ata1.00: error: { ABRT }
[ 4179.275452] ata1.00: failed command: READ FPDMA QUEUED
[ 4179.275455] ata1.00: cmd 60/80:90:00:17:d6/00:00:22:00:00/40 tag 18 ncq 65536 in
                        res 41/04:00:b2:09:d6/00:00:22:00:00/00 Emask 0x1 (device error)
[ 4179.275456] ata1.00: status: { DRDY ERR }
[ 4179.275457] ata1.00: error: { ABRT }
[ 4179.275459] ata1.00: failed command: READ FPDMA QUEUED
[ 4179.275461] ata1.00: cmd 60/80:98:80:17:d6/00:00:22:00:00/40 tag 19 ncq 65536 in
                        res 41/04:00:b2:09:d6/00:00:22:00:00/00 Emask 0x1 (device error)
[ 4179.275463] ata1.00: status: { DRDY ERR }
[ 4179.275464] ata1.00: error: { ABRT }
[ 4179.301951] ata1.00: configured for UDMA/133
[ 4179.301977] sd 0:0:0:0: [sda] tag#15 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 4179.301982] sd 0:0:0:0: [sda] tag#15 Sense Key : Medium Error [current] [descriptor]
[ 4179.301986] sd 0:0:0:0: [sda] tag#15 Add. Sense: Unrecovered read error - auto reallocate failed
[ 4179.301997] sd 0:0:0:0: [sda] tag#15 CDB: Read(10) 28 00 22 d6 08 80 00 05 00 00
[ 4179.301999] blk_update_request: I/O error, dev sda, sector 584452530
[ 4179.302011] ata1: EH complete
[ 4181.875511] ata1.00: exception Emask 0x0 SAct 0x800 SErr 0x0 action 0x0
[ 4181.875515] ata1.00: irq_stat 0x40000008
[ 4181.875518] ata1.00: failed command: READ FPDMA QUEUED
[ 4181.875524] ata1.00: cmd 60/08:58:b0:09:d6/00:00:22:00:00/40 tag 11 ncq 4096 in
                        res 41/40:08:b2:09:d6/00:00:22:00:00/00 Emask 0x409 (media error) <F>
[ 4181.875527] ata1.00: status: { DRDY ERR }
[ 4181.875529] ata1.00: error: { UNC }
[ 4181.879253] ata1.00: configured for UDMA/133
[ 4181.879267] sd 0:0:0:0: [sda] tag#11 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 4181.879271] sd 0:0:0:0: [sda] tag#11 Sense Key : Medium Error [current] [descriptor]
[ 4181.879275] sd 0:0:0:0: [sda] tag#11 Add. Sense: Unrecovered read error - auto reallocate failed
[ 4181.879291] sd 0:0:0:0: [sda] tag#11 CDB: Read(10) 28 00 22 d6 09 b0 00 00 08 00
[ 4181.879293] blk_update_request: I/O error, dev sda, sector 584452530
[ 4181.879303] ata1: EH complete
[ 4181.879333] md/raid1:md2: sda: unrecoverable I/O read error for block 572096896
[ 4181.879359] md: md2: recovery interrupted.


This is probably why the sync is not completing: sda3 has an I/O read error. Are I/O errors something that echo check >> /sys/block/md2/md/sync_action should take care of?

Is there a way to force sdb3 to be added to the array?
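
A rough cross-check of where the sync stalls, offered as a sketch only. It assumes the "block" number in the md/raid1 message counts 512-byte sectors from the start of the array, and that /proc/mdstat reports the array size in 1 KiB blocks; neither unit is stated in this thread. The helper name sync_percent is made up for illustration.

```shell
#!/bin/sh
# Sketch: how far into the array does the failing md block sit?
# Assumption: $1 is in 512-byte sectors, $2 is in 1 KiB blocks.
sync_percent() {
    failed_sector=$1
    array_kib=$2
    # 1 KiB = 2 sectors, so the array holds array_kib * 2 sectors.
    echo $(( failed_sector * 100 / (array_kib * 2) ))
}

# Numbers taken from the dmesg and /proc/mdstat output above.
sync_percent 572096896 306393344
```

Under those assumptions this prints 93, close to the 92% at which the recovery keeps stopping, so the unreadable area on sda is a plausible cause of the stall.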
salahx
Guru


Joined: 12 Mar 2005
Posts: 437

Posted: Sat Feb 13, 2016 10:42 pm    Post subject:

Unfortunately it appears that md2 has only one "leg", /dev/sda3, since the other leg is just a spare. So the only "leg" you have is bad. I recommend you back up /dev/sda3. If you have just a small handful of bad sectors, you can repair them with hdparm. dmesg will give you the numbers of the bad sectors, so you can pass those to --write-sector and --read-sector:
Code:

hdparm --write-sector 584452530  /dev/sda
hdparm --read-sector 584452530 /dev/sda

This should force the drive to reallocate the sector. Repeat for each sector. Note that --write-sector is a destructive operation (it overwrites the sector with zeros), and hdparm will refuse to run it unless you confirm that this is really what you want by adding --yes-i-know-what-i-am-doing. Then repeat the sync procedure; if you run into more bad sectors, repeat the above with the new sector number until the sync completes.
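
To collect the sector numbers without reading through the whole log by eye, something like the following sketch may help. It assumes the kernel prints lines in the "I/O error, dev sda, sector N" form seen in the dmesg output earlier in this thread; the function name bad_sectors is made up for illustration.

```shell
#!/bin/sh
# Sketch: list the unique failing sectors for one disk from kernel log text.
# Reads log lines on stdin; $1 is the device name as the kernel prints it (e.g. sda).
bad_sectors() {
    dev=$1
    # Keep only the number after "sector" on matching lines, then de-duplicate.
    sed -n "s|.*I/O error, dev ${dev}, sector \([0-9][0-9]*\).*|\1|p" | sort -un
}
```

Typical use would be dmesg | bad_sectors sda, feeding each number it prints to the hdparm commands above one at a time.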

You can use smartctl to see how far gone the drive is; however, you should probably replace /dev/sda as soon as possible.
NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 43223
Location: 56N 3W

Posted: Sat Feb 13, 2016 11:30 pm    Post subject:

kilburna,

Your good drive has errors, as salahx says.
Code:
[ 4181.879275] sd 0:0:0:0: [sda] tag#11 Add. Sense: Unrecovered read error - auto reallocate failed

Unmount the 'good' drive and use ddrescue to image the good drive onto the replacement drive.
You must run ddrescue with a log file saved on a volume that is neither the source nor destination of the rescue.
The log is used by you to see what is happening and by ddrescue to resume or retry.

ddrescue tries very hard to get one last read out of bad sectors. If it works, it will recover your unreadable data.
If not, the data is gone.

Once ddrescue has done its stuff, replace the source drive.

The output of smartctl -a /dev/... would be good. emerge smartmontools if you need to.
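
A minimal sketch of what such a ddrescue run could look like, assuming GNU ddrescue (which calls the log file a "mapfile"). The function only prints the commands so they can be reviewed before anything is written. The name ddrescue_plan and the path /mnt/usb/rescue.map are hypothetical, and the devices must be checked carefully because -f allows overwriting the destination.

```shell
#!/bin/sh
# Sketch: print (not run) a two-pass GNU ddrescue plan for review.
# $1 = source device, $2 = destination device,
# $3 = map/log file on a volume that is neither source nor destination.
ddrescue_plan() {
    src=$1 dst=$2 map=$3
    # Pass 1: copy everything that reads cleanly, skipping the slow scraping phase.
    echo "ddrescue -f -n $src $dst $map"
    # Pass 2: return to the bad areas with direct disc access and up to 3 retry passes.
    echo "ddrescue -f -d -r3 $src $dst $map"
}

# Hypothetical paths: adjust the devices and keep the map file on a third disk.
ddrescue_plan /dev/sda /dev/sdb /mnt/usb/rescue.map
```

Because both passes share the same map file, the second pass only revisits the areas the first one could not read.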
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
kilburna
Tux's lil' helper


Joined: 20 Sep 2004
Posts: 107

Posted: Sun Feb 14, 2016 12:24 am    Post subject: Raid 1 sync problem after replacing failed drive [SOLVED]

Thanks salahx and NeddySeagoon for your assistance. I followed salahx's advice. There were 3 bad sectors on /dev/sda, but eventually the raid synced with /dev/sdb. After that I replaced /dev/sda. All fine.

As a small stat to report, I have had to replace 6 Seagate drives in the last 5 years, whereas I have not yet replaced any Hitachi drives used on other servers in the same period.

Would btrfs raid have fared any better than ext4+raid?

Thanks again
NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 43223
Location: 56N 3W

Posted: Sun Feb 14, 2016 12:44 am    Post subject:

kilburna,

You had hardware problems. Nothing can fix that. That's what backups are for.
Running a repair every month may provide an early warning of problems.
Keep an eye on the reallocated sector count.
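
One way to watch that counter, as a sketch: it assumes the usual ATA attribute table printed by smartctl -A, where the raw value is the tenth column, and the function name reallocated_count is made up for illustration.

```shell
#!/bin/sh
# Sketch: pull the raw Reallocated_Sector_Ct value out of `smartctl -A` output.
# Reads the attribute table on stdin; assumes the common 10-column ATA layout
# in which the raw value is the 10th field.
reallocated_count() {
    awk '$2 == "Reallocated_Sector_Ct" { print $10 }'
}
```

It would be used as smartctl -A /dev/sda | reallocated_count. For the monthly scrub, writing check or repair to /sys/block/md2/md/sync_action starts a pass over the array, and /sys/block/md2/md/mismatch_cnt can be read afterwards; check only reads and counts mismatches, while repair also rewrites them.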

I had two WD Greens in a 5 spindle raid5 set fail within 15 min of one another :(
I've had one Hitachi fail, all over the last 6 years.

Actually, the Hitachi still works but it has a large dead spot 10G from the start.
I use it for an image of my work laptop, that I boot in Virtual Box, so it's quite expendable.