Gentoo Forums
[Solved] Dying RAID5 disk...
eccerr0r
Watchman


Joined: 01 Jul 2004
Posts: 7132
Location: almost Mile High in the USA

PostPosted: Wed Feb 01, 2017 7:31 am    Post subject: [Solved] Dying RAID5 disk...

I've been nursing a sick disk along, trying to milk as much life out of it as I can.

I have a hot spare in my 4+1 RAID5 setup; the array has failed over onto the spare on multiple occasions, and I've also manually failed the spare to copy data back onto the original disk. Unfortunately I'm just about out of disks and am using a dissimilar disk as the spare (my RAID is SATA, the spare is PATA, and at one point it was even FireWire-attached).

Alas, the disk is now really slow, and mdraid keeps rewriting the bad sectors. It's still working... very slowly, as it's retrying sectors like mad and successfully rewriting the ones that fail.

I wonder if I should try my luck failing this disk one last time and hope the other disks are still good, or wait until the disk dies for good and mdraid fails over to the spare one last time... and then look into setting up a new RAID to replace this system. Not sure I want to buy another 500G SATA drive just to reload the array.

(Backup is taking forever due to the rewrites whenever a read fails. I'm wondering if I should fail the bad disk so the backup runs faster, but there are still good sectors on the failing disk that could cover for bad spots on the other drives...)

Dilemma!

[SOLVED: Use --replace when you have a sick but not dead disk!]
_________________
Intel Core i7 2700K@ 4.1GHz/HD3000 graphics/8GB DDR3/180GB SSD
What am I supposed watching?


Last edited by eccerr0r on Wed Feb 01, 2017 10:34 am; edited 1 time in total
Akkara
Administrator


Joined: 28 Mar 2006
Posts: 6695
Location: &akkara

PostPosted: Wed Feb 01, 2017 8:15 am

Buy a new disk, size it according to what you were planning to use for your new array.

Put that disk in your spare slot (or in place of the going-bad one). Use it as a 500GB drive (or whatever your current size is) for now, and start a resilver going.

At some point, when you get your new RAID set up, re-partition it back up to its full size and make it your new spare.
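
Roughly like this, if it helps (just a sketch; /dev/sde, the partition bounds, and md1 are stand-ins for whatever your setup actually uses):
Code:
# carve a ~500GB partition out of the new, larger disk so it
# matches the existing members (the bounds here are illustrative)
parted /dev/sde mklabel gpt
parted /dev/sde mkpart primary 1MiB 500GiB

# add it to the array; the resilver starts right away if the
# array is degraded, otherwise it waits as a spare
mdadm /dev/md1 --add /dev/sde1

# watch the resilver
cat /proc/mdstat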
_________________
Many think that Dilbert is a comic. Unfortunately it is a documentary.
NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 43194
Location: 56N 3W

PostPosted: Wed Feb 01, 2017 10:04 am

eccerr0r,

Attach the new disk to the raid *before* you remove/fail the 'iffy' one.
That allows you to use the mdadm --replace command. Check the man page; I'm not 100% sure of the command name.

This uses all of the drives in the raid set to generate the data for the new drive before it fails the drive being replaced.
In a 5 spindle raid 5, it means that some combination of 4 out of 5 drives needs to be good for each block, independently.
If you remove the 'iffy' drive at the start, you are relying on the remaining drives being good everywhere.

You can do the mechanical rearrangement once the raid is up to strength.
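
From memory it goes something like the following; treat it as a sketch and check the man page, as the array and device names here are placeholders:
Code:
# add the new drive as a spare first
mdadm /dev/md0 --add /dev/sdY1

# mark the iffy drive for replacement; md rebuilds onto the spare
# while the iffy drive stays in service, and only fails it at the end
mdadm /dev/md0 --replace /dev/sdX1

# or name the target spare explicitly
mdadm /dev/md0 --replace /dev/sdX1 --with /dev/sdY1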
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
eccerr0r
Watchman


Joined: 01 Jul 2004
Posts: 7132
Location: almost Mile High in the USA

PostPosted: Wed Feb 01, 2017 10:32 am

Oh nice, yes, the --replace option is indeed what I need. I was thinking a feature like this would be helpful but had no idea it existed:
mdadm man page wrote:
Code:
       --replace
              Mark listed devices as requiring  replacement.   As  soon  as  a
              spare  is  available,  it  will  be rebuilt and will replace the
              marked device.  This is similar to marking a device  as  faulty,
              but the device remains in service during the recovery process to
              increase  resilience  against  multiple  failures.    When   the
              replacement process finishes, the replaced device will be marked
              as faulty.

I didn't want to outright mark it faulty, which is what I'd normally do before --remove'ing a drive, because that would leave me with more of a "lose all data" hole while the remaining disks run without redundancy.

This is exactly what I need, thanks!

(Clarification: After some disk failures I was down to a RAID5 4xSATA 500G array
sda sdb sdc sdd

I have a +1, a PATA 750G, which I partitioned just like the 500Gs.
/dev/sdf (added as spare -- mdadm --add /dev/md1 /dev/sdf2 )

/dev/sdd is the sick disk.

Using:
mdadm --replace /dev/md1 /dev/sdd2

initiated the rebuild onto /dev/sdf2 (the spare), with /dev/sdd2 to be kicked once the copy completes.

Not to ignore the other comments, and to explain the gap in drive letters: /dev/sde is a 2TB SATA disk with which I was planning to start building the new array, though I need to get more 2TB disks first. Also, something sort of unexpected, but I'm glad it happened: the RAID's speed went up when the replace-rebuild started, probably because md stopped touching the sick disk unless it was absolutely necessary.)
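
In case anyone wants to follow along, I'm just watching the replacement with the usual status interfaces:
Code:
# rebuild progress and per-member status
cat /proc/mdstat

# more detailed per-device view of the replacement
mdadm --detail /dev/md1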
_________________
Intel Core i7 2700K@ 4.1GHz/HD3000 graphics/8GB DDR3/180GB SSD
What am I supposed watching?
Goverp
l33t


Joined: 07 Mar 2007
Posts: 668

PostPosted: Wed Feb 01, 2017 11:55 am

Digression:
Neddy, adding the new disk before removing the bad one looks very sensible. Now, I have a spare disk for my 4-disk RAID5 array, though I don't need it at the moment. And even if I did, I don't have a 5th SATA socket on my motherboard, so I couldn't do as you suggest; instead I'd have to rely on my backups. But say I had the spare socket: would it have been better to have already installed the extra disk and used RAID6, so that when a drive went bad I could just swap in a new one and let RAID6 do its magic?
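
For what it's worth, I gather an existing RAID5 plus its spare can even be reshaped into RAID6 in place, something like the line below; untested by me, and md0 and the device count are just examples:
Code:
# reshape a 4-disk RAID5 with one spare into a 5-disk RAID6;
# the spare gets absorbed as the second parity device
mdadm --grow /dev/md0 --level=6 --raid-devices=5 \
      --backup-file=/root/md0-grow-backup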
_________________
Greybeard
NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 43194
Location: 56N 3W

PostPosted: Wed Feb 01, 2017 12:16 pm

Goverp,

I've never had the luxury of a 'hot spare'.
I'm not sure of the details of the switch-over. Drives in my raid sets have always been kicked out of the raid quickly, with no warnings in the log, so it's not clear that they could or would be used like a --replace.

That suggests raid 6 is the way to go if you want redundancy during the rebuild, or if you worry about two failures in quick succession. I've had that happen in a raid 5: two failures within 15 minutes. Even a hot spare would not protect against that; my rebuild time on that array is about 6 hours.

It's a balance between cost and downtime. If your backups are both current and validated and you can afford the downtime, you can save the cost of the extra redundancy.
Also, I'm not sure how much a 'hot spare' wears while it's a spare. It will be energised, power cycled and may be spun down. It will still be old (in years) when it's needed.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
eccerr0r
Watchman


Joined: 01 Jul 2004
Posts: 7132
Location: almost Mile High in the USA

PostPosted: Wed Feb 01, 2017 5:25 pm

Now that the thread is "[Solved]" go ahead and rant/comment about RAIDs and dead disks :)

---

Fortunately the drive that was sick had a very unusual signature: it actually still "works", in that it can always be written to, and a freshly written sector can be read back right away (other blocks may read just fine too) - but that still counts as a flaky disk. I don't think this is a common failure mode; most disks that fail on me suddenly stop being detected as valid disks at all, which makes them obvious candidates for kicking and removal without a second thought.

I had been running a hot spare for the longest time, but after the last failure I ran out of money for 500GB disks and stopped running one. The hot spare did save me once: it took me a week to notice a disk had failed, since I was too lazy to make md beep the console (or otherwise alert me) on a failure. After I removed the dead disk I ran the array without a hot spare.

That is, until this second disk flaked out a few months later. I added a 750G disk as a "temporary" hot spare and let the array fail over to it. Now I'm running a dissimilar RAID until I figure out what to do with the 2T disks.

The hot spare would indeed take wear while being "hot". Mine never spun down - either that or md was constantly testing it to make sure it hadn't died - so at least a hot spare should be past infant mortality by the time it's needed.

Perhaps I should be running raid6... nah, I think RAID5 is sufficient. Then again, the reason I went with RAID at all is that I hate recreating lost work from between when the backup was made and when the disk failed, even if it's just a day; RAID has closed that hole significantly. Before fully switching to the 2T disks for the primary RAID, I just need to make sure I have a good backup solution; I've been using another 2T disk to back up the 4x500G RAID...
_________________
Intel Core i7 2700K@ 4.1GHz/HD3000 graphics/8GB DDR3/180GB SSD
What am I supposed watching?
Goverp
l33t


Joined: 07 Mar 2007
Posts: 668

PostPosted: Thu Feb 02, 2017 11:15 am

NeddySeagoon wrote:
...
Also, I'm not sure how much a 'hot spare' wears while it's a spare. It will be energised, power cycled and may be spun down. It will still be old (in years) when it's needed.

Good point, and a similar point applies to the extra redundant disk in a RAID 6 setup.
_________________
Greybeard
NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 43194
Location: 56N 3W

PostPosted: Thu Feb 02, 2017 12:44 pm

Goverp,

That's why you run a repair or a check on a regular basis and keep an eye on logs for any read errors.

If you get read errors during a repair, the redundant data is recalculated and rewritten to the drive with the error.
That should force the drive to relocate the sector. If it works, that's OK; drives are supposed to work that way.
If a drive shows a non-zero pending sector count after a repair, the relocation failed for some reason internal to the drive, and you have missing redundant data at that location.
That drive needs to be replaced nowish, or even sooner.

A non-zero pending sector count is grounds for warranty replacement. It's not supposed to happen, ever.
Sectors are supposed to be relocated before they can't be read.
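
Kicking off a check or repair is just a sysfs write; something like this, assuming the array is md0:
Code:
# start a scrub; 'repair' rewrites mismatches instead of only counting them
echo check > /sys/block/md0/md/sync_action

# watch progress
cat /proc/mdstat

# number of mismatched blocks found by the last check
cat /sys/block/md0/md/mismatch_cnt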
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
eccerr0r
Watchman


Joined: 01 Jul 2004
Posts: 7132
Location: almost Mile High in the USA

PostPosted: Thu Feb 02, 2017 5:02 pm

This is what makes this particular instance different from other mdraid failures I have seen:

Code:
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0   
197 Current_Pending_Sector  0x0032   200   195   000    Old_age   Always       -       0   


This is after running badblocks on the disk... the pending sector count was non-zero before, and now it's zero...
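
For the record, this is more or less what I ran; -n is badblocks' non-destructive read-write mode, and the device name is from my setup:
Code:
# non-destructive read-write test over the whole disk (slow)
badblocks -nsv /dev/sdd

# then re-read the interesting SMART attributes
smartctl -A /dev/sdd | grep -E 'Reallocated|Pending'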
_________________
Intel Core i7 2700K@ 4.1GHz/HD3000 graphics/8GB DDR3/180GB SSD
What am I supposed watching?
NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 43194
Location: 56N 3W

PostPosted: Thu Feb 02, 2017 5:33 pm

eccerr0r,

That's odd. Nothing was relocated.
It suggests that badblocks was able to read and write the entire disk surface and that the HDD was happy with the writes.

I've not seen that before. I'm not sure I would trust that drive.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
eccerr0r
Watchman


Joined: 01 Jul 2004
Posts: 7132
Location: almost Mile High in the USA

PostPosted: Thu Feb 02, 2017 6:17 pm

Indeed, I don't trust this device anymore. It seems like the data on the disk has a half-life: you can read it back right away, or within a few minutes, but not reliably after that.
_________________
Intel Core i7 2700K@ 4.1GHz/HD3000 graphics/8GB DDR3/180GB SSD
What am I supposed watching?