Gentoo Forums
[s] raid10 “recovery aborted due to read error” drives spare
maiku
Guru

Joined: 24 Mar 2004
Posts: 575
Location: Long Island, NY

Posted: Fri Sep 11, 2015 4:17 pm    Post subject: [s] raid10 “recovery aborted due to read error” drives spare

I'm trying to recover my RAID10 array. All devices are plugged into a PCI controller that I've been using for a number of years now. I just replaced all of the drives in the system, as they all seemed to go bad one after another, slowly. So all of the drives should be new-ish.

I am trying to re-attach two drives into my RAID10 array, since I had to reboot with a live disc and the live disc only added 2 drives to my RAID10 and not all four. However, they are being added as spares so the RAID10 looks like this [U__U], a big droopy-eyed frowny face. This is exactly how it's making me feel. I am trying to recover it again, but it may inevitably fail.

Only two drives working out of 4 in a RAID10 seems rather dangerous. Will it never rebuild with that "bad sector"? Could it be the controller, the cable, or the drive going bad? Are there any tell-tale signs for each, so I can easily rule them out?

Current mdstat:

Quote:
md2 : active raid10 sdb1[4] sdc1[5] sdd1[3] sda1[0]
3907023872 blocks 64K chunks 2 near-copies [4/2] [U__U]
[=>...................] recovery = 8.1% (159186432/1953511936) finish=734.4min speed=40720K/sec
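
The finish estimate in that line can be reproduced by hand, which is a quick way to confirm how the numbers relate. A minimal Python sketch, assuming the two counts are in 1K blocks and the speed is in K/sec as printed; the function name is made up for illustration:

```python
def recovery_eta_minutes(done_kib, total_kib, speed_kib_per_s):
    """Minutes left for an md recovery, from the three numbers mdstat prints."""
    remaining_kib = total_kib - done_kib
    return remaining_kib / speed_kib_per_s / 60.0


# Values copied from the mdstat recovery line above.
done, total, speed = 159186432, 1953511936, 40720
eta = recovery_eta_minutes(done, total, speed)
percent = 100.0 * done / total
print(f"{percent:.1f}% done, about {eta:.1f} min to go")
```

The result agrees with the 8.1% and finish=734.4min that the kernel reported, so the estimate simply assumes the current speed holds for the rest of the device.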


dmesg drive failure

Quote:
[35402.197995] ata1.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
[35402.197999] ata1.00: irq_stat 0x00020002, device error via SDB FIS
[35402.198002] ata1.00: failed command: READ FPDMA QUEUED
[35402.198007] ata1.00: cmd 60/08:00:bf:90:2e/00:00:d2:00:00/40 tag 0 ncq 4096 in
res 41/40:00:bf:90:2e/00:00:d2:00:00/40 Emask 0x409 (media error) <F>
[35402.198009] ata1.00: status: { DRDY ERR }
[35402.198010] ata1.00: error: { UNC }
[35402.205650] ata1.00: configured for UDMA/100
[35402.205662] sd 0:0:0:0: [sda] Unhandled sense code
[35402.205663] sd 0:0:0:0: [sda]
[35402.205665] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[35402.205666] sd 0:0:0:0: [sda]
[35402.205667] Sense Key : Medium Error [current] [descriptor]
[35402.205669] Descriptor sense data with sense descriptors (in hex):
[35402.205670] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
[35402.205677] d2 2e 90 bf
[35402.205680] sd 0:0:0:0: [sda]
[35402.205681] Add. Sense: Unrecovered read error - auto reallocate failed
[35402.205683] sd 0:0:0:0: [sda] CDB:
[35402.205684] Read(10): 28 00 d2 2e 90 bf 00 00 08 00
[35402.205690] end_request: I/O error, dev sda, sector 3526267071
[35402.205700] md/raid10:md2: recovery aborted due to read error
[35402.205702] ata1: EH complete


The above appears a bunch of times, and then this follows:

dmesg RAID failure

Quote:
[35402.205751] md: md2: recovery done.
[35403.181610] RAID10 conf printout:
[35403.181613] --- wd:2 rd:4
[35403.181615] disk 0, wo:0, o:1, dev:sda1
[35403.181616] disk 1, wo:1, o:1, dev:sdb1
[35403.181617] disk 3, wo:0, o:1, dev:sdd1
[35403.196012] RAID10 conf printout:
[35403.196013] --- wd:2 rd:4
[35403.196014] disk 0, wo:0, o:1, dev:sda1
[35403.196015] disk 3, wo:0, o:1, dev:sdd1
[35403.196020] RAID10 conf printout:
[35403.196020] --- wd:2 rd:4
[35403.196021] disk 0, wo:0, o:1, dev:sda1
[35403.196022] disk 3, wo:0, o:1, dev:sdd1
[35403.196023] RAID10 conf printout:
[35403.196024] --- wd:2 rd:4
[35403.196025] disk 0, wo:0, o:1, dev:sda1
[35403.196026] disk 3, wo:0, o:1, dev:sdd1
[75020.846014] svc: 10.0.0.157, port=886: unknown version (4 for prog 100003, nfsd)
[75020.978949] svc: 10.0.0.157, port=787: unknown version (4 for prog 100003, nfsd)
[75021.280311] svc: 10.0.0.157, port=804: unknown version (4 for prog 100003, nfsd)
[82135.812388] nfsd: last server has exited, flushing export cache
[82377.922445] md/raid10:md2: Disk failure on sdb1, disabling device.
md/raid10:md2: Operation continuing on 2 devices.

_________________
Michael A. Leonetti
As warm as green tea


Last edited by maiku on Thu Sep 17, 2015 8:18 pm; edited 1 time in total
NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 43210
Location: 56N 3W

Posted: Fri Sep 11, 2015 5:03 pm    Post subject:

maiku,

Two drives is the minimum for raid10. That it's still working shows it's down to raid0.
Raid10 protects against all failures of one drive and some failures of two drives.
With two drives out, there is no redundant data left. In turn, if mdadm cannot read what's left, it won't rebuild.
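
To make "some failures of two drives" concrete: the mdstat line says "2 near-copies" on four devices. As far as I understand the near layout, each chunk then sits on two adjacent slots, so slots 0 and 1 mirror each other and so do slots 2 and 3. A rough Python sketch under that assumption (the helper names are illustrative, nothing here comes from mdadm itself):

```python
from itertools import combinations

NDEV = 4      # raid devices, from "[4/2]" in mdstat
COPIES = 2    # "2 near-copies"


def holders(chunk):
    """Device slots that hold the copies of one array chunk (near layout)."""
    first = chunk * COPIES
    return {(first + c) % NDEV for c in range(COPIES)}


def survives(failed):
    """True if every chunk still has at least one copy on a working slot."""
    # The placement pattern repeats, so checking NDEV chunks covers it.
    return all(holders(chunk) - set(failed) for chunk in range(NDEV))


for pair in combinations(range(NDEV), 2):
    print(pair, "survives" if survives(pair) else "data lost")
```

[U__U] means slots 1 and 2 are missing, which takes one drive from each mirror pair, so the array can keep running. Losing 0 and 1 together, or 2 and 3 together, would not be survivable.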

It may be the SATA data cables, the card, the drive or even your PSU.

Install smartmontools and post the output of
Code:
smartctl -a /dev/sd...
for each drive.
That will tell something of the internal status of each drive.

Code:
[35402.205681] Add. Sense: Unrecovered read error - auto reallocate failed
[35402.205690] end_request: I/O error, dev sda, sector 3526267071

That is a bad sign. It looks like the drive cannot read its own writing well enough to reallocate the sector at 3526267071.
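
The sector number in the end_request line can be cross-checked against the raw command bytes in the same dmesg output. A small Python sketch that decodes the Read(10) CDB, assuming the standard layout (opcode 0x28, big-endian LBA in bytes 2 to 5, transfer length in bytes 7 to 8) and 512-byte logical sectors; the function name is just for illustration:

```python
def decode_read10(cdb_hex):
    """Return (lba, block_count) from a SCSI Read(10) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a 10-byte Read(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")      # bytes 2..5: logical block address
    count = int.from_bytes(cdb[7:9], "big")    # bytes 7..8: transfer length in blocks
    return lba, count


# CDB copied from the sda error in the first post.
lba, count = decode_read10("28 00 d2 2e 90 bf 00 00 08 00")
print(lba, count, count * 512)
```

The decoded LBA matches the sector in the I/O error line, and eight 512-byte blocks match the "ncq 4096 in" on the failed command, so the kernel is reporting the first block of the request as the bad one.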

Do you have space to make an image of /dev/sda ?
A file would do but a HDD would be better.

I wonder what data is at sector 3526267071.
If it's unallocated space, you won't lose any data.
If it's a piece of a file, the file is lost.
If it's a piece of a directory, several files are lost.
If it's filesystem metadata ... it gets worse

The trick is to have that sector read one more time, so it gets relocated.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
frostschutz
Advocate

Joined: 22 Feb 2005
Posts: 2970
Location: Germany

Posted: Fri Sep 11, 2015 5:28 pm    Post subject:

A disk cannot relocate what it cannot read; you have to write something new to it. Thus, disks with relocated sectors usually mean disks that lost data. You should not trust such disks anymore, and particularly not keep them in a RAID. Make an image with ddrescue; if there are holes you might try to fill them with the data of the RAID mirror of that disk, if those disks still exist. If you're very lucky you might get away without data loss. If you're unlucky, I hope you have a good backup solution.

If you have many disks you should set up monitoring. Regular SMART self-tests. It should send you mail as soon as a problem is detected and you should react and replace faulty disks. RAID survival depends on healthy disks; if you don't notice broken disks or leave broken disks in the hope that the other disks will make it work...
maiku
Guru

Joined: 24 Mar 2004
Posts: 575
Location: Long Island, NY

Posted: Fri Sep 11, 2015 6:44 pm    Post subject:

Some new information: it seems like sdd is now going through a similar struggle. All of my drives can't just be bad.

Quote:
[86460.551072] ata4.00: failed command: READ FPDMA QUEUED
[86460.551076] ata4.00: cmd 60/00:e0:3f:cd:7f/04:00:15:00:00/40 tag 28 ncq 524288 in
res 13/1a:01:01:00:00/00:00:00:c0:13/00 Emask 0x2 (HSM violation)
[86460.551077] ata4.00: status: { ERR }
[86460.551078] ata4.00: error: { IDNF }
[86460.551080] ata4.00: failed command: READ FPDMA QUEUED
[86460.551083] ata4.00: cmd 60/00:e8:3f:d1:7f/04:00:15:00:00/40 tag 29 ncq 524288 in
res 13/1a:01:01:00:00/00:00:00:d0:13/00 Emask 0x2 (HSM violation)
[86460.551085] ata4.00: status: { ERR }
[86460.551086] ata4.00: error: { IDNF }
[86460.551087] ata4.00: failed command: READ FPDMA QUEUED
[86460.551091] ata4.00: cmd 60/00:f0:3f:d5:7f/04:00:15:00:00/40 tag 30 ncq 524288 in
res 13/1a:01:01:00:00/00:00:00:e0:13/00 Emask 0x2 (HSM violation)
[86460.551092] ata4.00: status: { ERR }
[86460.551093] ata4.00: error: { IDNF }
[86460.551097] ata4: hard resetting link
[86462.940025] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 0)
[86462.974743] ata4.00: configured for UDMA/100
[86462.974818] sd 3:0:0:0: [sdd] Unhandled sense code
[86462.974819] sd 3:0:0:0: [sdd]
[86462.974821] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[86462.974822] sd 3:0:0:0: [sdd]
[86462.974823] Sense Key : Hardware Error [current] [descriptor]
[86462.974825] Descriptor sense data with sense descriptors (in hex):
[86462.974826] 72 04 00 00 00 00 00 0c 00 0a 80 00 00 00 12 00
[86462.974831] 00 00 00 01
[86462.974833] sd 3:0:0:0: [sdd]
[86462.974835] Add. Sense: No additional sense information
[86462.974836] sd 3:0:0:0: [sdd] CDB:
[86462.974837] Read(10): 28 00 15 7f 92 bf 00 04 00 00
[86462.974842] end_request: I/O error, dev sdd, sector 360682175
[86462.974867] sd 3:0:0:0: [sdd] Unhandled sense code
[86462.974868] sd 3:0:0:0: [sdd]
[86462.974869] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[86462.974870] sd 3:0:0:0: [sdd]


Is it basically "time to get a new controller" time?
NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 43210
Location: 56N 3W

Posted: Fri Sep 11, 2015 7:02 pm    Post subject:

maiku,

Disconnect each SATA data cable in turn and reconnect it.
Poor quality SATA data cables can do this.
Do both ends.

I don't think sda is data cable related though.

While you are in the case, look at the power distribution to the drives.
No more than two drives should be on the same cable from the PSU.
maiku
Guru

Joined: 24 Mar 2004
Posts: 575
Location: Long Island, NY

Posted: Sat Sep 12, 2015 2:48 am    Post subject:

NeddySeagoon wrote:
Install smartmontools and post the output of
Code:
smartctl -a /dev/sd...
for each drive.
That will tell something of the internal status of each drive.
Reports this:
Code:
# smartctl -a /dev/sda
smartctl 5.42 2011-10-20 r3458 [x86_64-linux-3.5.7-gentoo] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green (Adv. Format)
Device Model:     WDC WD20EARS-00MVWB0
Serial Number:    WD-WMAZ20125146
LU WWN Device Id: 5 0014ee 6002d1d6b
Firmware Version: 50.0AB50
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Fri Sep 11 22:45:50 2015 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (36600) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x3035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   172   167   021    Pre-fail  Always       -       6375
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       58
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   041   041   000    Old_age   Always       -       43666
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       57
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       36
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       1423838
194 Temperature_Celsius     0x0022   123   102   000    Old_age   Always       -       27
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   197   196   000    Old_age   Always       -       1293
198 Offline_Uncorrectable   0x0030   200   199   000    Old_age   Offline      -       1
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   095   000    Old_age   Offline      -       24

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     43645         -
# 2  Short offline       Completed without error       00%     43621         -
# 3  Short offline       Completed without error       00%     43597         -
# 4  Short offline       Completed without error       00%     43573         -
# 5  Short offline       Completed without error       00%     43549         -
# 6  Short offline       Completed without error       00%     43525         -
# 7  Extended offline    Completed: read failure       20%     43508         3526234744
# 8  Short offline       Completed without error       00%     43501         -
# 9  Short offline       Completed without error       00%     43477         -
#10  Short offline       Completed without error       00%     43453         -
#11  Short offline       Completed without error       00%     43429         -
#12  Short offline       Completed without error       00%     43405         -
#13  Short offline       Completed without error       00%     43381         -
#14  Short offline       Completed without error       00%     43357         -
#15  Extended offline    Completed without error       00%     43341         -
#16  Short offline       Completed without error       00%     43334         -
#17  Short offline       Completed without error       00%     43310         -
#18  Short offline       Completed without error       00%     43286         -
#19  Short offline       Completed without error       00%     43262         -
#20  Short offline       Completed without error       00%     43238         -
#21  Short offline       Completed without error       00%     43214         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


Swapped quote to code tags above for easy reading - NeddySeagoon

NeddySeagoon wrote:
Do you have space to make an image of /dev/sda ?
A file would do but a HDD would be better.
I can make an image of it onto another disk. It keeps failing at the same sector. It just failed again. I'll try reconnecting the SATAs tomorrow, but this is likely a real bad sector error. Should I try copying my files, or is there a way I can somehow copy this disk completely to another disc and try to replace it in the RAID array?
russK
l33t

Joined: 27 Jun 2006
Posts: 630

Posted: Sat Sep 12, 2015 6:37 am    Post subject:

In years past I had decent luck recovering data with ddrescue, perhaps you can use ddrescue to make an image of the drive or partition.
NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 43210
Location: 56N 3W

Posted: Sat Sep 12, 2015 10:02 am    Post subject:

maiku,

Code:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   197   196   000    Old_age   Always       -       1293


That's a really bad sign. The drive knows about 1293 sectors that it would like to relocate but can't because it can't read them.
There may be more. That drive is scrap.
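
If you want to pull those counters out of the smartctl text automatically rather than by eye, something like the following Python sketch would do. It assumes the plain attribute table format shown above, and the choice of which attributes to watch is my own judgment call, not a smartmontools rule:

```python
import re

# Attribute rows copied from the smartctl output for sda.
SMART_ROWS = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   197   196   000    Old_age   Always       -       1293
198 Offline_Uncorrectable   0x0030   200   199   000    Old_age   Offline      -       1
"""

# ASSUMPTION: these three counters are the ones worth flagging when non-zero.
WATCHED = {"Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable"}


def raw_values(table):
    """Map attribute name -> integer RAW_VALUE for each row of the table."""
    values = {}
    for line in table.splitlines():
        # Rows whose raw value is not a plain integer are simply skipped.
        m = re.match(r"\s*\d+\s+(\S+)\s+0x[0-9a-f]{4}\s.*\s(\d+)\s*$", line)
        if m:
            values[m.group(1)] = int(m.group(2))
    return values


def suspicious(table):
    """Watched attributes whose raw count is above zero."""
    return {name: raw for name, raw in raw_values(table).items()
            if name in WATCHED and raw > 0}


print(suspicious(SMART_ROWS))
```

For sda this flags the pending and offline-uncorrectable counts while the reallocated count is still zero, which is the pattern described above: sectors the drive wants to move but has not been able to.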

Ewww ...
Code:
Model Family:     Western Digital Caviar Green
I have five of those in a raid5. Two failed within 15 min of one another at 9 months old.
I was lucky. I lost one data block that was either spare or in the middle of my DVD collection.
I didn't check your warranty status, I'll leave that for you. I got two warranty replacements.

Get ddrescue and try to make an image onto another drive. If your drive is still in warranty, WD will send you a replacement before you return the dud.
ddrescue will try very hard to read your data and you only need one more read so it can write the image.
Be sure to write the log file. ddrescue need not be run all in one go and it will read the log to see what it's done.

Depending on the actual problem in the drive, it can help to operate the defective drive upside down, on each edge in turn and any angles you like in between.
The idea is that if you have worn bearings and things are moving around, they will line up one more time.
If ddrescue recovers all your data the Current_Pending_Sector will be zero and the drive will look healthy again.
You need to use the smart data above to demonstrate that the drive is faulty if you qualify for a warranty replacement.

A ddrescue log file looks like
Code:
# Rescue Logfile. Created by GNU ddrescue version 1.15
# Command line: ddrescue -M -d -b 4096 -r 16 -f /dev/sde /dev/sda /root/copy_log.txt
# current_pos  current_status
0x1844FF15000     +
#      pos        size  status
0x00000000  0x1800933D000  +
0x1800933D000  0x00001000  -
0x1800933E000  0x8FBFA000  +
0x18098F38000  0x00001000  -
0x18098F39000  0x3B6FD9000  +
0x1844FF12000  0x00004000  -
0x1844FF16000  0x4D71200000  +


The + means the data has been recovered.
The - means that the area could not be read (yet) and there is a hole in the image.
Values are in bytes, so there are six disk blocks (each block is 4k) still to recover.
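
The block count can be checked by adding up the size column for each status. A short Python sketch, assuming the three-column pos/size/status layout shown in the log above (the function name is mine):

```python
# Data lines copied from the example log above.
LOG = """\
# Rescue Logfile. Created by GNU ddrescue version 1.15
# current_pos  current_status
0x1844FF15000     +
#      pos        size  status
0x00000000  0x1800933D000  +
0x1800933D000  0x00001000  -
0x1800933E000  0x8FBFA000  +
0x18098F38000  0x00001000  -
0x18098F39000  0x3B6FD9000  +
0x1844FF12000  0x00004000  -
0x1844FF16000  0x4D71200000  +
"""


def bytes_by_status(log_text):
    """Sum the size column of a ddrescue log for each status character."""
    totals = {}
    for line in log_text.splitlines():
        fields = line.split()
        # Data lines have three fields: pos, size, status. Comment lines and
        # the two-field current_pos line are skipped.
        if len(fields) != 3 or line.startswith("#"):
            continue
        _pos, size, status = fields
        totals[status] = totals.get(status, 0) + int(size, 16)
    return totals


totals = bytes_by_status(LOG)
bad_bytes = totals.get("-", 0)
print(bad_bytes, "bytes unread =", bad_bytes // 4096, "blocks of 4k")
```

The three "-" entries add up to 0x6000 bytes, which is where the six 4k blocks come from. Re-running this against the log as ddrescue retries shows whether the holes are shrinking.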
maiku
Guru

Joined: 24 Mar 2004
Posts: 575
Location: Long Island, NY

Posted: Sat Sep 12, 2015 1:24 pm    Post subject:

It doesn't help that my western digital drive is now also doing the same thing! I think I need to start praying.


NeddySeagoon wrote:
A ddrescue log file looks like
Code:
# Rescue Logfile. Created by GNU ddrescue version 1.15
# Command line: ddrescue -M -d -b 4096 -r 16 -f /dev/sde /dev/sda /root/copy_log.txt


Should I be using ddrescue to mark the sectors as good or ddrescue to make an image of the failing drives? Which is better?

Code:
# smartctl -a /dev/sdd
smartctl 5.42 2011-10-20 r3458 [x86_64-linux-3.5.7-gentoo] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green (Adv. Format)
Device Model:     WDC WD20EARS-00MVWB0
Serial Number:    WD-WMAZ20147675
LU WWN Device Id: 5 0014ee 6002d2051
Firmware Version: 50.0AB50
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Sat Sep 12 09:23:13 2015 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 118) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline
data collection:                (36000) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x3035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   199   199   051    Pre-fail  Always       -       1150
  3 Spin_Up_Time            0x0027   173   168   021    Pre-fail  Always       -       6316
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       66
  5 Reallocated_Sector_Ct   0x0033   199   199   140    Pre-fail  Always       -       2
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   049   049   000    Old_age   Always       -       37713
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       64
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       38
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       1437963
194 Temperature_Celsius     0x0022   121   105   000    Old_age   Always       -       29
196 Reallocated_Event_Count 0x0032   198   198   000    Old_age   Always       -       2
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       94
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       71
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   184   000    Old_age   Offline      -       10

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       60%     37710         1856004768
# 2  Short offline       Completed without error       00%     37684         -
# 3  Short offline       Completed without error       00%     37661         -
# 4  Short offline       Completed without error       00%     37637         -
# 5  Short offline       Completed without error       00%     37613         -
# 6  Short offline       Completed without error       00%     37589         -
# 7  Short offline       Completed without error       00%     37565         -
# 8  Extended offline    Completed: read failure       60%     37544         1856004768
# 9  Short offline       Completed without error       00%     37541         -
#10  Short offline       Completed without error       00%     37517         -
#11  Short offline       Completed without error       00%     37493         -
#12  Short offline       Completed without error       00%     37469         -
#13  Short offline       Completed without error       00%     37445         -
#14  Short offline       Completed without error       00%     37421         -
#15  Short offline       Completed without error       00%     37397         -
#16  Extended offline    Completed: read failure       60%     37376         1856004768
#17  Short offline       Completed without error       00%     37373         -
#18  Short offline       Completed without error       00%     37349         -
#19  Short offline       Completed without error       00%     37325         -
#20  Short offline       Completed without error       00%     37301         -
#21  Short offline       Completed without error       00%     37277         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


Swapped quote to code tags above for easy reading - NeddySeagoon
NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 43210
Location: 56N 3W

Posted: Sat Sep 12, 2015 1:48 pm    Post subject:

maiku,

Use ddrescue to make an image of the drive.
As a consequence of making the image, any time ddrescue reads one of the blocks the drive wants to relocate, the drive will relocate it to a spare sector.
Every time that happens, the
Code:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
Reallocated_Sector_Ct value will increase.

If Current_Pending_Sector count ever gets to zero, your data has been completely recovered, until the next time.
Replace the drive, since you don't know when that will be.

sdd is sick in the same way.
Code:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   199   199   140    Pre-fail  Always       -       2
196 Reallocated_Event_Count 0x0032   198   198   000    Old_age   Always       -       2
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       94


Your warranty on both drives expired on 6/26/2013. If you are in the USA anyway.
maiku
Guru

Joined: 24 Mar 2004
Posts: 575
Location: Long Island, NY

Posted: Sat Sep 12, 2015 8:14 pm    Post subject:

I'm running ddrescue right now. It's doing its thing. I have a stupid question perhaps. Will there be any way to tell which files were potentially corrupted by the bad sectors? I would just rather find out sooner than later. I don't want to be showing my kids wedding photos and then all of a sudden my wife's favourite photos are corrupt (time reference: just got married, no kids yet).
NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 43210
Location: 56N 3W

Posted: Sat Sep 12, 2015 8:38 pm    Post subject:

maiku,

The simple answer is yes, but it's not easy.
For extX filesystems, you can find out what sectors belong to which i-node. That's not too bad on a single disk setup.
You need to take account of how things are allocated through the raid layer too. That's harder.
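
As a rough idea of what the raid layer adds, here is a Python sketch that maps a byte offset on one member to a byte offset in the md array. It assumes the near layout with the chunk size and copy count from the mdstat output earlier in the thread, and it assumes a data offset of zero inside the member partition, which depends on the superblock version and should be checked. The LVM and LUKS layers on top add their own offsets, which this does not cover. All names are illustrative:

```python
CHUNK = 64 * 1024   # "64K chunks" from mdstat
NDEV = 4            # raid devices
COPIES = 2          # "2 near-copies"
DATA_OFFSET = 0     # ASSUMPTION: where md data starts inside the member partition


def array_offset(slot, member_offset):
    """Map a byte offset on one raid10 member to a byte offset in the md array.

    slot is the member's position in the array (0..NDEV-1). Assumes the near
    layout, where copy c of chunk k sits in position k*COPIES + c, counted
    row by row across the devices.
    """
    data = member_offset - DATA_OFFSET
    row, within = divmod(data, CHUNK)
    position = row * NDEV + slot
    chunk = position // COPIES
    return chunk * CHUNK + within


# Slots 0 and 1 are mirrors, so the same member offset gives the same array offset.
print(array_offset(0, CHUNK + 10), array_offset(1, CHUNK + 10))
# Slot 3 holds the odd-numbered chunks.
print(array_offset(3, 0))
```

Note that the sector numbers in dmesg are relative to the whole disk, so the partition start has to be subtracted first to get the member offset.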

Once ddrescue is done, if it's not recovered everything, you can try copying all the files off the raid set.
That will only succeed if no files have unreadable sectors. So if that works, the holes in your disk image are in unallocated space.

Oh, ddrescue won't stop until it has recovered everything. Keep an eye on the log.
When there are no lines ending in ?, it's got all the easy to read data and is doing retries.
Then its time to move the drive around a bit.

Keep an eye on the Current_Pending_Sector count too. It can go up as well as down.

If you know the lost data was written a long time ago and never changed, it may still be on the part of the mirror that's been dropped out of the raid.
You may be able to fill in the holes in the image with that drive and a hex editor. Again, it depends on knowing how the data is laid out on the raid.
You can test by making comparisons with data blocks you did recover.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
maiku
Guru


Joined: 24 Mar 2004
Posts: 575
Location: Long Island, NY

PostPosted: Sat Sep 12, 2015 8:51 pm    Post subject: Reply with quote

I forgot to mention, the data on the RAID array is part of an LVM volume which is encrypted using LUKS. If I don't recover the missing 200 or so sectors, will this make everything moot? Should I start trying to copy files instead? The sda drive keeps going missing and I keep having to restart ddrescue. I need to pay attention to those SMART e-mails I've been getting.
_________________
Michael A. Leonetti
As warm as green tea
NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 43210
Location: 56N 3W

PostPosted: Sat Sep 12, 2015 9:19 pm    Post subject: Reply with quote

maiku,

I'm not sure I fully understand the layering.

You have four drives with partitions in raid10.
On top of the raid10 you have LVM.
Some or all of the logical volumes are then encrypted with LUKS.

You won't lose any more than you have already. I am not very familiar with LUKS.
I think it uses a block cipher, in which case you will lose (have lost) whole cipher blocks.
I don't know the cipher block size.

Raid10 is raid0 over two raid1s or raid1 over two raid0s. I forget which.
This means in a healthy raid10 you have two identical copies of everything.
You *may* be able to fill the holes with the old out of date data but with LUKS, it's more difficult.
If you are really lucky and the cipher block size is less than the drive block size, the use of LUKS just means you won't be able to read anything in a hex editor but it won't stop sector compares.

sda is not part of a mounted filesystem while you are running ddrescue, is it?
If so, start again. ddrescue needs the drive to itself.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
maiku
Guru


Joined: 24 Mar 2004
Posts: 575
Location: Long Island, NY

PostPosted: Sat Sep 12, 2015 11:21 pm    Post subject: Reply with quote

You are correct in understanding the layering.

1) It's a RAID10, which is a RAID0 of RAID1s. This is RAID volume md2.
2) md2 is part of a logical volume group with md3, which is another RAID of mirrors. md3 is healthy. md2 (the RAID10) is the one with issues. This logical volume is called "blob".
3) Blob is then encrypted with LUKS to produce the logical volume "crypt-blob".
4) crypt-blob is formatted ext4.

/dev/sda, the really bad drive in question, is part of md2, but currently the logical volume is stopped and the RAID is stopped as well. Nothing on these drives is mounted. The current stats read:
Quote:
GNU ddrescue 1.18.1
About to copy 2000 GBytes from /dev/sdg to /dev/sdb
Starting positions: infile = 0 B, outfile = 0 B
Copy block size: 16 sectors Initial skip size: 16 sectors
Sector size: 4096 Bytes

Press Ctrl-C to interrupt
Initial status (read from logfile)
rescued: 1082 GB, errsize: 7012 kB, errors: 99

Current status
rescued: 1103 GB, errsize: 7077 kB, current rate: 40435 kB/s
ipos: 1103 GB, errors: 100, average rate: 46564 kB/s
opos: 1103 GB, run time: 7.41 m, successful read: 0 s ago
Copying non-tried blocks... Pass 1 (forwards)
So you're saying that ddrescue will attempt to get all of the errored sectors ad infinitum and I'll have to manually Ctrl-C out of it at one point?
_________________
Michael A. Leonetti
As warm as green tea


Last edited by maiku on Sun Sep 13, 2015 6:57 pm; edited 1 time in total
NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 43210
Location: 56N 3W

PostPosted: Sun Sep 13, 2015 12:20 am    Post subject: Reply with quote

maiku,

Quote:
So you're saying that ddrescue will attempt to get all of the errored sectors ad infinitum and I'll have to manually Ctrl-C out of it at one point

Yes. ddrescue has several ways to sneak up on hard to read sectors. How hard it tries depends on the command line.
I would expect you to need at least six goes as you move the drive around.

As long as you keep giving ddrescue the same input, output and log files, it will only retry unread areas.

As the output says
Quote:
Copying non-tried blocks... Pass 1 (forwards)
ddrescue is presently working like dd but not stopping on errors.
It will read the working blocks first, then come back to the errors. It's skipping them just now.
It will get a lot slower.

Read the ddrescue log. When every line ends with either a + (recovered) or a - (unread), you may as well stop ddrescue, but you are not done yet.
Turn the drive over/on edge and tell ddrescue to try again. From memory, second and subsequent runs need some extra parameters.

Read the man page: --try-again, --retry-passes= and --retrim all look useful.
It seems you don't get any retries now unless you ask for them.

The unreadable sector count was 1293, and each sector is 4k.
In the log, each 4k sector is 0x1000 bytes.
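
Checking whether every line ends in + or - is easy to script. What follows is a minimal sketch, assuming the log layout of comment lines starting with #, one "current position" line, then "pos size status" lines. The function names and the sample log are made up for illustration and are not output from any drive in this thread.

```python
from collections import Counter

# Hypothetical sample log, invented for this example.
SAMPLE_LOG = """\
# Rescue Logfile. Created by GNU ddrescue version 1.18.1
# current_pos  current_status
0x00300000     ?
#      pos        size  status
0x00000000  0x00100000  +
0x00100000  0x00001000  -
0x00101000  0x001FF000  +
0x00300000  0x00D00000  ?
"""


def tally_statuses(log_text):
    """Count the data blocks in a ddrescue log by their status character."""
    counts = Counter()
    seen_current_line = False
    for raw in log_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if not seen_current_line:
            # The first non-comment line is the current position, not a block.
            seen_current_line = True
            continue
        fields = line.split()
        if len(fields) != 3:
            continue
        counts[fields[2]] += 1
    return counts


def only_plus_and_minus(counts):
    """True when every block is either + (recovered) or - (unread)."""
    return all(status in ("+", "-") for status in counts)


counts = tally_statuses(SAMPLE_LOG)
print(dict(counts), only_plus_and_minus(counts))
```

For the sample above, one block still ends in ?, so only_plus_and_minus returns False and the first pass is not finished.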
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
maiku
Guru


Joined: 24 Mar 2004
Posts: 575
Location: Long Island, NY

PostPosted: Sun Sep 13, 2015 12:57 am    Post subject: Reply with quote

ddrescue finished on sdd. There were "3" errors. It just said "finished." This was the output.
Quote:
# Rescue Logfile. Created by GNU ddrescue version 1.18.1
# Command line: ddrescue -M -d -b 4096 -r 16 -f -v /dev/sdd /dev/sdc sdd.recover.logfile
# Start time: 2015-09-12 10:19:47
# Current time: 2015-09-12 20:34:26
# Finished
# current_pos current_status
0xE57FE3D000 +
# pos size status
0x00000000 0xDD40C54000 +
0xDD40C54000 0x00002000 -
0xDD40C56000 0x3F5FE000 +
0xDD80254000 0x00001000 -
0xDD80255000 0x7FFBE8000 +
0xE57FE3D000 0x00001000 -
0xE57FE3E000 0xEC412D8000 +
ddrescue output
Quote:
# ddrescue -M -d -b 4096 -r 16 -f -v /dev/sdd /dev/sdc sdd.recover.logfile


GNU ddrescue 1.18.1
About to copy 2000 GBytes from /dev/sdd to /dev/sdc
Starting positions: infile = 0 B, outfile = 0 B
Copy block size: 16 sectors Initial skip size: 16 sectors
Sector size: 4096 Bytes

Press Ctrl-C to interrupt
rescued: 2000 GB, errsize: 16384 B, current rate: 0 B/s
ipos: 985693 MB, errors: 3, average rate: 54242 kB/s
opos: 985693 MB, run time: 10.24 h, successful read: 3.01 m ago
Finished
Does this mean that there are still bad sectors?

Edit
I can see that it does. I re-ran the program. There are still 3 bad sectors. I want to get the drives out to try them in different directions. However, if I leave any of the drives plugged in when the computer starts up, it'll try and recreate the RAID and I'm sure start overwriting stuff. I can't unplug them now that the computer is on because they are all in the same caddy as the good drives. I have to unscrew and take them all out and I don't want to do that while they are all running.

I can try leaving them unplugged, turning on the machine, and then plugging them in one at a time, praying that mdadm does not automatically make them into a RAID array when they get detected.
_________________
Michael A. Leonetti
As warm as green tea
NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 43210
Location: 56N 3W

PostPosted: Sun Sep 13, 2015 10:41 am    Post subject: Reply with quote

maiku,

ddrescue got it all but four sectors.
Code:
0xDD40C54000 0x00002000 -
0xDD80254000 0x00001000 -
0xE57FE3D000 0x00001000 -


You should find that the
Code:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
is no longer zero and
Code:
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -     1293
is now 4.
It might not be, as the drive may have decided it doesn't need to reallocate some of the sectors it's read.
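
If you want to track those two counters between ddrescue runs, the attribute table can be picked apart with a few lines of Python. This is only a sketch: the function name is invented, and it assumes the ten-column table layout shown in the two Code boxes above, with RAW_VALUE as the last column.

```python
# The two attribute rows quoted above, with the table header.
ATTRIBUTE_TABLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -     1293
"""


def smart_raw_values(attribute_table, wanted=(5, 197)):
    """Return {attribute name: raw value} for the wanted attribute IDs."""
    values = {}
    for line in attribute_table.splitlines():
        fields = line.split()
        # Data rows start with a numeric ID and have at least ten columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        if int(fields[0]) in wanted:
            values[fields[1]] = int(fields[9])  # tenth column is RAW_VALUE
    return values


print(smart_raw_values(ATTRIBUTE_TABLE))
```

Saving the result after each run makes it easy to see whether Current_Pending_Sector is going up or down.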

These sectors are well down the drive. 0xDD40C54000 is the start address of the first error.
Depending on the data layout on the drive and how full the raid is, they may be empty.
LUKS won't be happy with that, since empty space is encrypted on the drive. Getting them back will make life easier.

Converting hex 0xDD40C54000 to decimal is left as an exercise for the reader :)
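
For anyone who would rather not do the exercise by hand, here is a small sketch that converts the three unread areas from the log to decimal byte offsets and 4k sector numbers. The sector size comes from the "Sector size: 4096 Bytes" line in the ddrescue output. The function and variable names are just for illustration.

```python
SECTOR_SIZE = 4096  # from "Sector size: 4096 Bytes" in the ddrescue output

# (pos, size) of the three unread areas, exactly as they appear in the log.
BAD_AREAS = [
    ("0xDD40C54000", "0x00002000"),
    ("0xDD80254000", "0x00001000"),
    ("0xE57FE3D000", "0x00001000"),
]


def describe_areas(areas, sector_size=SECTOR_SIZE):
    """Convert hex (pos, size) pairs to byte offsets and sector counts."""
    rows = []
    for pos_text, size_text in areas:
        pos = int(pos_text, 16)    # int() accepts the 0x prefix with base 16
        size = int(size_text, 16)
        rows.append({
            "byte_offset": pos,
            "first_sector": pos // sector_size,
            "sector_count": size // sector_size,
        })
    return rows


rows = describe_areas(BAD_AREAS)
for row in rows:
    print(row)
print("total unread sectors:", sum(row["sector_count"] for row in rows))
```

The first area starts at byte 950274441216, roughly 950 GB into the 2000 GB drive, and the three areas together add up to the four unread 4k sectors.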
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
maiku
Guru


Joined: 24 Mar 2004
Posts: 575
Location: Long Island, NY

PostPosted: Sun Sep 13, 2015 12:56 pm    Post subject: Reply with quote

I think LUKS will be less thrilled that sda currently has 176 errors totaling over 12MB. If it's all stuff I can get back (i.e. any YouTube video I have downloaded or something) I couldn't care less. But even as I do the recovery my RAID is secretly being detected.
Quote:
# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath]
md2 : inactive sdc1[3](S) sda1[0](S) sdb[5](S)
5860538368 blocks
It's inactive so it's not trying to operate the drives and ruin my recovery, is it? I still have about 80 GB left to go.

Edit
A miracle happened!

Look at this! sda finished with this status!
Quote:
ddrescue -M -d -b 4096 -r 16 -f -v /dev/sdg /dev/sdb sda.recover.logfile


GNU ddrescue 1.18.1
About to copy 2000 GBytes from /dev/sdg to /dev/sdb
Starting positions: infile = 0 B, outfile = 0 B
Copy block size: 16 sectors Initial skip size: 16 sectors
Sector size: 4096 Bytes

Press Ctrl-C to interrupt
Initial status (read from logfile)
rescued: 1906 GB, errsize: 12451 kB, errors: 175

Current status
rescued: 2000 GB, errsize: 4096 B, current rate: 0 B/s
ipos: 1977 GB, errors: 1, average rate: 44295 kB/s
opos: 1977 GB, run time: 35.51 m, successful read: 3 s ago
Finished
That means only one bad block, right? I just have a little more to recover! YESSSSSSSS!
_________________
Michael A. Leonetti
As warm as green tea
NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 43210
Location: 56N 3W

PostPosted: Sun Sep 13, 2015 2:04 pm    Post subject: Reply with quote

maiku,

Yes, just one more block to go on sda.
The log will tell you where it is.

Code:
md2 : inactive sdc1[3](S) sda1[0](S) sdb[5](S)

That's three spare drives in md2.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
maiku
Guru


Joined: 24 Mar 2004
Posts: 575
Location: Long Island, NY

PostPosted: Sun Sep 13, 2015 6:56 pm    Post subject: Reply with quote

Well, it seems that we are all done. I rotated the drive as much as possible. I'm not sure what could have been corrupt. I ran fsck on the volume and it gave no errors. Nor does it bother to make a big deal of anything. Since the RAID array is also part of a LV, I'm not really sure how to map the sectors that couldn't copy to the files themselves. The fsck output is:
Quote:
fsck -f -v /dev/mapper/crypt-blob
fsck from util-linux 2.25.2
e2fsck 1.42.12 (29-Aug-2014)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

318551 inodes used (0.23%, out of 139616256)
5630 non-contiguous files (1.8%)
127 non-contiguous directories (0.0%)
# of inodes with ind/dind/tind blocks: 0/0/0
Extent depth histogram: 317169/1339/14
871834898 blocks used (78.06%, out of 1116910080)
0 bad blocks
218 large files

303388 regular files
15133 directories
2 character device files
0 block device files
0 fifos
0 links
17 symbolic links (17 fast symbolic links)
2 sockets
------------
318542 files
I guess maybe some of the free space was part of the bad sectors? I'm really not sure. One day I may go to open something old and notice that it's broken. I guess this means I'm done until I get the replacement drives?
_________________
Michael A. Leonetti
As warm as green tea
frostschutz
Advocate


Joined: 22 Feb 2005
Posts: 2970
Location: Germany

PostPosted: Sun Sep 13, 2015 7:22 pm    Post subject: Reply with quote

If you have access to your files, now would be the time to make backups.

If it's ext*, I think you can use debugfs to find which files belong to a given block #. You'll have to include the raid metadata offset and the LUKS header offset in your calculations.
maiku
Guru


Joined: 24 Mar 2004
Posts: 575
Location: Long Island, NY

PostPosted: Sun Sep 13, 2015 11:26 pm    Post subject: Reply with quote

I'm trying to figure out how I could possibly do that and what math would be involved. First, the RAID is RAID10 so it's striped. Then it's in an LV with other drives, but I think it was the first one in the group. Then LUKS is on top of that. Not even sure how that would work.
_________________
Michael A. Leonetti
As warm as green tea
frostschutz
Advocate


Joined: 22 Feb 2005
Posts: 2970
Location: Germany

PostPosted: Sun Sep 13, 2015 11:34 pm    Post subject: Reply with quote

You can check the LUKS header with cryptsetup luksDump. The offset used to be an odd number, but recently created LUKS containers simply use a 2MiB offset.

LVM has a 1MiB offset, and with lvs -o +devices or something like that (I don't quite remember right now) you should see which LV is where.

The MD data offset is in mdadm --examine.
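
Putting those offsets together, the arithmetic might look something like the sketch below. Treat it as a rough illustration under loud assumptions, not a verified recipe for this array: it assumes a raid10 "near" layout where the device count is a multiple of the number of copies (4 devices, 2 near-copies, 64K chunks, as mdstat reports for md2), that the logical volume is linear and starts at the very first extent of this PV, and that the ext4 block size is 4096. Every number in EXAMPLE_LAYOUT is a placeholder to be replaced with what fdisk, mdadm --examine, lvs and cryptsetup luksDump actually report. All names here are invented.

```python
from dataclasses import dataclass

KIB = 1024
MIB = 1024 * KIB


@dataclass
class Layout:
    partition_start: int      # bytes from start of disk to the raid partition
    md_data_offset: int       # bytes, from mdadm --examine on the member
    chunk_size: int           # raid chunk size in bytes
    raid_devices: int         # number of devices in the raid10
    near_copies: int          # number of near-copies
    lvm_data_start: int       # bytes from start of the PV to the first extent
    luks_payload_offset: int  # bytes from start of the LV to encrypted data
    fs_block_size: int        # filesystem block size in bytes


# Placeholder values only; none of these were measured on the real system.
EXAMPLE_LAYOUT = Layout(
    partition_start=1 * MIB,
    md_data_offset=0,
    chunk_size=64 * KIB,
    raid_devices=4,
    near_copies=2,
    lvm_data_start=1 * MIB,
    luks_payload_offset=2 * MIB,
    fs_block_size=4096,
)


def disk_offset_to_fs_block(disk_offset, raid_slot, layout):
    """Map a byte offset on one member disk to a filesystem block number.

    raid_slot is the device's RaidDevice number in the array. Returns None
    when the offset falls in front of the filesystem (metadata or headers).
    """
    member_offset = disk_offset - layout.partition_start - layout.md_data_offset
    if member_offset < 0:
        return None
    # Near layout: each chunk is stored on near_copies adjacent devices, so
    # one row across all devices holds raid_devices // near_copies chunks.
    chunks_per_row = layout.raid_devices // layout.near_copies
    row, within_chunk = divmod(member_offset, layout.chunk_size)
    column = raid_slot // layout.near_copies
    array_offset = (row * chunks_per_row + column) * layout.chunk_size + within_chunk
    # Assumption: the LV is linear and begins at the first extent of this PV.
    lv_offset = array_offset - layout.lvm_data_start
    if lv_offset < 0:
        return None
    fs_offset = lv_offset - layout.luks_payload_offset
    if fs_offset < 0:
        return None
    return fs_offset // layout.fs_block_size
```

The resulting block number is what you would hand to debugfs (icheck to get the inode, then ncheck to get the file name), as suggested earlier in the thread. ddrescue was run on whole disks here, which is why the partition start has to be subtracted first.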
maiku
Guru


Joined: 24 Mar 2004
Posts: 575
Location: Long Island, NY

PostPosted: Sun Sep 13, 2015 11:46 pm    Post subject: Reply with quote

Funny enough I can run a --detail on the drives but not an examine. Is that somehow bad?
Quote:
youjinbou ~ # mdadm --examine /dev/md2
mdadm: No md superblock detected on /dev/md2.
youjinbou ~ # mdadm --detail /dev/md2
/dev/md2:
Version : 0.90
Creation Time : Sat Aug 21 23:24:36 2010
Raid Level : raid10
Array Size : 3907023872 (3726.03 GiB 4000.79 GB)
Used Dev Size : 1953511936 (1863.01 GiB 2000.40 GB)
Raid Devices : 4
Total Devices : 2
Preferred Minor : 2
Persistence : Superblock is persistent

Update Time : Sun Sep 13 19:20:51 2015
State : clean, degraded
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0

Layout : near=2
Chunk Size : 64K

UUID : 516e42c8:a0719c72:21dc33df:a8a9fd27 (local to host youjinbou)
Events : 0.175927

Number Major Minor RaidDevice State
0 8 1 0 active sync set-A /dev/sda1
2 0 0 2 removed
4 0 0 4 removed
3 8 17 3 active sync set-B /dev/sdb1

_________________
Michael A. Leonetti
As warm as green tea


Last edited by maiku on Sun Sep 13, 2015 11:49 pm; edited 1 time in total