rubyg n00b

Joined: 22 Sep 2016 Posts: 1
Posted: Thu Sep 22, 2016 5:49 pm Post subject: RAID cluster Drive Failure
I apologize in advance: I am very new to this, so I am struggling with some potentially very basic issues. I have inherited a cluster at work that is having problems. Sadly, the person who set up the cluster no longer works here and will not reply to emails, so I cannot ask him how the system was put together. I have been left with exactly one page of notes and am trying to piece the rest together with my limited knowledge.
The system is a RAID10 cluster in a SuperMicro chassis. The cluster was set up using Rocks, per the notes I have, but the current Linux distribution appears to be Gentoo (as per /proc/version). From lspci it appears the controller is an Areca ARC-1880 8/12 port PCIe/PCI-X to SAS/SATA II RAID Controller.
From the controller configuration BIOS I can see that there are three volume sets (if someone could explain what this means, I would greatly appreciate it).
I hope the above is sufficient information to start tracking down the issue I am having, or at least for someone to point me toward somebody who might be able to help.
Recently I have been getting I/O errors in the dmesg log telling me that a drive (sda2) is failing:
Code: | [261894.168151] Write(10): 2a 00 32 3f 00 b0 00 03 78 00
[261894.168160] end_request: I/O error, dev sda, sector 842989744
[261894.168190] md/raid1:md2: Disk failure on sda2, disabling device.
md/raid1:md2: Operation continuing on 1 devices.
[261894.168206] sd 1:0:0:0: [sda] Unhandled error code
[261894.168209] sd 1:0:0:0: [sda]
[261894.168210] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[261894.168211] sd 1:0:0:0: [sda] CDB:
[261894.168212] Write(10): 2a 00 3d de 49 90 00 00 08 00
[261894.168217] end_request: I/O error, dev sda, sector 1037978000
[261894.168222] sd 1:0:0:0: [sda] Unhandled error code
[261894.168224] sd 1:0:0:0: [sda]
[261894.168225] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[261894.168227] sd 1:0:0:0: [sda] CDB:
[261894.168227] Write(10): 2a 00 4a 82 41 88 00 00 08 00
[261894.168232] end_request: I/O error, dev sda, sector 1250050440
[261894.168236] sd 1:0:0:0: [sda] Unhandled error code |
I first noticed these errors because they put the entire system onto a read-only file system, such that I cannot even reboot it (is there a way to make the system able to reboot again? I have tried logging in as root, but even then it cannot find the reboot command).
When I run 'cat /proc/mdstat' I can see that one of the drives has failed:
Code: | $ cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath]
md1 : active raid1 sdb1[1] sda1[0]
      204736 blocks [2/2] [UU]
md2 : active raid1 sda2[2](W)(F) sdb2[1](W)
      976556672 blocks [2/1] [_U]
unused devices: <none> |
When I reboot the cluster, this drive (sda2) will recover; the rebuild takes ~14-16 hours to complete.
I have looked in the controller BIOS event log and there are no events corresponding to the failure that the operating system is detecting. In the RAID chassis there are LEDs that are supposed to mark failed drives, but they are not showing any problems.
Based on this, does it look like a physical drive is failing, or what else might be going on? I know this is a very open-ended question and I apologize again for that.
If there is additional information that may be helpful, I am happy to provide it. I have root access to the cluster and full physical access as well.
Thanks so much in advance for any and all assistance with this.
Sincerely,
Ruby
Formatting fixed and code tags added by NeddySeagoon
frostschutz Advocate


Joined: 22 Feb 2005 Posts: 2971 Location: Germany
Posted: Thu Sep 22, 2016 11:00 pm Post subject:
so basically it's software raid, never mind the controller... you should be clear on which type of raid you're using, lest you end up with two systems fighting one another
looks like physical drive failure, smartctl -a for your drives?
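for example, something like the following (assuming the members really are /dev/sda and /dev/sdb as your dmesg and mdstat output suggest; the areca line is only a guess, in case the disks sit behind the Areca controller rather than plain SATA ports):
Code: | # SMART self-assessment, attributes and error log for each member disk
smartctl -a /dev/sda
smartctl -a /dev/sdb
# if the disks are behind the Areca controller, smartctl may need a device-type hint:
# smartctl -a -d areca,1 /dev/sg0 |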
please use code tags for output
NeddySeagoon Administrator


Joined: 05 Jul 2003 Posts: 44178 Location: 56N 3W
Posted: Fri Sep 23, 2016 9:41 am Post subject:
rubyg,
Welcome to Gentoo.
You have raid1, not raid10. Raid10 needs at least four drives.
It's normal to partition the drives that donate partitions to raid sets. Thus you have sd[ab]1 in your md1, which looks like your /boot,
and sd[ab]2 in md2 for everything else, unless you have more drives you are not telling us about.
You appear to be using software raid of one sort or another. It's quite possible to run some raid controllers in a Just A Bunch Of Drives (JBOD) mode.
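You can confirm what the md layer thinks it has with something like this (a sketch, using the md1/md2 and sda2/sdb2 names from your /proc/mdstat output):
Code: | # detailed state of each md array, including which members have failed
mdadm --detail /dev/md1
mdadm --detail /dev/md2
# raid superblock information stored on the member partitions themselves
mdadm --examine /dev/sda2 /dev/sdb2 |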
I'm guessing that once upon a time the system was set up with three raid sets provided by the raid controller, but that is old information now.
Code: | I/O error, dev sda, sector 842989744
I/O error, dev sda, sector 1250050440 |
suggests that sda is having problems in at least two locations.
The system should still boot normally, as the raid will operate in degraded mode. That it is read-only suggests there may be other issues.
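Something like this should show whether the kernel remounted the root filesystem read-only after the errors (assuming root lives on md2):
Code: | # how the root filesystem is currently mounted - look for "ro" in the options
mount | grep ' / '
# kernel messages about filesystems being forced read-only
dmesg | grep -iE 'remount|read-only' |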
If you need the data off this system, make a backup now.
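As a minimal sketch only, with placeholder host and paths rather than anything from your setup, that could be as simple as:
Code: | # copy the data to another machine while the degraded array still holds together,
# preserving permissions, ownership and timestamps
rsync -aHAX --numeric-ids /path/to/data/ backupuser@backuphost:/path/to/backup/ |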
The data requested by frostschutz will tell you whether the drives themselves think they have problems.
_________________
Regards,
NeddySeagoon
Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.