Gentoo Forums
RAID cluster Drive Failure
rubyg
n00b


Joined: 22 Sep 2016
Posts: 1

Posted: Thu Sep 22, 2016 5:49 pm    Post subject: RAID cluster Drive Failure

I apologize in advance: I am very new to this and am struggling with what are potentially very basic issues. I have inherited a cluster at work that is having some problems. Sadly, the person who set up the cluster no longer works here and does not reply to emails, so I cannot ask him how he set up this system. I have been left with exactly 1 page of notes and am trying to piece the rest together with my limited knowledge.

The system is a RAID10 cluster in a SuperMicro manifold. The cluster was set up using rocks, per the notes I have, but the current Linux distribution appears to be Gentoo (as per /proc/version). From lspci it appears the controller is an Areca ARC-1880 8/12 port PCIe/PCI-X to SAS/SATA II RAID Controller.

From the controller configuration BIOS I can see that there are three volume sets (If someone could explain what this means I would greatly appreciate it).

I hope the above is sufficient information to start tracking down the issue I am having or at least for someone to be able to point me in the direction of someone else who might be able to help.

Recently I have been getting I/O errors in the dmesg log telling me that I have a drive (sda2) failing:


Code:
[261894.168151] Write(10): 2a 00 32 3f 00 b0 00 03 78 00
[261894.168160] end_request: I/O error, dev sda, sector 842989744
[261894.168190] md/raid1:md2: Disk failure on sda2, disabling device.
md/raid1:md2: Operation continuing on 1 devices.
[261894.168206] sd 1:0:0:0: [sda] Unhandled error code
[261894.168209] sd 1:0:0:0: [sda]
[261894.168210] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[261894.168211] sd 1:0:0:0: [sda] CDB:
[261894.168212] Write(10): 2a 00 3d de 49 90 00 00 08 00
[261894.168217] end_request: I/O error, dev sda, sector 1037978000
[261894.168222] sd 1:0:0:0: [sda] Unhandled error code
[261894.168224] sd 1:0:0:0: [sda]
[261894.168225] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
[261894.168227] sd 1:0:0:0: [sda] CDB:
[261894.168227] Write(10): 2a 00 4a 82 41 88 00 00 08 00
[261894.168232] end_request: I/O error, dev sda, sector 1250050440
[261894.168236] sd 1:0:0:0: [sda] Unhandled error code


I first noticed these errors because they put the entire system into a read-only file system, to the point that I cannot even reboot it. (Is there a way to make the system allow rebooting again? I have tried logging in as root, but even then it will not find the command.)

When I run 'cat /proc/mdstat' I can see that one of the drives has failed:

Code:
$ cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath]
md1 : active raid1 sdb1[1] sda1[0]
204736 blocks [2/2] [UU]
md2 : active raid1 sda2[2](W)(F) sdb2[1](W)
976556672 blocks [2/1] [_U]
unused devices: <none>


When I reboot the cluster this drive (sda2) will recover (it takes ~14-16 hours for this to complete).

I have looked in the controller BIOS event log and there are no events corresponding to the failure detected by the operating system. In the RAID manifold there are LEDs that are supposed to indicate and mark failed drives, but the LEDs are not showing any problems.

Based on this, is a physical drive failing, or what else appears to be happening? I know this is a very open-ended question and again apologize for this.

If there is additional information that may be helpful I am happy to provide it. I have root control of the cluster and full physical access, as well.

Thanks so much in advance for any and all assistance with this.

Sincerely,
Ruby

Formatting fixed and code tags added by NeddySeagoon
frostschutz
Advocate


Joined: 22 Feb 2005
Posts: 2970
Location: Germany

Posted: Thu Sep 22, 2016 11:00 pm

so basically it's software raid, never mind the controller... you should be clear which type of raid you're using lest you have two systems fighting one another

looks like physical drive failure, smartctl -a for your drives?

please use code tags for output
NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 43178
Location: 56N 3W

Posted: Fri Sep 23, 2016 9:41 am

rubyg,

Welcome to Gentoo.

You have raid1, not raid10. Raid10 needs at least four drives.
It's normal to partition your drives and then donate the partitions to raid sets. Thus you have sd[ab]1 in your md1, which looks like your /boot
and sd[ab]2 for everything else. Unless you have more drives you are not telling us about.
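
The member list and the [n/m] counters in /proc/mdstat can also be read programmatically. A minimal Python sketch follows, assuming the two-line-per-array layout shown in the output posted above. The function name degraded_arrays and the SAMPLE text are illustrative only, and other mdstat layouts (inactive arrays, resync progress lines) are not handled.

```python
import re

# Sample text modelled on the /proc/mdstat output posted earlier in the thread.
SAMPLE = """\
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath]
md1 : active raid1 sdb1[1] sda1[0]
204736 blocks [2/2] [UU]
md2 : active raid1 sda2[2](W)(F) sdb2[1](W)
976556672 blocks [2/1] [_U]
unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Map each degraded md array to its faulty members and member counts."""
    result = {}
    current = None
    failed = []
    for line in mdstat_text.splitlines():
        # Array header, e.g. "md2 : active raid1 sda2[2](W)(F) sdb2[1](W)".
        header = re.match(r"^(md\d+) : \S+ \S+ (.*)$", line)
        if header:
            current = header.group(1)
            # Members carrying the (F) flag have been marked faulty.
            failed = re.findall(r"(\w+)\[\d+\](?:\(\w\))*\(F\)", header.group(2))
            continue
        # Status line, e.g. "976556672 blocks [2/1] [_U]".
        counts = re.search(r"\[(\d+)/(\d+)\]", line)
        if current and counts:
            expected, active = int(counts.group(1)), int(counts.group(2))
            if active < expected:
                result[current] = {
                    "failed": failed,
                    "expected": expected,
                    "active": active,
                }
            current = None
    return result


print(degraded_arrays(SAMPLE))
```

On a live system the same function could be fed the contents of /proc/mdstat instead of SAMPLE.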

You appear to be using software raid of one sort or another. It's quite possible to run some raid controllers in a Just A Bunch of Drives (JBOD) mode.
I'm guessing that once upon a time, the system was set up with three raid sets provided by the raid controller but that is old information now.

Code:
I/O error, dev sda, sector 842989744
I/O error, dev sda, sector 1250050440
suggests that sda is having problems in at least two locations.
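
To see how widely the bad sectors are spread, the I/O error lines can be pulled out of a saved dmesg log. This is only a rough Python sketch: the names failing_sectors, offset_gib and SAMPLE_LOG are made up for illustration, and the offset calculation assumes the kernel reports sector numbers in 512-byte units.

```python
import re
from collections import defaultdict

SECTOR_BYTES = 512  # assumption: sector numbers in the log are 512-byte units

# Lines copied from the dmesg excerpt posted earlier in the thread.
SAMPLE_LOG = """\
[261894.168160] end_request: I/O error, dev sda, sector 842989744
[261894.168217] end_request: I/O error, dev sda, sector 1037978000
[261894.168232] end_request: I/O error, dev sda, sector 1250050440
"""


def failing_sectors(dmesg_text):
    """Return {device: sorted list of distinct sectors that reported I/O errors}."""
    sectors = defaultdict(set)
    for dev, sector in re.findall(r"I/O error, dev (\w+), sector (\d+)", dmesg_text):
        sectors[dev].add(int(sector))
    return {dev: sorted(found) for dev, found in sectors.items()}


def offset_gib(sector):
    """Approximate distance of a sector from the start of the disk, in GiB."""
    return sector * SECTOR_BYTES / 2**30


for dev, found in failing_sectors(SAMPLE_LOG).items():
    for sector in found:
        print(f"{dev}: sector {sector} is about {offset_gib(sector):.1f} GiB into the disk")
```

Errors that cluster in one region and errors scattered across the whole disk can point to different causes, so the list is worth keeping alongside the SMART data.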

The system should still boot normally, as the raid will operate in degraded mode. That it's read-only suggests there may be other issues.
If you need the data off this system, make a backup now.

The data requested by frostschutz will tell if the drives themselves think that they have problems.
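
One way to gather that data for both drives in one go is sketched below in Python. It assumes smartctl (from smartmontools) is installed, that the script runs as root, and that the drives are /dev/sda and /dev/sdb as in the mdstat output; the helper names smartctl_command and collect_smart_reports are hypothetical.

```python
import shutil
import subprocess


def smartctl_command(device):
    """Build the 'smartctl -a' invocation for one drive."""
    return ["smartctl", "-a", device]


def collect_smart_reports(devices):
    """Run 'smartctl -a' on each device and return {device: report text}."""
    if shutil.which("smartctl") is None:
        raise RuntimeError("smartctl not found; it is part of smartmontools")
    reports = {}
    for device in devices:
        # smartctl's exit status is a bitmask, so a non-zero code is not
        # treated as a fatal error here; the report text is kept regardless.
        done = subprocess.run(
            smartctl_command(device), capture_output=True, text=True
        )
        reports[device] = done.stdout
    return reports


if __name__ == "__main__":
    # Assumed device names, taken from the mdstat output in this thread.
    for device, report in collect_smart_reports(["/dev/sda", "/dev/sdb"]).items():
        print(f"===== {device} =====")
        print(report)
```

The printed reports can then be pasted into a reply inside code tags.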
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.