Gentoo Forums
[solved]AACRAID offlined (I think); system crashed
dobbs
Tux's lil' helper


Joined: 20 Aug 2005
Posts: 103
Location: Wenatchee, WA

Posted: Wed Jan 15, 2014 9:16 am    Post subject: [solved]AACRAID offlined (I think); system crashed

About me:
I'm just a hobbyist, but I have been using Gentoo since around 2002. I run a project repository server for myself and a few friends (we're software developers). I enjoy cobbling together old hardware and making it work, but I'm in a little over my head with this crash.

The hardware:
Normally headless Dell PowerEdge 2650 server with root on JFS on RAID0 (striped), and sensitive data on ReiserFS v3 on a 3-drive RAID5. RAID is a Dell PERC 3/Di SCSI backplane (hardware RAID with battery backup) using the AACRAID module.

Synopsis:
Drives went offline in the middle of "emerge -DavuN world", system forcibly rebooted.

The story so far (optional reading):
Partway through my most recent "emerge -DavuN world", I suddenly found myself unable to log in. Some services continued to run, but I had no means of logging in (the emerge was left running under a screen session). SSH would connect but immediately close the connection. getty would go into a rapid restart loop and get throttled by init when I tried to enter my username on the console. The drive LEDs showed green status and zero disk activity. Eventually I tried the magic SysRq key to sync disks and remount the filesystems read-only. The AACRAID module spit out errors claiming the disks were offline. I couldn't think of anything else to try, so I forced a reboot.

As expected, the partitions were checked for validity on reboot, but no problems were reported. A few seemingly random and unrelated (to each other) errors appeared while loading services; logging in produced some shell syntax errors. I eventually deduced that profile.env was corrupted with gibberish. So much for journaling preventing data loss. env-update fixed that, so I rebooted cleanly, and everything loaded without error. I was also somewhat skeptical that fsck did anything (it completed far too quickly), so I threw some extra flags at fsck. Still no reported problems, but at least it took substantially longer.

The aftermath:
There are 1584 files in /lost+found (the JFS root partition), most of which identify as gzip or bzip2 files, likely from the now entirely empty /usr/portage/distfiles directory. Portage can't understand /var/lib/portage/preserved_libs_registry:
Code:
!!! Error loading '/var/lib/portage/preserved_libs_registry': 'utf-8' codec can't decode byte 0xd2 in position 472: invalid continuation byte

The missing distfiles are not an issue. The possibly corrupted files pseudo-randomly distributed throughout my system have me worried. I could fix preserved_libs_registry with an "emerge -eva world", but I haven't addressed the root cause of the crash yet.
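For reference, the Portage message above is Python's standard UTF-8 decode error. A minimal sketch, assuming nothing about Portage's internals, shows how to find the first undecodable byte in any blob of data. The function name `first_invalid_utf8` and the sample bytes are made up for illustration; the sample only mimics the offset and byte value from the error.

```python
def first_invalid_utf8(data: bytes):
    """Return (offset, byte value, reason) for the first undecodable byte, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the offset of the byte that could not be decoded.
        return err.start, data[err.start], err.reason
    return None


# Hypothetical sample: 472 clean ASCII bytes, then 0xd2 (a two-byte lead)
# followed by a byte that is not a valid continuation byte.
sample = b"a" * 472 + b"\xd2\x41"
print(first_invalid_utf8(sample))
```

To check a real file, read it in binary mode (`open(path, "rb").read()`) and pass the result in. A clean file returns None.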

The real problem (as I understand it):
A couple months back, a disk in the RAID5 array started failing (it was OLD), so I replaced it. Being a hobbyist, I couldn't exactly afford the best SCSI drives, so I basically cheaped out and bought three Seagates to test out. I replaced all three drives simultaneously. I'm not terribly worried about data loss there: RAID5, backups, constant monitoring... I prepared for a Seagate fail. A month after install, one of the new drives failed (not exactly a surprise from a Seagate...); it was throwing SMART prefail warnings and would occasionally drop offline. The drive was RMAed and replaced.

Unfortunately, the AACRAID module now has problems under heavy I/O. It will occasionally printk like so:
Quote:
[550207.327866] AAC:ID(0:01:0) Timeout detected on cmd[0x2a]
[550207.418505] AAC:SCSI Channel[0]: Timeout Detected On 168 Command(s)
[550217.294107] AAC:ID(0:01:0); Abort Timeout. Resetting Bus 0
[550217.295362] AAC:MpdResetScsiBus( ):SCSI bus reset issued on channel 0
[558549.833907] AAC:ID(0:01:0) Timeout detected on cmd[0x2a]
[558549.850094] AAC:SCSI Channel[0]: Timeout Detected On 28 Command(s)
[558559.584039] AAC:ID(0:01:0); Abort Timeout. Resetting Bus 0
[558559.585290] AAC:MpdResetScsiBus( ):SCSI bus reset issued on channel 0

Always the same ID. Never when system is idle. Takes a while under load before an error appears. Haven't tied CPU load (or rather heat) to the issue, but also haven't ruled it out.
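The "always the same ID" observation can be checked mechanically over a longer log. The following is a rough sketch that tallies timeouts per AAC ID. The regular expression is an assumption based only on the lines quoted above, and `count_timeouts` is a made-up helper name. The `cmd[0x2a]` value is presumably the SCSI opcode, which would make it WRITE(10).

```python
import re
from collections import Counter

# Pattern inferred from the quoted kernel messages; other driver versions
# may word these lines differently.
TIMEOUT_RE = re.compile(r"AAC:ID\((\d+:\d+:\d+)\) Timeout detected on cmd\[(0x[0-9a-fA-F]+)\]")


def count_timeouts(log_text: str) -> Counter:
    """Count 'Timeout detected' messages per AAC ID (B:ID:L)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = TIMEOUT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample_log = """\
[550207.327866] AAC:ID(0:01:0) Timeout detected on cmd[0x2a]
[550207.418505] AAC:SCSI Channel[0]: Timeout Detected On 168 Command(s)
[550217.294107] AAC:ID(0:01:0); Abort Timeout. Resetting Bus 0
[550217.295362] AAC:MpdResetScsiBus( ):SCSI bus reset issued on channel 0
[558549.833907] AAC:ID(0:01:0) Timeout detected on cmd[0x2a]
[558549.850094] AAC:SCSI Channel[0]: Timeout Detected On 28 Command(s)
[558559.584039] AAC:ID(0:01:0); Abort Timeout. Resetting Bus 0
[558559.585290] AAC:MpdResetScsiBus( ):SCSI bus reset issued on channel 0
"""

for aac_id, hits in count_timeouts(sample_log).items():
    print(aac_id, hits)
```

Feeding it the full kernel log would show whether any ID other than 0:01:0 ever times out.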

Questions:
How can I verify that the root filesystem is sane?

Could someone tell me whether or not "AAC:ID(0:01:0)" equates to "sd 0:0:1:0: [sdb]"? sdb is the RAID5 container, and I would assume that error would refer to a physical disk. So maybe it means "scsi 0:1:1:0: Attached scsi generic sg3 type 0" despite differing bus numbers? Also, the heavy I/O during Portage builds is all on the sda container, so I wouldn't expect errors relating to drives not under load.

What am I looking at here, is my RAID controller failing? Recent bug in AACRAID (currently kernel v3.10.7)? One bad drive ruining the bus for the others?
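On the first question, one possible approach is to compare installed files against the checksums Portage recorded at install time; tools such as qcheck (portage-utils) and equery check (gentoolkit) do this. The sketch below shows the idea only. It assumes the CONTENTS files under /var/db/pkg use lines of the form `obj <path> <md5> <mtime>`, and `check_contents` is a made-up helper, not part of Portage.

```python
import glob
import hashlib
import os


def check_contents(contents_path: str, root: str = "/") -> list:
    """Return (path, problem) tuples for 'obj' entries that fail verification."""
    problems = []
    with open(contents_path, encoding="utf-8", errors="replace") as listing:
        for line in listing:
            if not line.startswith("obj "):
                continue  # skip 'dir' and 'sym' entries
            try:
                # Paths may contain spaces; md5 and mtime are the last two fields.
                path, md5, _mtime = line[4:].rstrip("\n").rsplit(" ", 2)
            except ValueError:
                problems.append((line.strip(), "unparseable entry"))
                continue
            full_path = os.path.join(root, path.lstrip("/"))
            digest = hashlib.md5()
            try:
                with open(full_path, "rb") as installed:
                    for chunk in iter(lambda: installed.read(1 << 20), b""):
                        digest.update(chunk)
            except OSError:
                problems.append((path, "missing or unreadable"))
                continue
            if digest.hexdigest() != md5:
                problems.append((path, "md5 mismatch"))
    return problems


if __name__ == "__main__":
    # Assumed location of the installed-package database on Gentoo.
    for contents in sorted(glob.glob("/var/db/pkg/*/*/CONTENTS")):
        for path, problem in check_contents(contents):
            print(f"{problem}: {path}")
```

Files edited after installation (most of /etc, for example) will show up as mismatches without being corrupt, and files not owned by any package are not covered at all.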


Last edited by dobbs on Mon Jan 20, 2014 11:45 pm; edited 1 time in total
dobbs
Tux's lil' helper


Joined: 20 Aug 2005
Posts: 103
Location: Wenatchee, WA

Posted: Mon Jan 20, 2014 11:44 pm    Post subject:

Note to self: afacli is the answer.

Code:
AFA0> container list
Executing: container list
Num          Total  Oth Chunk          Scsi   Partition
Label Type   Size   Ctr Size   Usage   B:ID:L Offset:Size
----- ------ ------ --- ------ ------- ------ -------------
 0    Stripe  135GB       64KB Valid   0:03:0 64.0KB:67.7GB
 /dev/sda             Boot             0:01:0 64.0KB:67.7GB

 1    RAID-5  273GB       64KB Valid   0:00:0 64.0KB: 136GB
 /dev/sdb             Data             0:02:0 64.0KB: 136GB
                                       0:04:0 64.0KB: 136GB

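AAC:ID(0:01:0) is B:ID:L 0:01:0 in that listing: a physical disk in the stripe container (/dev/sda), which is where the Portage I/O lands. So I have a third failing disk. Bah.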
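The lookup from an AAC ID to its container can be scripted if the listing is saved to a file. This is a minimal sketch that assumes the column layout shown in the afacli output above stays the same; `map_disks_to_containers` is a made-up helper name, and the parsing rules are guesses from this one listing.

```python
import re


def map_disks_to_containers(listing: str) -> dict:
    """Map each 'B:ID:L' to a (container number, device node) pair."""
    disk_to_container = {}
    container_device = {}
    current = None
    for line in listing.splitlines():
        # A line starting with a bare number opens a new container.
        start = re.match(r"\s*(\d+)\s+\S+", line)
        if start:
            current = int(start.group(1))
        # The device node appears on the container's second line.
        device = re.match(r"\s*(/dev/\S+)", line)
        if device and current is not None:
            container_device[current] = device.group(1)
        # Each member disk is listed as bus:id:lun.
        disk = re.search(r"\b(\d+:\d+:\d+)\b", line)
        if disk and current is not None:
            disk_to_container[disk.group(1)] = current
    return {
        bil: (num, container_device.get(num))
        for bil, num in disk_to_container.items()
    }


sample_listing = """\
Num          Total  Oth Chunk          Scsi   Partition
Label Type   Size   Ctr Size   Usage   B:ID:L Offset:Size
----- ------ ------ --- ------ ------- ------ -------------
 0    Stripe  135GB       64KB Valid   0:03:0 64.0KB:67.7GB
 /dev/sda             Boot             0:01:0 64.0KB:67.7GB

 1    RAID-5  273GB       64KB Valid   0:00:0 64.0KB: 136GB
 /dev/sdb             Data             0:02:0 64.0KB: 136GB
                                       0:04:0 64.0KB: 136GB
"""

mapping = map_disks_to_containers(sample_listing)
print(mapping["0:01:0"])
```

Looking up the ID from the kernel messages this way gives the container and device node that the timeouts belong to.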