Gentoo Forums
[Solved] System lock, maybe RAID?
MikeHartman
n00b


Joined: 27 Jul 2009
Posts: 29

Posted: Sun Sep 05, 2010 10:27 pm    Post subject: [Solved] System lock, maybe RAID?

I haven't changed anything on my server in a few days, and everything has been running as normal.

I had a few open ssh prompts to the server, one using screen and one without, and noticed that they'd stopped responding - I couldn't type anything. I killed those terminals on my end and tried starting up a new one. It asked me for my password and accepted it, I even saw the MOTD and the prompt, but then it hung - I couldn't type anything again. I tried it several more times and with a few different users - same issue every time. Once I managed to get an "l" typed in before it hung, but I'm not sure if that actually went to the server or if it was just a client-server sync thing.

I physically sat down in front of the server and it seems like a similar thing there. The status LED on the monitor indicates it's getting a signal (it's not sleeping) but the screen is just black instead of displaying the terminal I left it on (X wasn't running at this time). I tried to switch over to another vt but it doesn't look like anything is happening - nothing changes on the blank screen anyway.

The interesting thing is, I've tried ssh connections that run a command instead of /bin/bash and they seem to work. I can open gedit remotely, list directories, list processes, etc. So the system itself doesn't seem to be frozen. My remote ssh mounts of various directories all still work and I can browse around the system. If I list processes I can see a bunch of bash instances out there, presumably from all the times I've tried to log in:

ps aux | grep bash:
Code:

mike       978  0.0  0.0   3140  1492 ?        Ds   04:47   0:00 -bash
mike      1066  0.0  0.0   3140  1496 pts/7    Ss+  04:49   0:00 -bash
root      1124  0.0  0.0   2748  1424 pts/8    Ss+  04:50   0:00 -bash
root      1546  0.0  0.0   2748  1424 ?        Ds   05:01   0:00 -bash
mike      1977  0.0  0.0   3140  1500 pts/10   Ss+  05:13   0:00 -bash
mike      2417  0.0  0.0   3140  1496 pts/11   Ss+  05:24   0:00 -bash
root      4592  0.0  0.0   2884  1648 pts/5    S+   Sep05   0:00 bash
root      5943  0.0  0.0   2740  1008 ?        Ss   Sep04   0:00 /bin/bash /usr/bin/pyTivo
mike     15898  0.0  0.0   3144  1612 pts/1    Ss+  Sep04   0:00 -/bin/bash
mike     19308  0.0  0.0   3140  1628 ?        Ds   Sep04   0:00 -bash
mike     28478  0.0  0.0   3144  1508 pts/3    Ss+  Sep05   0:00 -/bin/bash
mike     28789  0.0  0.0   3144  1508 pts/4    Ss   Sep05   0:00 -/bin/bash
root     28804  0.0  0.0   2880  1628 pts/4    S+   Sep05   0:00 bash
mike     30915  0.0  0.0   3144  1620 pts/5    Ss   Sep05   0:00 -/bin/bash
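
In the STAT column of that listing, a leading D conventionally marks a task in uninterruptible sleep, usually waiting on I/O, and the three bash logins with no terminal (PIDs 978, 1546 and 19308) are all in that state. A minimal sketch for picking such rows out of `ps aux` output follows; the `d_state` helper name is made up for illustration, and it assumes the standard `ps aux` column order with STAT as the eighth field.

```shell
#!/bin/sh
# Print only the rows of `ps aux` output whose STAT column (field 8)
# starts with D, i.e. tasks in uninterruptible sleep.
d_state() {
    awk '$8 ~ /^D/'
}

# Three rows copied from the listing above.
sample='mike       978  0.0  0.0   3140  1492 ?        Ds   04:47   0:00 -bash
mike      1066  0.0  0.0   3140  1496 pts/7    Ss+  04:49   0:00 -bash
root      1546  0.0  0.0   2748  1424 ?        Ds   05:01   0:00 -bash'

printf '%s\n' "$sample" | d_state
# On a live system: ps aux | d_state
```

Because it only needs the output of `ps`, the same filter could be fed through the non-interactive workaround described above, for example `ssh odin ps aux | d_state`, with the filtering done on the client side.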


I'm trying to avoid restarting the machine because there's a long-running RAID reshaping (5->6) that's been going on for almost a day and is expected to go on for another 2. I'm afraid that's got something to do with the problem, judging by the dmesg output attached below, but I can't figure out why. These RAID arrays are only used for storage - they're not involved with anything system-related. And the system drives are still functioning - I can list directory contents and stuff. These arrays and their mount points aren't mentioned in the bash profiles or anything like that.

Does anyone have any idea:

- Whether my reshaping is still occurring as it should or whether it's halted?
- If it's halted, any way to restart it?
- How any of this would explain why bash is hung but other processes are ok?

dmesg:
Code:

 sdg:
 sdk:
 sdg: sdg1
 sdk: sdk1
md: bind<sdg1>
md: bind<sdk1>
md/raid0:md1: looking at sdk1
md/raid0:md1:   comparing sdk1(1465141760) with sdk1(1465141760)
md/raid0:md1:   END
md/raid0:md1:   ==> UNIQUE
md/raid0:md1: 1 zones
md/raid0:md1: looking at sdg1
md/raid0:md1:   comparing sdg1(1465141760) with sdk1(1465141760)
md/raid0:md1:   EQUAL
md/raid0:md1: FINAL 1 zones
md/raid0:md1: done.
md/raid0:md1: md_size is 2930283520 sectors.
******* md1 configuration *********
zone0=[sdg1/sdk1/]
        zone offset=0kb device offset=0kb size=1465141760kb
**********************************

md1: detected capacity change from 0 to 1500305162240
md1: detected capacity change from 0 to 1500305162240
 md1: unknown partition table
 md1: p1
 md1: p1
md: bind<md1p1>
md/raid:md0: device sdj1 operational as raid disk 2
md/raid:md0: device sdh1 operational as raid disk 1
md/raid:md0: device sdi1 operational as raid disk 0
md/raid:md0: allocated 4221kB
md/raid:md0: raid level 6 active with 3 out of 4 devices, algorithm 18
RAID conf printout:
 --- level:6 rd:4 wd:3
 disk 0, o:1, dev:sdi1
 disk 1, o:1, dev:sdh1
 disk 2, o:1, dev:sdj1
md0: Warning: Device md1p1 is misaligned
RAID conf printout:
 --- level:6 rd:4 wd:3
 disk 0, o:1, dev:sdi1
 disk 1, o:1, dev:sdh1
 disk 2, o:1, dev:sdj1
 disk 3, o:1, dev:md1p1
md: reshape of RAID array md0
md: minimum _guaranteed_  speed: 200000 KB/sec/disk.
md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reshape.
md: using 128k window, over a total of 1464845568 blocks.
EXT4-fs (md0): mounted filesystem with ordered data mode. Opts: (null)
EXT4-fs (md0): mounted filesystem with ordered data mode. Opts: (null)
EXT4-fs (md0): mounted filesystem with ordered data mode. Opts: (null)
INFO: task events/1:10 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
events/1      D f70b1ecc     0    10      2 0x00000000
 f7080370 00000046 00000002 f70b1ecc f7051b80 f70804dc 00000001 f69ff500
 00000000 00000000 00000003 d9c5a530 00000000 00000282 00000092 00000246
 c1051ee1 00000001 f6a1b800 c1051cb0 00000000 00000000 c14eacdd 00000001
Call Trace:
 [<c1051ee1>] ? prepare_to_wait+0x21/0x70
 [<c1051cb0>] ? autoremove_wake_function+0x0/0x50
 [<c14eacdd>] ? make_request+0x2ed/0x860
 [<c1051cb0>] ? autoremove_wake_function+0x0/0x50
 [<c14f28ce>] ? md_submit_barrier+0x8e/0xd0
 [<c104ea57>] ? worker_thread+0x127/0x220
 [<c14f2840>] ? md_submit_barrier+0x0/0xd0
 [<c1051cb0>] ? autoremove_wake_function+0x0/0x50
 [<c104e930>] ? worker_thread+0x0/0x220
 [<c1051a54>] ? kthread+0x74/0x80
 [<c10519e0>] ? kthread+0x0/0x80
 [<c1003036>] ? kernel_thread_helper+0x6/0x30
INFO: task md0_raid5:5428 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
md0_raid5     D dd5c7eb8     0  5428      2 0x00000008
 cf4a0dc0 00000046 00000002 dd5c7eb8 c10e4383 cf4a0f2c 00000001 cfd44700
 00000000 00000401 f6bea180 d9c10300 00000001 c12dbc88 d9c10348 00000246
 c1051ee1 00000001 f6a1b800 f6a1b9d8 f6a1b9dc 00000000 c14eeffb 00000000
Call Trace:
 [<c10e4383>] ? __bio_add_page+0x163/0x1e0
 [<c12dbc88>] ? submit_bio+0x48/0xc0
 [<c1051ee1>] ? prepare_to_wait+0x21/0x70
 [<c14eeffb>] ? md_super_wait+0xbb/0xd0
 [<c1051cb0>] ? autoremove_wake_function+0x0/0x50
 [<c14efa9d>] ? md_update_sb+0x2cd/0x490
 [<c14f4122>] ? md_check_recovery+0x1f2/0x4d0
 [<c14eb26a>] ? raid5d+0x1a/0x490
 [<c1051ee1>] ? prepare_to_wait+0x21/0x70
 [<c14f463a>] ? md_thread+0x2a/0xe0
 [<c1051cb0>] ? autoremove_wake_function+0x0/0x50
 [<c14f4610>] ? md_thread+0x0/0xe0
 [<c1051a54>] ? kthread+0x74/0x80
 [<c10519e0>] ? kthread+0x0/0x80
 [<c1003036>] ? kernel_thread_helper+0x6/0x30
INFO: task md0_reshape:5438 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
md0_reshape   D f65be968     0  5438      2 0x00000000
 cf4a1ef0 00000046 f65be96c f65be968 00000000 cf4a205c 00000000 f69ff500
 00000000 00000000 00000003 e4f06300 c14e2917 00000286 c6b7c000 00000246
 c1051ee1 00000000 dd7aaaf8 00000000 d114be68 00000000 c14e951a 00000001
Call Trace:
 [<c14e2917>] ? __release_stripe+0xa7/0x140
 [<c1051ee1>] ? prepare_to_wait+0x21/0x70
 [<c14e951a>] ? sync_request+0xb1a/0x1140
 [<c13e2650>] ? ata_scsi_queuecmd+0xd0/0x210
 [<c13df110>] ? ata_scsi_rw_xlat+0x0/0x220
 [<c1051cb0>] ? autoremove_wake_function+0x0/0x50
 [<c14e5233>] ? unplug_slaves+0x63/0xa0
 [<c14e8a00>] ? sync_request+0x0/0x1140
 [<c14f69c1>] ? md_do_sync+0xab1/0x1020
 [<c1051cb0>] ? autoremove_wake_function+0x0/0x50
 [<c14f463a>] ? md_thread+0x2a/0xe0
 [<c1035c9d>] ? complete+0x3d/0x60
 [<c14f4610>] ? md_thread+0x0/0xe0
 [<c1051a54>] ? kthread+0x74/0x80
 [<c10519e0>] ? kthread+0x0/0x80
 [<c1003036>] ? kernel_thread_helper+0x6/0x30
INFO: task jbd2/md0-8:16712 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
jbd2/md0-8    D f563be5c     0 16712      2 0x00000000
 f54bdb80 00000046 00000002 f563be5c 00000086 f54bdcec 00000000 f6be8fc0
 00000003 f68325a0 dd7aaa00 f6a1b800 00000000 c14ebf55 00000292 c14e52d0
 dd7aab0c 00000001 c3808280 f54bdb80 f563bea8 c35edbe4 c16007a1 f563bea0
Call Trace:
 [<c14ebf55>] ? md_wakeup_thread+0x25/0x30
 [<c14e52d0>] ? raid5_unplug_device+0x60/0xc0
 [<c16007a1>] ? io_schedule+0x31/0x50
 [<c10e0cd5>] ? sync_buffer+0x35/0x40
 [<c1600bd2>] ? __wait_on_bit+0x42/0x70
 [<c10e0ca0>] ? sync_buffer+0x0/0x40
 [<c1051d00>] ? wake_bit_function+0x0/0x60
 [<c10e0ca0>] ? sync_buffer+0x0/0x40
 [<c1600c72>] ? out_of_line_wait_on_bit+0x72/0x90
 [<c1051d00>] ? wake_bit_function+0x0/0x60
 [<c10e0c26>] ? __wait_on_buffer+0x26/0x30
 [<c118cc87>] ? jbd2_journal_commit_transaction+0x737/0x12e0
 [<c1192b63>] ? kjournald2+0x93/0x1d0
 [<c1051cb0>] ? autoremove_wake_function+0x0/0x50
 [<c1192ad0>] ? kjournald2+0x0/0x1d0
 [<c1051a54>] ? kthread+0x74/0x80
 [<c10519e0>] ? kthread+0x0/0x80
 [<c1003036>] ? kernel_thread_helper+0x6/0x30
INFO: task flush-9:0:31449 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
flush-9:0     D ce7ddb98     0 31449      2 0x00000000
 f54be940 00000046 00000002 ce7ddb98 ccffff88 f54beaac 00000000 f6be8fc0
 00000000 c106ff64 c1070360 c3803d60 00000000 00000246 00000046 00000246
 c1051ee1 00000001 f6a1b800 f6a1b9dc ce7ddbbc 00000001 c14f4485 00001000
Call Trace:
 [<c106ff64>] ? cpu_needs_another_gp+0x14/0x20
 [<c1070360>] ? rcu_start_gp+0x130/0x1a0
 [<c1051ee1>] ? prepare_to_wait+0x21/0x70
 [<c14f4485>] ? md_make_request+0x85/0x210
 [<c1051cb0>] ? autoremove_wake_function+0x0/0x50
 [<c12d8437>] ? generic_make_request+0x267/0x3f0
 [<c10908e2>] ? mempool_alloc+0x32/0x100
 [<c12dbc88>] ? submit_bio+0x48/0xc0
 [<c10e445f>] ? bio_init+0xf/0x30
 [<c10e4e66>] ? bio_alloc_bioset+0x46/0xd0
 [<c10e3070>] ? end_buffer_async_write+0x0/0x100
 [<c10dfbe8>] ? submit_bh+0xd8/0x130
 [<c10e21f2>] ? __block_write_full_page+0x212/0x380
 [<c1158880>] ? noalloc_get_block_write+0x0/0x40
 [<c10e240a>] ? block_write_full_page_endio+0xaa/0xd0
 [<c10e3070>] ? end_buffer_async_write+0x0/0x100
 [<c1158880>] ? noalloc_get_block_write+0x0/0x40
 [<c10e243f>] ? block_write_full_page+0xf/0x20
 [<c10e3070>] ? end_buffer_async_write+0x0/0x100
 [<c1156b2b>] ? mpage_da_submit_io+0xbb/0x110
 [<c115bb67>] ? ext4_da_writepages+0x467/0xa60
 [<c115b700>] ? ext4_da_writepages+0x0/0xa60
 [<c1094c4a>] ? do_writepages+0x1a/0x30
 [<c10da4f4>] ? writeback_single_inode+0x74/0x230
 [<c10daa15>] ? writeback_sb_inodes+0xa5/0x110
 [<c10dad22>] ? writeback_inodes_wb+0xe2/0x110
 [<c10daf0f>] ? wb_writeback+0x1bf/0x210
 [<c1044b07>] ? lock_timer_base+0x27/0x60
 [<c16008e7>] ? schedule_timeout+0x107/0x240
 [<c10daff3>] ? wb_do_writeback+0x93/0x150
 [<c10db09a>] ? wb_do_writeback+0x13a/0x150
 [<c10db0e2>] ? bdi_writeback_task+0x32/0xf0
 [<c10a10e0>] ? bdi_start_fn+0x0/0xa0
 [<c10a1132>] ? bdi_start_fn+0x52/0xa0
 [<c10a10e0>] ? bdi_start_fn+0x0/0xa0
 [<c1051a54>] ? kthread+0x74/0x80
 [<c10519e0>] ? kthread+0x0/0x80
 [<c1003036>] ? kernel_thread_helper+0x6/0x30
INFO: task events/1:10 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
events/1      D f70b1ecc     0    10      2 0x00000000
 f7080370 00000046 00000002 f70b1ecc f7051b80 f70804dc 00000001 f69ff500
 00000000 00000000 00000003 d9c5a530 00000000 00000282 00000092 00000246
 c1051ee1 00000001 f6a1b800 c1051cb0 00000000 00000000 c14eacdd 00000001
Call Trace:
 [<c1051ee1>] ? prepare_to_wait+0x21/0x70
 [<c1051cb0>] ? autoremove_wake_function+0x0/0x50
 [<c14eacdd>] ? make_request+0x2ed/0x860
 [<c1051cb0>] ? autoremove_wake_function+0x0/0x50
 [<c14f28ce>] ? md_submit_barrier+0x8e/0xd0
 [<c104ea57>] ? worker_thread+0x127/0x220
 [<c14f2840>] ? md_submit_barrier+0x0/0xd0
 [<c1051cb0>] ? autoremove_wake_function+0x0/0x50
 [<c104e930>] ? worker_thread+0x0/0x220
 [<c1051a54>] ? kthread+0x74/0x80
 [<c10519e0>] ? kthread+0x0/0x80
 [<c1003036>] ? kernel_thread_helper+0x6/0x30
INFO: task md0_raid5:5428 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
md0_raid5     D dd5c7eb8     0  5428      2 0x00000008
 cf4a0dc0 00000046 00000002 dd5c7eb8 c10e4383 cf4a0f2c 00000001 cfd44700
 00000000 00000401 f6bea180 d9c10300 00000001 c12dbc88 d9c10348 00000246
 c1051ee1 00000001 f6a1b800 f6a1b9d8 f6a1b9dc 00000000 c14eeffb 00000000
Call Trace:
 [<c10e4383>] ? __bio_add_page+0x163/0x1e0
 [<c12dbc88>] ? submit_bio+0x48/0xc0
 [<c1051ee1>] ? prepare_to_wait+0x21/0x70
 [<c14eeffb>] ? md_super_wait+0xbb/0xd0
 [<c1051cb0>] ? autoremove_wake_function+0x0/0x50
 [<c14efa9d>] ? md_update_sb+0x2cd/0x490
 [<c14f4122>] ? md_check_recovery+0x1f2/0x4d0
 [<c14eb26a>] ? raid5d+0x1a/0x490
 [<c1051ee1>] ? prepare_to_wait+0x21/0x70
 [<c14f463a>] ? md_thread+0x2a/0xe0
 [<c1051cb0>] ? autoremove_wake_function+0x0/0x50
 [<c14f4610>] ? md_thread+0x0/0xe0
 [<c1051a54>] ? kthread+0x74/0x80
 [<c10519e0>] ? kthread+0x0/0x80
 [<c1003036>] ? kernel_thread_helper+0x6/0x30
INFO: task md0_reshape:5438 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
md0_reshape   D f65be968     0  5438      2 0x00000000
 cf4a1ef0 00000046 f65be96c f65be968 00000000 cf4a205c 00000000 f69ff500
 00000000 00000000 00000003 e4f06300 c14e2917 00000286 c6b7c000 00000246
 c1051ee1 00000000 dd7aaaf8 00000000 d114be68 00000000 c14e951a 00000001
Call Trace:
 [<c14e2917>] ? __release_stripe+0xa7/0x140
 [<c1051ee1>] ? prepare_to_wait+0x21/0x70
 [<c14e951a>] ? sync_request+0xb1a/0x1140
 [<c13e2650>] ? ata_scsi_queuecmd+0xd0/0x210
 [<c13df110>] ? ata_scsi_rw_xlat+0x0/0x220
 [<c1051cb0>] ? autoremove_wake_function+0x0/0x50
 [<c14e5233>] ? unplug_slaves+0x63/0xa0
 [<c14e8a00>] ? sync_request+0x0/0x1140
 [<c14f69c1>] ? md_do_sync+0xab1/0x1020
 [<c1051cb0>] ? autoremove_wake_function+0x0/0x50
 [<c14f463a>] ? md_thread+0x2a/0xe0
 [<c1035c9d>] ? complete+0x3d/0x60
 [<c14f4610>] ? md_thread+0x0/0xe0
 [<c1051a54>] ? kthread+0x74/0x80
 [<c10519e0>] ? kthread+0x0/0x80
 [<c1003036>] ? kernel_thread_helper+0x6/0x30
INFO: task jbd2/md0-8:16712 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
jbd2/md0-8    D f563be5c     0 16712      2 0x00000000
 f54bdb80 00000046 00000002 f563be5c 00000086 f54bdcec 00000000 f6be8fc0
 00000003 f68325a0 dd7aaa00 f6a1b800 00000000 c14ebf55 00000292 c14e52d0
 dd7aab0c 00000001 c3808280 f54bdb80 f563bea8 c35edbe4 c16007a1 f563bea0
Call Trace:
 [<c14ebf55>] ? md_wakeup_thread+0x25/0x30
 [<c14e52d0>] ? raid5_unplug_device+0x60/0xc0
 [<c16007a1>] ? io_schedule+0x31/0x50
 [<c10e0cd5>] ? sync_buffer+0x35/0x40
 [<c1600bd2>] ? __wait_on_bit+0x42/0x70
 [<c10e0ca0>] ? sync_buffer+0x0/0x40
 [<c1051d00>] ? wake_bit_function+0x0/0x60
 [<c10e0ca0>] ? sync_buffer+0x0/0x40
 [<c1600c72>] ? out_of_line_wait_on_bit+0x72/0x90
 [<c1051d00>] ? wake_bit_function+0x0/0x60
 [<c10e0c26>] ? __wait_on_buffer+0x26/0x30
 [<c118cc87>] ? jbd2_journal_commit_transaction+0x737/0x12e0
 [<c1192b63>] ? kjournald2+0x93/0x1d0
 [<c1051cb0>] ? autoremove_wake_function+0x0/0x50
 [<c1192ad0>] ? kjournald2+0x0/0x1d0
 [<c1051a54>] ? kthread+0x74/0x80
 [<c10519e0>] ? kthread+0x0/0x80
 [<c1003036>] ? kernel_thread_helper+0x6/0x30
INFO: task flush-9:0:31449 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
flush-9:0     D ce7ddb98     0 31449      2 0x00000000
 f54be940 00000046 00000002 ce7ddb98 ccffff88 f54beaac 00000000 f6be8fc0
 00000000 c106ff64 c1070360 c3803d60 00000000 00000246 00000046 00000246
 c1051ee1 00000001 f6a1b800 f6a1b9dc ce7ddbbc 00000001 c14f4485 00001000
Call Trace:
 [<c106ff64>] ? cpu_needs_another_gp+0x14/0x20
 [<c1070360>] ? rcu_start_gp+0x130/0x1a0
 [<c1051ee1>] ? prepare_to_wait+0x21/0x70
 [<c14f4485>] ? md_make_request+0x85/0x210
 [<c1051cb0>] ? autoremove_wake_function+0x0/0x50
 [<c12d8437>] ? generic_make_request+0x267/0x3f0
 [<c10908e2>] ? mempool_alloc+0x32/0x100
 [<c12dbc88>] ? submit_bio+0x48/0xc0
 [<c10e445f>] ? bio_init+0xf/0x30
 [<c10e4e66>] ? bio_alloc_bioset+0x46/0xd0
 [<c10e3070>] ? end_buffer_async_write+0x0/0x100
 [<c10dfbe8>] ? submit_bh+0xd8/0x130
 [<c10e21f2>] ? __block_write_full_page+0x212/0x380
 [<c1158880>] ? noalloc_get_block_write+0x0/0x40
 [<c10e240a>] ? block_write_full_page_endio+0xaa/0xd0
 [<c10e3070>] ? end_buffer_async_write+0x0/0x100
 [<c1158880>] ? noalloc_get_block_write+0x0/0x40
 [<c10e243f>] ? block_write_full_page+0xf/0x20
 [<c10e3070>] ? end_buffer_async_write+0x0/0x100
 [<c1156b2b>] ? mpage_da_submit_io+0xbb/0x110
 [<c115bb67>] ? ext4_da_writepages+0x467/0xa60
 [<c115b700>] ? ext4_da_writepages+0x0/0xa60
 [<c1094c4a>] ? do_writepages+0x1a/0x30
 [<c10da4f4>] ? writeback_single_inode+0x74/0x230
 [<c10daa15>] ? writeback_sb_inodes+0xa5/0x110
 [<c10dad22>] ? writeback_inodes_wb+0xe2/0x110
 [<c10daf0f>] ? wb_writeback+0x1bf/0x210
 [<c1044b07>] ? lock_timer_base+0x27/0x60
 [<c16008e7>] ? schedule_timeout+0x107/0x240
 [<c10daff3>] ? wb_do_writeback+0x93/0x150
 [<c10db09a>] ? wb_do_writeback+0x13a/0x150
 [<c10db0e2>] ? bdi_writeback_task+0x32/0xf0
 [<c10a10e0>] ? bdi_start_fn+0x0/0xa0
 [<c10a1132>] ? bdi_start_fn+0x52/0xa0
 [<c10a10e0>] ? bdi_start_fn+0x0/0xa0
 [<c1051a54>] ? kthread+0x74/0x80
 [<c10519e0>] ? kthread+0x0/0x80
 [<c1003036>] ? kernel_thread_helper+0x6/0x30
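
The log above is long but repetitive: the same five kernel threads (events/1, md0_raid5, md0_reshape, jbd2/md0-8 and flush-9:0) are each reported twice. A rough way to condense such a log is to count the hung-task reports per task name. The sketch below assumes the `INFO: task NAME:PID blocked for more than N seconds.` line format seen here; `hung_task_summary` is a hypothetical helper name, not an existing tool.

```shell
#!/bin/sh
# Read kernel log text on stdin and print "COUNT NAME" for every task named
# in a hung-task report. The PID is taken to be the digits after the last
# colon, so a name such as "flush-9:0" keeps its own colon.
# Assumption: task names contain no spaces.
hung_task_summary() {
    sed -n 's/.*INFO: task \(.*\):[0-9][0-9]* blocked for more than .*/\1/p' |
        sort | uniq -c | awk '{ print $1, $2 }'
}

# Three report lines copied from the log above.
sample='INFO: task events/1:10 blocked for more than 120 seconds.
INFO: task flush-9:0:31449 blocked for more than 120 seconds.
INFO: task events/1:10 blocked for more than 120 seconds.'

printf '%s\n' "$sample" | hung_task_summary
# On a live system: dmesg | hung_task_summary
```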


Last edited by MikeHartman on Wed Mar 23, 2011 3:45 am; edited 3 times in total
MikeHartman
n00b


Joined: 27 Jul 2009
Posts: 29

Posted: Sun Sep 05, 2010 11:07 pm

Pulled those same messages out of /var/log/messages to get some timestamps on events, but I don't see any additional helpful info. FYI, system clock seems to be running about 12 hours ahead for some reason, but I doubt that's related.

Summary:

9/5 15:23 - RAID reshaping started ok

9/6 02:52 - Ext4 filesystem on that array is mounted (which all the instructions I've seen insist is ok to do while reshaping is in progress)

9/6 02:55 - Ext4 filesystem is remounted

9/6 02:58 - Ext4 filesystem is remounted

9/6 03:54-03:56 - All those "blocked" event messages, most relating to the md0 RAID array

9/6 04:46-current - Bunch of ssh connects and disconnects after I noticed it wasn't responding. No error messages.


I've tried unmounting the ext4 filesystem on the md0 array to see if that would help, but it keeps telling me the filesystem is in use. lsof shows that an sftp-server process is hanging onto it. I've tried "kill -9"ing the process, but it's still there, and with the same pid. I assume it's in an uninterruptible sleep state waiting on some input. I tried restarting sshd thinking that might free up the parent process or something, but no dice. At this point I'm not sure if the mounted and locked filesystem is the cause of all these other problems or just one more symptom.
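
A task in uninterruptible sleep does not act on signals, SIGKILL included, until the operation it is blocked on returns, which would be consistent with the sftp-server process surviving "kill -9". One way to confirm that is to read the State line of the process's `/proc/<pid>/status` file. The sketch below is illustrative only: `state_of` is a hypothetical helper, and the status file is passed as an argument so the function can be tried against a saved copy.

```shell
#!/bin/sh
# Print the one-letter state code (R, S, D, Z, ...) from a file in
# /proc/<pid>/status format.
state_of() {
    awk '$1 == "State:" { print $2; exit }' "$1"
}

# Demonstration on a hand-written stand-in for a status file.
# The PID 12345 is made up.
demo=$(mktemp)
printf 'Name:\tsftp-server\nState:\tD (disk sleep)\nPid:\t12345\n' > "$demo"
state_of "$demo"
rm -f "$demo"
# On a live system: state_of /proc/PID/status, with PID taken from lsof.
```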

** And just in the time it took me to type this last post something else seems to have happened, because now any command I attempt to run via ssh just sits there after I enter my password and eventually times out with a "Write failed: broken pipe" error.

Question for any RAID experts out there:

My RAID seems to be in a reasonably consistent state, since I was able to see all the files in that filesystem ok around the time everything locked up. If I get to the point where I have to restart the system, will it pick up the reshaping where it left off? Start it over? Be totally borked? Here's a description of the process I took to get to where I am:

Started with a 3-disk RAID5, based on 1.5 TB primary partitions on 1.5 TB disks.

mdadm --add /dev/md0 /dev/md1p1 // Added new 1.5 TB partition as hot spare

mdadm --grow /dev/md0 --level=6 --raid-devices=4 --backup-file=/convert_md0.bak // Switched to RAID 6, where mdadm should have detected the hot spare as a failed 2nd parity disk and rebuilt it accordingly - and that seems to be what started happening without problems

Last I checked before all the madness started it was about 12% of the way through the process.
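
While a reshape is running, /proc/mdstat carries a progress line for the array, and the percentage can be pulled out of it for logging. The sketch below assumes the usual `reshape = NN.N% (done/total)` layout of that line. The sample text is invented for illustration and is not output from this machine, and `reshape_percent` is a hypothetical helper name.

```shell
#!/bin/sh
# Read /proc/mdstat-style text on stdin and print the reshape percentage
# from every progress line that mentions a reshape.
reshape_percent() {
    sed -n 's/.*reshape *= *\([0-9.]*\)%.*/\1/p'
}

# Invented sample in the general shape of /proc/mdstat during a reshape.
sample='md0 : active raid6 md1p1[3] sdi1[0] sdj1[2] sdh1[1]
      [==>..................]  reshape = 12.0% (175781468/1464845568) finish=4523.0min speed=4750K/sec'

printf '%s\n' "$sample" | reshape_percent
# On a live system: reshape_percent < /proc/mdstat
```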
MikeHartman
n00b


Joined: 27 Jul 2009
Posts: 29

Posted: Mon Sep 06, 2010 3:25 am

Well, after much Googling around I didn't find any immediately useful info. The general vibe seems to be that resuming something like this should work though, so I finally gave up and restarted the machine. The /dev/md0 array wasn't rebuilt automatically (I wasn't expecting it to) but everything else seems to have come up ok and I can use bash/ssh/etc. again.

Trying to start /dev/md0 gave me an error message that said it couldn't restore the "critical area" - the first bit of the reshaping that the rest of the process depends on. It suggested I try starting it again with the backup file I'd used (see previous post). The purpose of that backup file is to hold the contents of this critical area so that an interrupted reshaping can be resumed. It took me some fiddling to figure out the correct syntax, but this eventually did it:

Code:
mdadm --assemble /dev/md0 --backup-file=/convert_md0.bak --force


That put me back in the same state I was in before everything locked up. The array is marked active, and "cat /proc/mdstat" shows the reshaping process still happening, at about the same position it was before and at about the same speed. I'm able to mount its filesystem, and I listed the contents of a few directories - everything seems to be there. I'll run an fsck on it when the conversion is done to be sure, but in the meantime I've unmounted the filesystem. I'm still not sure whether having it mounted contributed to the problem, although everything I've read indicates that should be ok.

We'll see how it goes. If it locks up again while the filesystem is unmounted I'm going to be very confused and concerned. Although at least now I know it's a recoverable situation.
boerKrelis
Apprentice


Joined: 01 Jul 2003
Posts: 241
Location: The Netherlands

Posted: Mon Sep 06, 2010 10:01 am

MikeHartman wrote:
Does anyone have any idea:
- Whether my reshaping is still occurring as it should or whether it's halted?


Try
Code:
cat /proc/mdstat
MikeHartman
n00b


Joined: 27 Jul 2009
Posts: 29

Posted: Mon Sep 06, 2010 5:07 pm

I couldn't at the time. That was one of the few commands I couldn't get to run via my ssh workaround.

Code:
ssh root@odin cat /proc/mdstat


would prompt me for my password and then just hang, the same as trying to do a regular login with bash. I assume that's because the md system was involved in whatever it was exactly that locked up. I'm still not sure why that would have prevented bash from running though.

At any rate, after I restarted the system and manually reassembled md0 with the backup file it seemed to pick back up ok around 14% (about where it was the last time I was able to check before the freeze). No more lockups so far.

Still continuing at the same abysmally slow speed it was before the lockup too - 4500-4900K/s. This despite the fact that the slowest of the component drives gives an hdparm buffered read speed of 60MB/s+ and so does the /dev/md0 device itself. From googling around it seems like reshaping from RAID5 to RAID6 is just much slower than a simple grow operation. The sync_speed_min and sync_speed_max are both set up around 20000, well over the speed I'm seeing, so I think it's just the fastest it's going to get, unfortunately.
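
As a rough cross-check on how long the reshape should take at that rate: the kernel log reports a total of 1464845568 blocks, and the observed speed is 4500-4900K/s. Assuming the blocks are 1 KiB units and the speed is in KiB per second (both are assumptions about the units, as is the hypothetical `eta_hours` helper), the remaining time can be estimated as follows.

```shell
#!/bin/sh
# eta_hours TOTAL_KB DONE_PERCENT SPEED_KB_PER_SEC
# Print the estimated hours left, to one decimal place.
eta_hours() {
    awk -v total="$1" -v pct="$2" -v speed="$3" \
        'BEGIN { printf "%.1f\n", total * (100 - pct) / 100 / speed / 3600 }'
}

# Total from the kernel log, 14% done, mid-range observed speed.
eta_hours 1464845568 14 4700
```

With 14% done and 4700K/s this comes to roughly 74 hours, or about three days, so the slow rate alone accounts for a multi-day reshape.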
MikeHartman
n00b


Joined: 27 Jul 2009
Posts: 29

Posted: Fri Sep 10, 2010 7:34 pm

UPDATE

The same thing is happening again. The RAID reshaping finished successfully the other day, so that doesn't seem to be a factor, although the RAID itself is still involved. All I've done with it is mount the filesystem (after fscking) and copy a bunch of stuff into it. At least 1.3TB transferred into it with no problems before the lockup started happening again.

I still seem to be in the early stages of what happened before. I can ssh into the box and see the MOTD (by the end of the previous lockup it would hang before even getting that far) but I can't enter any text. I can run alternate programs (ssh odin <command>), which is how I got the dmesg output below. I tried "ssh odin cat /proc/mdstat" and it hangs after entering the password, same as last time.

So this is the second time in less than a week that this Gentoo box has hard-locked on me, with no way to get out of it short of physically turning it off at the switch. Most definitely not normal, and this time there wasn't even any unusual system activity going on that might have explained it. The only two things that have changed are the kernel (I updated it to 2.6.35-gentoo-r4 from 2.6.24-gentoo-r8 to get reasonably fresh mdadm support) and the RAID (just added, with brand-new drives and ESATA card). I could see either one of those things being responsible, but this seems like a pretty big issue for a stable kernel and new hard drives that test out fine individually.

Does anyone have any ideas?

dmesg output:

Code:

INFO: task events/0:9 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
events/0      D f7085ea8     0     9      2 0x00000000
 f7080000 00000046 00000002 f7085ea8 c79fd4a0 f708016c 00000000 efaacc40
 ef099e68 00000000 00000001 00000092 00000000 00000000 00000000 00000246
 c1051ee1 00000001 f6981c00 f6981ddc 00000000 00000000 c14f4c5d 00000000
Call Trace:
 [<c1051ee1>] ? prepare_to_wait+0x21/0x70
 [<c14f4c5d>] ? md_write_start+0xad/0x170
 [<c1051cb0>] ? autoremove_wake_function+0x0/0x50
 [<c14eaa24>] ? make_request+0x34/0x860
 [<c102fe21>] ? finish_task_switch+0x31/0x90
 [<c160005f>] ? schedule+0x1bf/0x650
 [<c14f28ce>] ? md_submit_barrier+0x8e/0xd0
 [<c104ea57>] ? worker_thread+0x127/0x220
 [<c14f2840>] ? md_submit_barrier+0x0/0xd0
 [<c1051cb0>] ? autoremove_wake_function+0x0/0x50
 [<c104e930>] ? worker_thread+0x0/0x220
 [<c1051a54>] ? kthread+0x74/0x80
 [<c10519e0>] ? kthread+0x0/0x80
 [<c1003036>] ? kernel_thread_helper+0x6/0x30
INFO: task md0_raid6:6845 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
md0_raid6     D ee7c9eb8     0  6845      2 0x00000000
 ef093700 00000046 00000002 ee7c9eb8 c10e4383 ef09386c 00000001 f0ca3c00
 00000000 00000401 f1d01540 c7a98500 00000000 c12dbc88 c7a98548 00000246
 c1051ee1 00000001 f6981c00 f6981dd8 f6981ddc 00000000 c14eeffb 00000000
Call Trace:
 [<c10e4383>] ? __bio_add_page+0x163/0x1e0
 [<c12dbc88>] ? submit_bio+0x48/0xc0
 [<c1051ee1>] ? prepare_to_wait+0x21/0x70
 [<c14eeffb>] ? md_super_wait+0xbb/0xd0
 [<c1051cb0>] ? autoremove_wake_function+0x0/0x50
 [<c14efa9d>] ? md_update_sb+0x2cd/0x490
 [<c14f4122>] ? md_check_recovery+0x1f2/0x4d0
 [<c14eb26a>] ? raid5d+0x1a/0x490
 [<c101cc0c>] ? smp_apic_timer_interrupt+0x5c/0x90
 [<c101cc0c>] ? smp_apic_timer_interrupt+0x5c/0x90
 [<c1051ee1>] ? prepare_to_wait+0x21/0x70
 [<c14f463a>] ? md_thread+0x2a/0xe0
 [<c1051cb0>] ? autoremove_wake_function+0x0/0x50
 [<c14f4610>] ? md_thread+0x0/0xe0
 [<c1051a54>] ? kthread+0x74/0x80
 [<c10519e0>] ? kthread+0x0/0x80
 [<c1003036>] ? kernel_thread_helper+0x6/0x30
INFO: task jbd2/md0-8:12017 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
jbd2/md0-8    D c86c7e5c     0 12017      2 0x00000000
 f1ef9130 00000046 00000002 c86c7e5c 00000086 f1ef929c 00000000 efaacc40
 00000003 f6b81020 efaa2c00 f6981c00 00000000 c14ebf55 00000292 c14e52d0
 efaa2d0c 00000001 c3808280 f1ef9130 c86c7ea8 c35ee8a4 c16007a1 c86c7ea0
Call Trace:
 [<c14ebf55>] ? md_wakeup_thread+0x25/0x30
 [<c14e52d0>] ? raid5_unplug_device+0x60/0xc0
 [<c16007a1>] ? io_schedule+0x31/0x50
 [<c10e0cd5>] ? sync_buffer+0x35/0x40
 [<c1600bd2>] ? __wait_on_bit+0x42/0x70
 [<c10e0ca0>] ? sync_buffer+0x0/0x40
 [<c1051d00>] ? wake_bit_function+0x0/0x60
 [<c10e0ca0>] ? sync_buffer+0x0/0x40
 [<c1600c72>] ? out_of_line_wait_on_bit+0x72/0x90
 [<c1051d00>] ? wake_bit_function+0x0/0x60
 [<c10e0c26>] ? __wait_on_buffer+0x26/0x30
 [<c118cc87>] ? jbd2_journal_commit_transaction+0x737/0x12e0
 [<c1192b63>] ? kjournald2+0x93/0x1d0
 [<c1051cb0>] ? autoremove_wake_function+0x0/0x50
 [<c1192ad0>] ? kjournald2+0x0/0x1d0
 [<c1051a54>] ? kthread+0x74/0x80
 [<c10519e0>] ? kthread+0x0/0x80
 [<c1003036>] ? kernel_thread_helper+0x6/0x30
INFO: task flush-9:0:3296 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
flush-9:0     D f1c454a0     0  3296      2 0x00000000
 c79fd4a0 00000046 f1c454cc f1c454a0 c1875de0 c79fd60c 00000000 efaacc40
 00000000 f1c454cc c79fd4a0 c10360cc c1931280 c3808280 00000002 00000246
 c1051ee1 00000000 f6981c00 f6981ddc e5735b90 00000001 c14f4485 00001000
Call Trace:
 [<c10360cc>] ? check_preempt_wakeup+0x7c/0xe0
 [<c1051ee1>] ? prepare_to_wait+0x21/0x70
 [<c14f4485>] ? md_make_request+0x85/0x210
 [<c1051cb0>] ? autoremove_wake_function+0x0/0x50
 [<c12d8437>] ? generic_make_request+0x267/0x3f0
 [<c105fb4a>] ? tick_program_event+0x2a/0x40
 [<c10908e2>] ? mempool_alloc+0x32/0x100
 [<c12dbc88>] ? submit_bio+0x48/0xc0
 [<c10e445f>] ? bio_init+0xf/0x30
 [<c10e4e66>] ? bio_alloc_bioset+0x46/0xd0
 [<c10e3070>] ? end_buffer_async_write+0x0/0x100
 [<c10dfbe8>] ? submit_bh+0xd8/0x130
 [<c10e21f2>] ? __block_write_full_page+0x212/0x380
 [<c1158880>] ? noalloc_get_block_write+0x0/0x40
 [<c10e240a>] ? block_write_full_page_endio+0xaa/0xd0
 [<c10e3070>] ? end_buffer_async_write+0x0/0x100
 [<c1158880>] ? noalloc_get_block_write+0x0/0x40
 [<c10e243f>] ? block_write_full_page+0xf/0x20
 [<c10e3070>] ? end_buffer_async_write+0x0/0x100
 [<c1156b2b>] ? mpage_da_submit_io+0xbb/0x110
 [<c115857d>] ? mpage_add_bh_to_extent+0xed/0x120
 [<c115bf34>] ? ext4_da_writepages+0x834/0xa60
 [<c115b700>] ? ext4_da_writepages+0x0/0xa60
 [<c1094c4a>] ? do_writepages+0x1a/0x30
 [<c10da4f4>] ? writeback_single_inode+0x74/0x230
 [<c10daa15>] ? writeback_sb_inodes+0xa5/0x110
 [<c10dad22>] ? writeback_inodes_wb+0xe2/0x110
 [<c10daf0f>] ? wb_writeback+0x1bf/0x210
 [<c10dafe3>] ? wb_do_writeback+0x83/0x150
 [<c10db0e2>] ? bdi_writeback_task+0x32/0xf0
 [<c10a10e0>] ? bdi_start_fn+0x0/0xa0
 [<c10a1132>] ? bdi_start_fn+0x52/0xa0
 [<c10a10e0>] ? bdi_start_fn+0x0/0xa0
 [<c1051a54>] ? kthread+0x74/0x80
 [<c10519e0>] ? kthread+0x0/0x80
 [<c1003036>] ? kernel_thread_helper+0x6/0x30
INFO: task events/0:9 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
events/0      D f7085ea8     0     9      2 0x00000000
 f7080000 00000046 00000002 f7085ea8 c79fd4a0 f708016c 00000000 efaacc40
 ef099e68 00000000 00000001 00000092 00000000 00000000 00000000 00000246
 c1051ee1 00000001 f6981c00 f6981ddc 00000000 00000000 c14f4c5d 00000000
Call Trace:
 [<c1051ee1>] ? prepare_to_wait+0x21/0x70
 [<c14f4c5d>] ? md_write_start+0xad/0x170
 [<c1051cb0>] ? autoremove_wake_function+0x0/0x50
 [<c14eaa24>] ? make_request+0x34/0x860
 [<c102fe21>] ? finish_task_switch+0x31/0x90
 [<c160005f>] ? schedule+0x1bf/0x650
 [<c14f28ce>] ? md_submit_barrier+0x8e/0xd0
 [<c104ea57>] ? worker_thread+0x127/0x220
 [<c14f2840>] ? md_submit_barrier+0x0/0xd0
 [<c1051cb0>] ? autoremove_wake_function+0x0/0x50
 [<c104e930>] ? worker_thread+0x0/0x220
 [<c1051a54>] ? kthread+0x74/0x80
 [<c10519e0>] ? kthread+0x0/0x80
 [<c1003036>] ? kernel_thread_helper+0x6/0x30
INFO: task md0_raid6:6845 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
md0_raid6     D ee7c9eb8     0  6845      2 0x00000000
 ef093700 00000046 00000002 ee7c9eb8 c10e4383 ef09386c 00000001 f0ca3c00
 00000000 00000401 f1d01540 c7a98500 00000000 c12dbc88 c7a98548 00000246
 c1051ee1 00000001 f6981c00 f6981dd8 f6981ddc 00000000 c14eeffb 00000000
Call Trace:
 [<c10e4383>] ? __bio_add_page+0x163/0x1e0
 [<c12dbc88>] ? submit_bio+0x48/0xc0
 [<c1051ee1>] ? prepare_to_wait+0x21/0x70
 [<c14eeffb>] ? md_super_wait+0xbb/0xd0
 [<c1051cb0>] ? autoremove_wake_function+0x0/0x50
 [<c14efa9d>] ? md_update_sb+0x2cd/0x490
 [<c14f4122>] ? md_check_recovery+0x1f2/0x4d0
 [<c14eb26a>] ? raid5d+0x1a/0x490
 [<c101cc0c>] ? smp_apic_timer_interrupt+0x5c/0x90
 [<c101cc0c>] ? smp_apic_timer_interrupt+0x5c/0x90
 [<c1051ee1>] ? prepare_to_wait+0x21/0x70
 [<c14f463a>] ? md_thread+0x2a/0xe0
 [<c1051cb0>] ? autoremove_wake_function+0x0/0x50
 [<c14f4610>] ? md_thread+0x0/0xe0
 [<c1051a54>] ? kthread+0x74/0x80
 [<c10519e0>] ? kthread+0x0/0x80
 [<c1003036>] ? kernel_thread_helper+0x6/0x30
INFO: task jbd2/md0-8:12017 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
jbd2/md0-8    D c86c7e5c     0 12017      2 0x00000000
 f1ef9130 00000046 00000002 c86c7e5c 00000086 f1ef929c 00000000 efaacc40
 00000003 f6b81020 efaa2c00 f6981c00 00000000 c14ebf55 00000292 c14e52d0
 efaa2d0c 00000001 c3808280 f1ef9130 c86c7ea8 c35ee8a4 c16007a1 c86c7ea0
Call Trace:
 [<c14ebf55>] ? md_wakeup_thread+0x25/0x30
 [<c14e52d0>] ? raid5_unplug_device+0x60/0xc0
 [<c16007a1>] ? io_schedule+0x31/0x50
 [<c10e0cd5>] ? sync_buffer+0x35/0x40
 [<c1600bd2>] ? __wait_on_bit+0x42/0x70
 [<c10e0ca0>] ? sync_buffer+0x0/0x40
 [<c1051d00>] ? wake_bit_function+0x0/0x60
 [<c10e0ca0>] ? sync_buffer+0x0/0x40
 [<c1600c72>] ? out_of_line_wait_on_bit+0x72/0x90
 [<c1051d00>] ? wake_bit_function+0x0/0x60
 [<c10e0c26>] ? __wait_on_buffer+0x26/0x30
 [<c118cc87>] ? jbd2_journal_commit_transaction+0x737/0x12e0
 [<c1192b63>] ? kjournald2+0x93/0x1d0
 [<c1051cb0>] ? autoremove_wake_function+0x0/0x50
 [<c1192ad0>] ? kjournald2+0x0/0x1d0
 [<c1051a54>] ? kthread+0x74/0x80
 [<c10519e0>] ? kthread+0x0/0x80
 [<c1003036>] ? kernel_thread_helper+0x6/0x30
INFO: task flush-9:0:3296 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
flush-9:0     D f1c454a0     0  3296      2 0x00000000
 c79fd4a0 00000046 f1c454cc f1c454a0 c1875de0 c79fd60c 00000000 efaacc40
 00000000 f1c454cc c79fd4a0 c10360cc c1931280 c3808280 00000002 00000246
 c1051ee1 00000000 f6981c00 f6981ddc e5735b90 00000001 c14f4485 00001000
Call Trace:
 [<c10360cc>] ? check_preempt_wakeup+0x7c/0xe0
 [<c1051ee1>] ? prepare_to_wait+0x21/0x70
 [<c14f4485>] ? md_make_request+0x85/0x210
 [<c1051cb0>] ? autoremove_wake_function+0x0/0x50
 [<c12d8437>] ? generic_make_request+0x267/0x3f0
 [<c105fb4a>] ? tick_program_event+0x2a/0x40
 [<c10908e2>] ? mempool_alloc+0x32/0x100
 [<c12dbc88>] ? submit_bio+0x48/0xc0
 [<c10e445f>] ? bio_init+0xf/0x30
 [<c10e4e66>] ? bio_alloc_bioset+0x46/0xd0
 [<c10e3070>] ? end_buffer_async_write+0x0/0x100
 [<c10dfbe8>] ? submit_bh+0xd8/0x130
 [<c10e21f2>] ? __block_write_full_page+0x212/0x380
 [<c1158880>] ? noalloc_get_block_write+0x0/0x40
 [<c10e240a>] ? block_write_full_page_endio+0xaa/0xd0
 [<c10e3070>] ? end_buffer_async_write+0x0/0x100
 [<c1158880>] ? noalloc_get_block_write+0x0/0x40
 [<c10e243f>] ? block_write_full_page+0xf/0x20
 [<c10e3070>] ? end_buffer_async_write+0x0/0x100
 [<c1156b2b>] ? mpage_da_submit_io+0xbb/0x110
 [<c115857d>] ? mpage_add_bh_to_extent+0xed/0x120
 [<c115bf34>] ? ext4_da_writepages+0x834/0xa60
 [<c115b700>] ? ext4_da_writepages+0x0/0xa60
 [<c1094c4a>] ? do_writepages+0x1a/0x30
 [<c10da4f4>] ? writeback_single_inode+0x74/0x230
 [<c10daa15>] ? writeback_sb_inodes+0xa5/0x110
 [<c10dad22>] ? writeback_inodes_wb+0xe2/0x110
 [<c10daf0f>] ? wb_writeback+0x1bf/0x210
 [<c10dafe3>] ? wb_do_writeback+0x83/0x150
 [<c10db0e2>] ? bdi_writeback_task+0x32/0xf0
 [<c10a10e0>] ? bdi_start_fn+0x0/0xa0
 [<c10a1132>] ? bdi_start_fn+0x52/0xa0
 [<c10a10e0>] ? bdi_start_fn+0x0/0xa0
 [<c1051a54>] ? kthread+0x74/0x80
 [<c10519e0>] ? kthread+0x0/0x80
 [<c1003036>] ? kernel_thread_helper+0x6/0x30
INFO: task events/0:9 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
events/0      D f7085ea8     0     9      2 0x00000000
 f7080000 00000046 00000002 f7085ea8 c79fd4a0 f708016c 00000000 efaacc40
 ef099e68 00000000 00000001 00000092 00000000 00000000 00000000 00000246
 c1051ee1 00000001 f6981c00 f6981ddc 00000000 00000000 c14f4c5d 00000000
Call Trace:
 [<c1051ee1>] ? prepare_to_wait+0x21/0x70
 [<c14f4c5d>] ? md_write_start+0xad/0x170
 [<c1051cb0>] ? autoremove_wake_function+0x0/0x50
 [<c14eaa24>] ? make_request+0x34/0x860
 [<c102fe21>] ? finish_task_switch+0x31/0x90
 [<c160005f>] ? schedule+0x1bf/0x650
 [<c14f28ce>] ? md_submit_barrier+0x8e/0xd0
 [<c104ea57>] ? worker_thread+0x127/0x220
 [<c14f2840>] ? md_submit_barrier+0x0/0xd0
 [<c1051cb0>] ? autoremove_wake_function+0x0/0x50
 [<c104e930>] ? worker_thread+0x0/0x220
 [<c1051a54>] ? kthread+0x74/0x80
 [<c10519e0>] ? kthread+0x0/0x80
 [<c1003036>] ? kernel_thread_helper+0x6/0x30
INFO: task md0_raid6:6845 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
md0_raid6     D ee7c9eb8     0  6845      2 0x00000000
 ef093700 00000046 00000002 ee7c9eb8 c10e4383 ef09386c 00000001 f0ca3c00
 00000000 00000401 f1d01540 c7a98500 00000000 c12dbc88 c7a98548 00000246
 c1051ee1 00000001 f6981c00 f6981dd8 f6981ddc 00000000 c14eeffb 00000000
Call Trace:
 [<c10e4383>] ? __bio_add_page+0x163/0x1e0
 [<c12dbc88>] ? submit_bio+0x48/0xc0
 [<c1051ee1>] ? prepare_to_wait+0x21/0x70
 [<c14eeffb>] ? md_super_wait+0xbb/0xd0
 [<c1051cb0>] ? autoremove_wake_function+0x0/0x50
 [<c14efa9d>] ? md_update_sb+0x2cd/0x490
 [<c14f4122>] ? md_check_recovery+0x1f2/0x4d0
 [<c14eb26a>] ? raid5d+0x1a/0x490
 [<c101cc0c>] ? smp_apic_timer_interrupt+0x5c/0x90
 [<c101cc0c>] ? smp_apic_timer_interrupt+0x5c/0x90
 [<c1051ee1>] ? prepare_to_wait+0x21/0x70
 [<c14f463a>] ? md_thread+0x2a/0xe0
 [<c1051cb0>] ? autoremove_wake_function+0x0/0x50
 [<c14f4610>] ? md_thread+0x0/0xe0
 [<c1051a54>] ? kthread+0x74/0x80
 [<c10519e0>] ? kthread+0x0/0x80
 [<c1003036>] ? kernel_thread_helper+0x6/0x30
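The same few tasks (flush-9:0, events/0, md0_raid6, jbd2/md0-8) get reported over and over in this part of the log, so a small filter can make a dump like this easier to read. A rough sketch, assuming the kernel log text arrives on stdin; summarize_hung_tasks is a name made up for this example, not an existing tool:

```shell
#!/bin/sh
# Hypothetical helper: read kernel log text on stdin and print each task
# reported by the hung-task detector once, with the number of reports.
# The task name may itself contain colons (e.g. "flush-9:0"), so the
# pattern treats only the final ":<digits>" before " blocked" as the PID.
summarize_hung_tasks() {
    sed -n 's/^.*INFO: task \(.*\):\([0-9][0-9]*\) blocked for more than .*$/\1 (pid \2)/p' |
        sort | uniq -c | sort -rn
}

# Typical use (reading the kernel ring buffer may need root on some systems):
#   dmesg | summarize_hung_tasks
```

This only counts the "blocked for more than" lines; the call traces themselves still have to be read by hand to see where each task is stuck.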
MikeHartman
n00b


Joined: 27 Jul 2009
Posts: 29

Posted: Sat Sep 11, 2010 10:36 pm

I've started to pursue this issue on the linux-raid mailing list as well, so for anyone who finds this and is interested:

http://marc.info/?t=128422883700001&r=1&w=2
dmpogo
Advocate


Joined: 02 Sep 2004
Posts: 2522
Location: Canada

Posted: Sun Sep 12, 2010 12:13 am

Once the reshape had finished, did you try rebooting?
MikeHartman
n00b


Joined: 27 Jul 2009
Posts: 29

Posted: Sun Sep 12, 2010 12:22 am

I was in the middle of reshaping, and the system locked.

I hard rebooted the machine, brought everything back up manually and was able to resume the reshape from the backup file. Reshaping completed fine (although it took a few days).

A day or so after that it locked again, and this time I was only copying data onto the RAID's partition. No reshaping or resyncing or anything going on. I had to hard reboot again.

No more lockups since then, but not much time has passed compared with the gap between the first two either.

So to answer your question, it's been rebooted once since the reshaping operation completed, although that too was the result of a lockup.

I don't think rebooting should have much effect anyway - this isn't my system drive we're talking about. It's a completely independent storage array with nothing but media files on it.
MikeHartman
n00b


Joined: 27 Jul 2009
Posts: 29

Posted: Wed Mar 23, 2011 3:44 am

Just to follow up on this in case someone lands here from a search and doesn't want to dig through the even longer linked thread on the raid mailing list:

The problem turned out to be the combination of RAID and the barrier option on ext4. It seems to trigger some kind of race condition under heavy writes. I've been mounting that filesystem with "-o barrier=0" for months with no further problems. The consensus seems to be that barriers are nice to have but not critical: they were only added relatively recently and aren't even available on all hardware. So I feel relatively safe leaving them off, at least until the bug is fixed.
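For anyone who wants to try the same workaround, here is a rough sketch of what it might look like. /mnt/storage is a made-up mount point (only /dev/md0 shows up in the traces above), and barriers_disabled and report_ext4_barriers are helper names invented for this example. ext4 accepts both barrier=0 and nobarrier for turning barriers off, so the check looks for either spelling:

```shell
#!/bin/sh
# Sketch only: /mnt/storage is a hypothetical mount point.
#
# One-off, on an already mounted filesystem (needs root):
#   mount -o remount,barrier=0 /mnt/storage
# Persistently, as an /etc/fstab line:
#   /dev/md0  /mnt/storage  ext4  defaults,barrier=0  0 2

# Hypothetical helper: succeed if a comma-separated mount-option string
# explicitly turns ext4 write barriers off (barrier=0 or nobarrier).
barriers_disabled() {
    case ",$1," in
        *,barrier=0,*|*,nobarrier,*) return 0 ;;
        *) return 1 ;;
    esac
}

# Report the barrier setting of every ext4 filesystem currently mounted.
report_ext4_barriers() {
    [ -r /proc/mounts ] || return 0
    while read -r dev mnt fstype opts rest; do
        [ "$fstype" = ext4 ] || continue
        if barriers_disabled "$opts"; then
            echo "$mnt ($dev): barriers disabled"
        else
            echo "$mnt ($dev): barriers not explicitly disabled"
        fi
    done < /proc/mounts
}

report_ext4_barriers
```

Keep in mind that with barriers off, a power failure can leave the filesystem in worse shape than it would otherwise be, so this is a trade-off rather than a fix.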
MageSlayer
Apprentice


Joined: 26 Jul 2007
Posts: 250
Location: Ukraine

Posted: Thu Jan 30, 2014 9:42 am

It looks like the kernel freezes related to ext4 + RAID5 with barriers enabled are still around (kernel 3.12.9).
Using the mount option "-o barrier=0" definitely helps.