Gentoo Forums
[SOLVED] Suspend/Resume broke after setting up RAID-1
nburgin
n00b
Joined: 12 Sep 2017
Posts: 6

Posted: Tue Sep 12, 2017 8:16 pm    Post subject: [SOLVED] Suspend/Resume broke after setting up RAID-1

I recently converted my /home partition to RAID-1 (software RAID set up with mdadm) to protect its contents against drive failure. It mostly works fine, but ever since I set it up, the system hangs whenever I resume from suspend-to-RAM. The display comes up, but it is frozen on the exact frame where I was clicking the "suspend" button in Xfce's logout dialog. The whole system is frozen, not just X: switching to a text console with Ctrl+Alt+F1 does not work, and neither does SysRq (which I'm pretty sure is enabled in my kernel build).

When this happens I have no choice but to hard-reset, and when I boot up again the RAID spends the next 90 minutes resyncing, which degrades performance and probably puts extra wear on the disks until it finishes.
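
To see whether an array is still resyncing after such a reboot, /proc/mdstat can be read directly. The following is a minimal sketch in Python, not part of mdadm: the helper name resyncing_arrays is made up for illustration, and it assumes the usual "resync = N%" or "recovery = N%" progress line that the md driver prints while an array is rebuilding.

```python
import os
import re


# Hypothetical helper, not part of mdadm: report arrays that /proc/mdstat
# shows as resyncing or recovering, with the percentage completed so far.
def resyncing_arrays(mdstat_text):
    progress = {}
    current = None
    for line in mdstat_text.splitlines():
        # Array headers start at column 0, e.g. "md126 : active raid1 ..."
        header = re.match(r"^(md\w+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # Progress lines look like "[>....]  resync =  0.4% (...) finish=..."
        status = re.search(r"(resync|recovery)\s*=\s*([\d.]+)%", line)
        if status and current:
            progress[current] = float(status.group(2))
    return progress


if __name__ == "__main__":
    # Only meaningful on a Linux machine with the md driver loaded.
    if os.path.exists("/proc/mdstat"):
        with open("/proc/mdstat") as f:
            print(resyncing_arrays(f.read()))
```

An empty result means no array reports a resync or recovery in progress. Other md operations, such as a scheduled check or a reshape, are deliberately ignored in this sketch.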

When I googled this, the most similar reports I could find had replies saying it was probably not the RAID and suggesting a check of the graphics driver, and then the discussion trailed off unresolved. I know my graphics driver is not the problem: I'm on RadeonSI, and this problem did not appear until I added the RAID, with everything else unchanged.

I don't know where to start looking, and I don't like the idea of working it out by trial and error (e.g. trying different kernels), since each failed attempt will cost a full resync that occupies both disks for 90 minutes. So I decided to ask here whether any of you know what might be causing this. If I can't figure it out without excessive experimentation, I may have to go back to a regular partition and use rsync or some such to keep it backed up on the extra drive.

Note that I also have a RAID-0 for some partitions whose contents I don't value as highly, but I don't think it's to blame: I first noticed this issue (before I had identified it as RAID-related) before I even created the RAID-0, and unlike the RAID-1 it doesn't misbehave after rebooting from the hang.

Here is a possibly excessive amount of potentially useful info:


  • Kernel is built from Gentoo-Sources 4.12.8
  • Using the OpenRC init system. I'm not sure whether init is related to suspend/resume, but I know that on systemd-based systems it's done with the systemctl command.
  • Output of cat /proc/mdstat:
    Code:
    Personalities : [raid0] [raid1]
    md126 : active raid1 sda6[2] sdb1[1]
          838729728 blocks super 1.2 [2/2] [UU]
          bitmap: 1/7 pages [4KB], 65536KB chunk

    md127 : active raid0 sda8[1] sdb4[0]
          36143104 blocks super 1.2 512k chunks
         
    unused devices: <none>

  • output of lsblk:
    Code:
    NAME          MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
    sdb             8:16   0 931.5G  0 disk 
    ├─sdb4          8:20   0  17.3G  0 part 
    │ └─md127       9:127  0  34.5G  0 raid0
    │   ├─md127p8 259:4    0    23G  0 md    /hddtemp
    │   ├─md127p6 259:2    0     5G  0 md    /home/neil/.cache
    │   ├─md127p7 259:3    0     5G  0 md    /usr/src
    │   ├─md127p5 259:1    0   1.5G  0 md    /usr/portage
    │   └─md127p1 259:0    0     1K  0 md   
    ├─sdb2          8:18   0     8G  0 part  [SWAP]
    ├─sdb3          8:19   0  99.4G  0 part 
    └─sdb1          8:17   0   800G  0 part 
      └─md126       9:126  0 799.9G  0 raid1 /home
    sda             8:0    0 931.5G  0 disk 
    ├─sda7          8:7    0     8G  0 part  [SWAP]
    ├─sda5          8:5    0    16G  0 part  /
    ├─sda3          8:3    0     1K  0 part 
    ├─sda8          8:8    0  17.3G  0 part 
    │ └─md127       9:127  0  34.5G  0 raid0
    │   ├─md127p8 259:4    0    23G  0 md    /hddtemp
    │   ├─md127p6 259:2    0     5G  0 md    /home/neil/.cache
    │   ├─md127p7 259:3    0     5G  0 md    /usr/src
    │   ├─md127p5 259:1    0   1.5G  0 md    /usr/portage
    │   └─md127p1 259:0    0     1K  0 md   
    └─sda6          8:6    0   800G  0 part 
      └─md126       9:126  0 799.9G  0 raid1 /home

  • output of fdisk -l:
    Code:
    Disk /dev/sdb: 931.5 GiB, 1000204886016 bytes, 1953525168 sectors
    Units: sectors of 1 * 512 = 512 bytes
    Sector size (logical/physical): 512 bytes / 4096 bytes
    I/O size (minimum/optimal): 4096 bytes / 4096 bytes
    Disklabel type: dos
    Disk identifier: 0xb5510a07

    Device     Boot      Start        End    Sectors  Size Id Type
    /dev/sdb1         67108864 1744830463 1677721600  800G da Non-FS data
    /dev/sdb2           262144   17039359   16777216    8G 82 Linux swap / Solaris
    /dev/sdb3       1745035264 1953523711  208488448 99.4G 83 Linux
    /dev/sdb4         17039360   53215231   36175872 17.3G da Non-FS data

    Partition table entries are not in disk order.


    Disk /dev/sda: 931.5 GiB, 1000204886016 bytes, 1953525168 sectors
    Units: sectors of 1 * 512 = 512 bytes
    Sector size (logical/physical): 512 bytes / 4096 bytes
    I/O size (minimum/optimal): 4096 bytes / 4096 bytes
    Disklabel type: dos
    Disk identifier: 0x9bfd103e

    Device     Boot     Start        End    Sectors  Size Id Type
    /dev/sda3         1050624 1953525167 1952474544  931G  5 Extended
    /dev/sda5        54532096   88086527   33554432   16G 83 Linux
    /dev/sda6       188751872 1866473471 1677721600  800G da Non-FS data
    /dev/sda7         1052672   17829887   16777216    8G 82 Linux swap / Solaris
    /dev/sda8        17831936   54007807   36175872 17.3G da Non-FS data

    Partition table entries are not in disk order.


    Disk /dev/md127: 34.5 GiB, 37010538496 bytes, 72286208 sectors
    Units: sectors of 1 * 512 = 512 bytes
    Sector size (logical/physical): 512 bytes / 4096 bytes
    I/O size (minimum/optimal): 524288 bytes / 1048576 bytes
    Disklabel type: dos
    Disk identifier: 0x68b1bdf4

    Device       Boot    Start      End  Sectors  Size Id Type
    /dev/md127p1          2048 72286207 72284160 34.5G  5 Extended
    /dev/md127p5          4096  3149823  3145728  1.5G 83 Linux
    /dev/md127p6       3151872 13637631 10485760    5G 83 Linux
    /dev/md127p7      13639680 24125439 10485760    5G 83 Linux
    /dev/md127p8      24127488 72286207 48158720   23G 83 Linux


    Disk /dev/md126: 799.9 GiB, 858859241472 bytes, 1677459456 sectors
    Units: sectors of 1 * 512 = 512 bytes
    Sector size (logical/physical): 512 bytes / 4096 bytes
    I/O size (minimum/optimal): 4096 bytes / 4096 bytes

  • output of smartctl -i:
    Code:
    smartctl -i /dev/sda
    smartctl 6.4 2015-06-04 r4109 [x86_64-linux-4.12.8-gentoo-NUCLEON] (local build)
    Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org

    === START OF INFORMATION SECTION ===
    Model Family:     Toshiba 3.5" DT01ACA... Desktop HDD
    Device Model:     TOSHIBA DT01ACA100
    Serial Number:    8324JHBNS
    LU WWN Device Id: 5 000039 ff6ec55eb
    Firmware Version: MS2OA750
    User Capacity:    1,000,204,886,016 bytes [1.00 TB]
    Sector Sizes:     512 bytes logical, 4096 bytes physical
    Rotation Rate:    7200 rpm
    Form Factor:      3.5 inches
    Device is:        In smartctl database [for details use: -P show]
    ATA Version is:   ATA8-ACS T13/1699-D revision 4
    SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
    Local Time is:    Tue Sep 12 16:06:46 2017 EDT
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled

    new-host-192-168-1-19 /home/neil # smartctl -i /dev/sdb
    smartctl 6.4 2015-06-04 r4109 [x86_64-linux-4.12.8-gentoo-NUCLEON] (local build)
    Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org

    === START OF INFORMATION SECTION ===
    Model Family:     Western Digital Blue
    Device Model:     WDC WD10EZEX-00MFCA0
    Serial Number:    WD-WCC6Y5XLHLY0
    LU WWN Device Id: 5 0014ee 20efdc5fd
    Firmware Version: 01.01A01
    User Capacity:    1,000,204,886,016 bytes [1.00 TB]
    Sector Sizes:     512 bytes logical, 4096 bytes physical
    Rotation Rate:    7200 rpm
    Form Factor:      3.5 inches
    Device is:        In smartctl database [for details use: -P show]
    ATA Version is:   ACS-3 T13/2161-D revision 3b
    SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
    Local Time is:    Tue Sep 12 16:06:47 2017 EDT
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled

  • output of dmesg | grep md:
    Code:
    [    1.133208] amd_uncore: AMD NB counters detected
    [    5.567878] EDAC amd64: Node 0: DRAM ECC disabled.
    [    5.567880] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
    [    5.606554] EDAC amd64: Node 0: DRAM ECC disabled.
    [    5.606556] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
    [    5.643065] EDAC amd64: Node 0: DRAM ECC disabled.
    [    5.643068] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
    [    5.689852] EDAC amd64: Node 0: DRAM ECC disabled.
    [    5.689854] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
    [    6.472217] md: array md127 already has disks!
    [    6.502357] md127: detected capacity change from 0 to 37010538496
    [    6.620210]  md127: p1 < p5 p6 p7 p8 >
    [    7.432328] md: array md126 already has disks!
    [    7.445316] md/raid1:md126: active with 2 out of 2 mirrors
    [    7.507787] md126: detected capacity change from 0 to 858859241472
    [    8.674519] XFS (md126): Mounting V5 Filesystem
    [    8.982885] XFS (md126): Ending clean mount
    [    9.180612] XFS (md127p5): Mounting V4 Filesystem
    [    9.249439] XFS (md127p5): Ending clean mount
    [    9.249619] XFS (md127p8): nobarrier option is deprecated, ignoring.
    [    9.264266] XFS (md127p8): Mounting V4 Filesystem
    [    9.472177] XFS (md127p8): Ending clean mount



Last edited by nburgin on Fri Sep 15, 2017 10:00 pm; edited 1 time in total
nburgin

Posted: Fri Sep 15, 2017 7:43 am

I tried switching from mdadm RAID1 to LVM RAID1. It still hangs on resume, but it no longer goes through the lengthy resync every time this happens. While it's normally hard to put a good spin on less rigorous error recovery, I'm honestly glad, because the resync doesn't seem to have been necessary.

Since the hang happens with both MD and LVM RAID, I assume the fault lies in the underlying kernel RAID support rather than in the userspace tools. I don't know what to do about that, though, as I'm not exactly an expert at kernel debugging.

Also, after getting rid of the MD RAID1 but before setting up the LVM RAID1 (while it was still a plain linear LVM volume), I tried suspending and it worked correctly. The MD RAID0 was running at the time and didn't seem to interfere.
nburgin

Posted: Fri Sep 15, 2017 9:33 pm

Some more intensive googling turned up something I missed the first time: apparently this is a buggy interaction between RAID, suspend/resume, and the new blk-mq I/O subsystem. I had enabled blk-mq because the mainlined version of BFQ requires it, but apparently it is new enough that a few bugs remained unresolved even after it was included in the mainline kernel.
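
For anyone wanting to check whether their disks are actually on the blk-mq path, sysfs gives some hints. This is a rough sketch under stated assumptions: the function names are hypothetical, the bracketed-name format of /sys/block/<dev>/queue/scheduler is what kernels of this era print, and the presence of an "mq" directory under /sys/block/<dev>/ is taken as a sign that the queue uses blk-mq. Treat the result as a hint, not proof.

```python
import os


def parse_scheduler(sysfs_text):
    """Split e.g. 'noop deadline [cfq]' into (active, available).

    The active scheduler is the name printed in square brackets.
    """
    active = None
    available = []
    for name in sysfs_text.split():
        if name.startswith("[") and name.endswith("]"):
            name = name[1:-1]
            active = name
        available.append(name)
    return active, available


def uses_blk_mq(dev):
    """Heuristic (assumption): blk-mq queues expose /sys/block/<dev>/mq."""
    return os.path.isdir(os.path.join("/sys/block", dev, "mq"))


if __name__ == "__main__":
    dev = "sda"  # example device name; adjust for your system
    path = os.path.join("/sys/block", dev, "queue", "scheduler")
    if os.path.exists(path):
        with open(path) as f:
            print(dev, parse_scheduler(f.read()), "blk-mq:", uses_blk_mq(dev))
```

On the legacy single-queue path the list typically contains names such as noop, deadline and cfq, while a blk-mq queue typically lists none, mq-deadline, kyber or the mainline bfq. The name bfq alone is ambiguous, since the older out-of-tree patch registered a scheduler under the same name.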

Someone else reported this bug, and Ming Lei prepared a patch around the end of August, but I assume the patch hasn't been upstreamed yet, since the problem still exists as of 4.13.2.

I am currently recompiling my kernel to use the old single-queue block subsystem and CFQ, and will try again. If the mailing list thread I found is as relevant to my problem as I think it is, that will probably fix it. Maybe I'll try blk-mq again when 4.14 comes out.
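
Before rebuilding, it can help to confirm which block-layer options a kernel config actually sets. The sketch below is only an illustration: the helper name read_options is made up, and the option list (CONFIG_SCSI_MQ_DEFAULT, CONFIG_DM_MQ_DEFAULT, CONFIG_IOSCHED_CFQ, CONFIG_IOSCHED_BFQ) is my assumption about what is relevant on 4.12/4.13-era kernels, so check it against the Kconfig help of the kernel being built.

```python
import os

# Options assumed relevant on 4.12/4.13-era kernels; verify against your
# own kernel's Kconfig help before relying on this list.
WATCHED = (
    "CONFIG_SCSI_MQ_DEFAULT",
    "CONFIG_DM_MQ_DEFAULT",
    "CONFIG_IOSCHED_CFQ",
    "CONFIG_IOSCHED_BFQ",
)


def read_options(config_text, names=WATCHED):
    """Return {option: value}; options that are unset or absent map to 'n'."""
    values = {name: "n" for name in names}
    for line in config_text.splitlines():
        line = line.strip()
        # "# CONFIG_FOO is not set" and other comments count as unset.
        if line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key in values:
            values[key] = value
    return values


if __name__ == "__main__":
    path = "/usr/src/linux/.config"  # usual Gentoo location; adjust as needed
    if os.path.exists(path):
        with open(path) as f:
            for option, value in read_options(f.read()).items():
                print(option, "=", value)
```

As far as I know, kernels of this era also accept scsi_mod.use_blk_mq=0 on the kernel command line to select the legacy path for SCSI/SATA disks without a rebuild, but I have not verified that on this setup.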
nburgin

Posted: Fri Sep 15, 2017 10:00 pm

Yep, that fixed it.

I guess I'll mark this as solved then.

I don't really blame y'all for not offering any advice. I was wondering if anyone who knew better than I did could figure out if I misconfigured something, but dumping a bunch of output from various utilities is, in hindsight, probably not the best way to ask a question. And since it turned out to be an obscure bug rather than a misconfiguration, I don't think there's much advice you could have given that would have helped anyway.

EDIT: Does anyone know if the original (non-multiqueue) BFQ patch is still being maintained, now that the new blk-mq version has been mainlined? I think it used to be included as one of the Genpatches for Gentoo-Sources, but stopped being included once the blk-mq version of BFQ was mainlined.

EDIT 2: Never mind, I'll start a new thread instead since probably no one reads threads that are already marked solved.

[Moderator note: for the benefit of future readers: the new thread is Mainline BFQ versus the old external patch. -Hu]
All times are GMT