Gentoo Forums
SATA HW errors with AMD SB950 controller [workaround]
Zucca
Veteran
Joined: 14 Jun 2007
Posts: 1518
Location: KUUSANKOSKI, Finland

Posted: Fri Apr 07, 2017 10:29 pm    Post subject: SATA HW errors with AMD SB950 controller [workaround]

I bought a new motherboard. I was about to chroot from a live OS and then recompile the kernel to fit the new hardware. But...

Let me first explain my hard drive setup:
Six SSDs, bought at most two at a time. All are set as eSATA (hotswap) because they all reside in a 5.25" bay hotswap cage. The cage itself is a simple passthrough device. It only has an indicator LED for each drive, and for those to work the drive needs to support an activity LED.
lsblk /dev/sd{a..f}:
NAME      MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda         8:0    1 238.5G  0 disk 
├─sda1      8:1    1   512M  0 part 
├─sda2      8:2    1   3.5G  0 part 
│ └─md126   9:126  0  17.4G  0 raid5
└─sda3      8:3    1 234.5G  0 part
sdb         8:16   1 111.8G  0 disk 
├─sdb1      8:17   1   512M  0 part 
├─sdb2      8:18   1   3.5G  0 part 
│ └─md126   9:126  0  17.4G  0 raid5
└─sdb3      8:19   1 107.8G  0 part 
sdc         8:32   1 489.1G  0 disk 
├─sdc1      8:33   1   512M  0 part 
├─sdc2      8:34   1   3.5G  0 part 
│ └─md126   9:126  0  17.4G  0 raid5
└─sdc3      8:35   1 485.1G  0 part 
sdd         8:48   1 447.1G  0 disk 
├─sdd1      8:49   1   512M  0 part 
├─sdd2      8:50   1   3.5G  0 part 
│ └─md126   9:126  0  17.4G  0 raid5
└─sdd3      8:51   1 443.1G  0 part 
sde         8:64   1 447.1G  0 disk 
├─sde1      8:65   1   512M  0 part 
├─sde2      8:66   1   3.5G  0 part 
│ └─md126   9:126  0  17.4G  0 raid5
└─sde3      8:67   1 443.1G  0 part 
sdf         8:80   1 447.1G  0 disk 
├─sdf1      8:81   1   512M  0 part 
├─sdf2      8:82   1   3.5G  0 part 
│ └─md126   9:126  0  17.4G  0 raid5
└─sdf3      8:83   1 443.1G  0 part
  • The first partition of each device was a /boot partition on mdraid1. I'm not sure if I have lost that RAID array... All the drives appear as spares now. At first boot, at least, the partition was available.
  • md126 is/was my swap partition on raid5 for the hibernate image. I reformatted it to ext4 to run a test: I dumped data from /dev/urandom to (almost) fill the partition. The data going in and out of the partition had the same md5sum (dropping the write cache in between), and I didn't receive any errors while doing that. However, scrubbing the RAID device earlier did produce errors on at least four SATA buses.
  • The last, third, partition of each device is a btrfs filesystem. Reading and writing to it works, although I've been mounting it ro since I started to investigate this problem. I also have backups there, which I have further backed up to my server.
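
The write/read-back test described in the second bullet can be sketched roughly as follows. This is only an illustration, not the exact commands used: roundtrip_check and the DROP_CACHES switch are made-up names, and the target path is whatever file lives on the filesystem under test.

```shell
# Round-trip integrity check: write random data, flush it, read it back,
# and compare checksums. roundtrip_check is a hypothetical helper name.
roundtrip_check() {
    target=$1            # file on the filesystem under test
    size_mb=${2:-16}     # amount of random data to write, in MiB
    src=$(mktemp) || return 2
    dd if=/dev/urandom of="$src" bs=1M count="$size_mb" 2>/dev/null
    want=$(md5sum < "$src" | cut -d' ' -f1)
    cp "$src" "$target" && sync
    # Without dropping the page cache the read may be served from RAM and
    # never touch the SATA link. Needs root, so it is opt-in here.
    if [ "${DROP_CACHES:-0}" = 1 ]; then
        { echo 3 > /proc/sys/vm/drop_caches; } 2>/dev/null
    fi
    got=$(md5sum < "$target" | cut -d' ' -f1)
    rm -f "$src"
    [ "$want" = "$got" ]   # exit status 0 means the checksums match
}
```

Example use, as root: DROP_CACHES=1 roundtrip_check /mnt/test/blob 1024 && echo "checksums match"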


The problem
I encounter lots of ata errors, but somehow things still work. The only exception is that the (six-drive) raid1 array fell apart, with all members now showing up as spares. Before that, when the raid1 array still worked, it usually got stuck when I tried to umount it, but the system didn't freeze. I haven't tried to reassemble it yet. The data might still be there, but I also have backups. Also, it's only /boot.

Here's some more information:
Motherboard: ASRock 970M Pro3
root@livecd # uname -a:
Linux livecd 4.5.2-aufs-r1 #1 SMP Sun Jul 3 17:17:11 UTC 2016 x86_64 AMD FX(tm)-8350 Eight-Core Processor AuthenticAMD GNU/Linux
lspci:
00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD/ATI] RD890 PCI to PCI bridge (external gfx0 port B) (rev 02)
00:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD/ATI] RD990 I/O Memory Management Unit (IOMMU)
00:02.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] RD890 PCI to PCI bridge (PCI express gpp port B)
00:04.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] RD890 PCI to PCI bridge (PCI express gpp port D)
00:09.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] RD890 PCI to PCI bridge (PCI express gpp port H)
00:11.0 SATA controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode] (rev 40)
00:12.0 USB controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller
00:12.2 USB controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller
00:13.0 USB controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller
00:13.2 USB controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller
00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD/ATI] SBx00 SMBus Controller (rev 42)
00:14.2 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] SBx00 Azalia (Intel HDA) (rev 40)
00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 LPC host controller (rev 40)
00:14.4 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] SBx00 PCI to PCI Bridge (rev 40)
00:14.5 USB controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI2 Controller
00:15.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] SB700/SB800/SB900 PCI to PCI bridge (PCIE port 0)
00:15.3 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] SB900 PCI to PCI bridge (PCIE port 3)
00:16.0 USB controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller
00:16.2 USB controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller
00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 0
00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 1
00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 2
00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 3
00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 4
00:18.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 5
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Fiji [Radeon R9 FURY / NANO Series] (rev ca)
01:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Fiji HDMI/DP Audio Controller
02:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx HCA] (rev 20)
03:00.0 USB controller: Etron Technology, Inc. EJ188/EJ198 USB 3.0 Host Controller
06:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 06)
dmesg | grep 'ata[0-9]':
[    1.723018] ata1: SATA max UDMA/133 abar m1024@0xfeb0b000 port 0xfeb0b100 irq 19
[    1.723021] ata2: SATA max UDMA/133 abar m1024@0xfeb0b000 port 0xfeb0b180 irq 19
[    1.723023] ata3: SATA max UDMA/133 abar m1024@0xfeb0b000 port 0xfeb0b200 irq 19
[    1.723025] ata4: SATA max UDMA/133 abar m1024@0xfeb0b000 port 0xfeb0b280 irq 19
[    1.723027] ata5: SATA max UDMA/133 abar m1024@0xfeb0b000 port 0xfeb0b300 irq 19
[    1.723029] ata6: SATA max UDMA/133 abar m1024@0xfeb0b000 port 0xfeb0b380 irq 19
[    2.178802] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[    2.179794] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[    2.179810] ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[    2.179825] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[    2.179842] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[    2.180125] ata3.00: supports DRM functions and may not be fully accessible
[    2.180279] ata1.00: ATA-9: SAMSUNG SSD 830 Series, CXM03B1Q, max UDMA/133
[    2.180281] ata1.00: 500118192 sectors, multi 16: LBA48 NCQ (depth 31/32), AA
[    2.180484] ata3.00: ATA-10: Crucial_CT525MX300SSD1,  M0CR040, max UDMA/133
[    2.180486] ata3.00: 1025610768 sectors, multi 16: LBA48 NCQ (depth 31/32), AA
[    2.180559] ata1.00: configured for UDMA/133
[    2.180792] ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[    2.181071] ata6.00: ATA-11: KINGSTON SUV400S37480G, 0C3FD6SD, max UDMA/133
[    2.181073] ata6.00: 937703088 sectors, multi 1: LBA48 NCQ (depth 31/32), AA
[    2.181197] ata3.00: supports DRM functions and may not be fully accessible
[    2.181452] ata6.00: configured for UDMA/133
[    2.182044] ata3.00: configured for UDMA/133
[    2.186906] ata2.00: ATA-8: KINGSTON SV300S37A120G, 600ABBF0, max UDMA/133
[    2.186908] ata2.00: 234441648 sectors, multi 1: LBA48 NCQ (depth 31/32), AA
[    2.186991] ata5.00: ATA-8: KINGSTON SV300S37A480G, 605ABBF2, max UDMA/133
[    2.186993] ata5.00: 937703088 sectors, multi 1: LBA48 NCQ (depth 31/32), AA
[    2.187450] ata4.00: ATA-8: KINGSTON SV300S37A480G, 605ABBF2, max UDMA/133
[    2.187452] ata4.00: 937703088 sectors, multi 1: LBA48 NCQ (depth 31/32), AA
[    2.192499] ata5.00: configured for UDMA/133
[    2.192933] ata4.00: configured for UDMA/133
[    2.193052] ata2.00: configured for UDMA/133
[    2.220890] ata3.00: Enabling discard_zeroes_data
[    2.221124] ata3.00: Enabling discard_zeroes_data
[    2.221476] ata3.00: Enabling discard_zeroes_data
[   14.640192] ata6.00: exception Emask 0x10 SAct 0x1800000 SErr 0x400000 action 0x6 frozen
[   14.640194] ata6.00: irq_stat 0x08000000, interface fatal error
[   14.640196] ata6: SError: { Handshk }
[   14.640199] ata6.00: failed command: WRITE FPDMA QUEUED
[   14.640202] ata6.00: cmd 61/80:b8:00:bc:81/00:00:0b:00:00/40 tag 23 ncq 65536 out
[   14.640204] ata6.00: status: { DRDY }
[   14.640206] ata6.00: failed command: WRITE FPDMA QUEUED
[   14.640208] ata6.00: cmd 61/80:c0:80:bc:81/00:00:0b:00:00/40 tag 24 ncq 65536 out
[   14.640210] ata6.00: status: { DRDY }
[   14.640212] ata6: hard resetting link
[   15.096160] ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[   15.096836] ata6.00: configured for UDMA/133
[   15.096843] ata6: EH complete
[   15.108166] ata6.00: exception Emask 0x10 SAct 0x6 SErr 0x400000 action 0x6 frozen
[   15.108168] ata6.00: irq_stat 0x08000000, interface fatal error
[   15.108169] ata6: SError: { Handshk }
[   15.108171] ata6.00: failed command: WRITE FPDMA QUEUED
[   15.108174] ata6.00: cmd 61/80:08:00:bc:81/00:00:0b:00:00/40 tag 1 ncq 65536 out
[   15.108176] ata6.00: status: { DRDY }
[   15.108177] ata6.00: failed command: WRITE FPDMA QUEUED
[   15.108180] ata6.00: cmd 61/80:10:80:bc:81/00:00:0b:00:00/40 tag 2 ncq 65536 out
[   15.108181] ata6.00: status: { DRDY }
[   15.108184] ata6: hard resetting link
[   15.564138] ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[   15.564811] ata6.00: configured for UDMA/133
[   15.564815] ata6: EH complete
This snippet shows errors for ata6, but I've encountered the same errors for other drives/buses too.
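
To see which ports are affected and how often, the exceptions can be tallied per port. A minimal sketch, assuming the log format shown above; count_ata_exceptions is a made-up helper name, not an existing tool:

```shell
# Count libata "exception" lines per port from kernel log text on stdin.
# Prints one "port count" pair per line, e.g. "ata6 2".
count_ata_exceptions() {
    grep -E 'ata[0-9]+(\.[0-9]+)?: exception' |
        # Reduce each line to the port name, dropping the ".00" device part.
        sed -E 's/.*(ata[0-9]+)(\.[0-9]+)?: exception.*/\1/' |
        sort | uniq -c | awk '{print $2, $1}'
}
```

Example use: dmesg | count_ata_exceptions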

I don't want to admit it, but I think it's the SATA controller... Swapping the old motherboard back in would take some time... I wish I had some kind of test bench, but I don't.

Finally: the whole dmesg from one boot.
_________________
..: Zucca :..

Code:
ERROR: '--failure' is not an option. Aborting...


Last edited by Zucca on Mon Apr 10, 2017 6:28 pm; edited 1 time in total
roarinelk
Guru
Joined: 04 Mar 2004
Posts: 501

Posted: Sat Apr 08, 2017 9:15 am

I'd say it's the cables, or the connectors in the hotswap bay:

[ 15.108169] ata6: SError: { Handshk }

These errors appear when the data on the wires is bad (bit flips, CRC errors, etc.).
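
The SErr value in the exception line is a bitmask, and the names in braces are the kernel's decoding of it. A rough sketch of that decoding follows; the bit-to-name table is written from memory of libata's SError flags and should be treated as an assumption, although it does agree with both log excerpts in this thread (0x400000 gives Handshk, 0x500 gives UnrecovData Proto). decode_serr is a made-up name.

```shell
# Decode a SATA SError bitmask into libata-style flag names.
# The bit numbers below are an assumption based on libata's SERR_* flags.
decode_serr() {
    val=$(( $1 ))    # accepts hex such as 0x400000
    out=""
    for pair in 0:RecovData 1:RecovComm 8:UnrecovData 9:Persist 10:Proto \
                11:HostInt 16:PHYRdyChg 17:PHYInt 18:CommWake 19:10B8B \
                20:Dispar 21:BadCRC 22:Handshk 23:LinkSeq 24:TrStaTrns \
                25:UnrecFIS 26:DevExch; do
        bit=${pair%%:*}
        name=${pair#*:}
        # Append the name when the corresponding bit is set.
        if [ $(( ($val >> $bit) & 1 )) -eq 1 ]; then
            out="${out:+$out }$name"
        fi
    done
    echo "$out"
}
```

Example use: decode_serr 0x400000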
Zucca
Veteran
Joined: 14 Jun 2007
Posts: 1518
Location: KUUSANKOSKI, Finland

Posted: Sat Apr 08, 2017 9:36 am

roarinelk wrote:
I'd say it's the cables, or the connectors in the hotswap bay

I'll check the cables again. I have bought four new SATA cables. I bought them because I already had two of them and those were short enough and flexible... I do have spare SATA cables, so I'll start experimenting. Thanks.
Zucca
Veteran
Joined: 14 Jun 2007
Posts: 1518
Location: KUUSANKOSKI, Finland

Posted: Sun Apr 09, 2017 4:35 pm

I've now changed the cables for ata5 and ata6.

dmesg showed those familiar errors from ata5 during boot. dmesg also reported errors when I did a scrub... however, doing the scrub a second time didn't produce any errors. Scrubbing has always been successful, no matter how many errors show up in dmesg.

This just does not make any sense...
Zucca
Veteran
Joined: 14 Jun 2007
Posts: 1518
Location: KUUSANKOSKI, Finland

Posted: Sun Apr 09, 2017 9:22 pm

I think I've found a pattern: all the errors stop after the kernel limits the bandwidth of the SATA bus.
snip from dmesg:
[  711.235882] ata6: limiting SATA link speed to 3.0 Gbps
[  711.235887] ata6.00: exception Emask 0x12 SAct 0x3800 SErr 0x500 action 0x6 frozen
[  711.235889] ata6.00: irq_stat 0x08000000, interface fatal error
[  711.235890] ata6: SError: { UnrecovData Proto }
[  711.235893] ata6.00: failed command: READ FPDMA QUEUED
[  711.235896] ata6.00: cmd 60/40:58:f8:50:1d/05:00:00:00:00/40 tag 11 ncq 688128 in
                        res 40/00:5c:f8:50:1d/00:00:00:00:00/40 Emask 0x12 (ATA bus error)
[  711.235898] ata6.00: status: { DRDY }
[  711.235899] ata6.00: failed command: READ FPDMA QUEUED
[  711.235902] ata6.00: cmd 60/40:60:38:56:1d/05:00:00:00:00/40 tag 12 ncq 688128 in
                        res 40/00:5c:f8:50:1d/00:00:00:00:00/40 Emask 0x12 (ATA bus error)
[  711.235903] ata6.00: status: { DRDY }
[  711.235905] ata6.00: failed command: READ FPDMA QUEUED
[  711.235907] ata6.00: cmd 60/40:68:78:5b:1d/00:00:00:00:00/40 tag 13 ncq 32768 in
                        res 40/00:5c:f8:50:1d/00:00:00:00:00/40 Emask 0x12 (ATA bus error)
[  711.235909] ata6.00: status: { DRDY }
[  711.235911] ata6: hard resetting link
[  711.895866] ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
[  711.896440] ata6.00: configured for UDMA/133
After the bandwidth limit, the errors appear one final time. After those errors, the link hard resets to 3 Gbps, and the errors no longer appear.
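
To check which ports have already been dropped to a lower speed, the most recent "SATA link up" line per port can be pulled out of the log. A small sketch, assuming the message format shown above; link_speeds is a made-up helper name:

```shell
# Print the most recently negotiated link speed (in Gbps) per ATA port,
# reading kernel log text on stdin. Output: one "port speed" pair per line.
link_speeds() {
    sed -nE 's/.*(ata[0-9]+): SATA link up ([0-9.]+) Gbps.*/\1 \2/p' |
        # Later lines overwrite earlier ones, so only the last speed survives.
        awk '{speed[$1] = $2} END {for (p in speed) print p, speed[p]}' |
        sort
}
```

Example use: dmesg | link_speeds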

What is going on?
I've now switched the cables on ata[4-6] to different brand ones. They seem to have no visible effect.
frostschutz
Advocate
Joined: 22 Feb 2005
Posts: 2970
Location: Germany

Posted: Sun Apr 09, 2017 9:33 pm

You can use the libata.force kernel parameter to limit specific interfaces to 1.5 or 3.0 Gbps; if lower speeds resolve your issues, that might silence the errors. You can also use it to disable NCQ, which some controllers/drives don't handle properly.
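
For illustration, the parameter takes a comma-separated list of [ID:]VAL entries, where ID is the port number as printed in dmesg (ata6 is port 6) and an entry without an ID applies to all ports. The kernel's parameter documentation lists 1.5Gbps, 3.0Gbps and noncq among the values. The helper below only builds the string; libata_force_arg is a made-up name.

```shell
# Build a libata.force= argument for the kernel command line.
# Usage: libata_force_arg VALUE [PORT...]
# With no ports given, the value applies to every port.
libata_force_arg() {
    val=$1
    shift
    arg=""
    for port in "$@"; do
        arg="${arg:+$arg,}${port}:${val}"
    done
    echo "libata.force=${arg:-$val}"
}
```

For example, libata_force_arg 3.0Gbps 4 5 6 limits only ports 4 to 6, while libata_force_arg noncq turns NCQ off everywhere.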

As for the cause, it could be anything from controller, drive, whatever you have in between those, or even a kernel bug.
Zucca
Veteran
Joined: 14 Jun 2007
Posts: 1518
Location: KUUSANKOSKI, Finland

Posted: Mon Apr 10, 2017 10:55 am

frostschutz wrote:
You can use the libata.force kernel parameter to limit specific interfaces to 1.5 or 3.0 Gbps; if lower speeds resolve your issues, that might silence the errors. You can also use it to disable NCQ, which some controllers/drives don't handle properly.
When you mentioned NCQ, I thought that must be it, since most of the error messages are related to it. However, disabling it didn't make a difference. But forcing the bandwidth to 3 Gbps "solved" the issue.

frostschutz wrote:
As for the cause, it could be anything from controller, drive, whatever you have in between those, or even a kernel bug.
I can safely rule out the drives, as they all (6) worked flawlessly on my previous MB. I also have swapped several cables around and those didn't have any visible effect.

The SATA controller is an AMD SB950. There have been some problems with it. Also see the source of the information. I'm not sure if that bug is related... If it is, then I wonder if clocksource=tsc would resolve the issue. What's the drawback of using tsc compared to hpet?
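
Before experimenting with clocksource=, it helps to know what the kernel is currently using and what it offers. A small sketch that reads the standard sysfs files; show_clocksource is a made-up name, and the optional directory argument exists only so the function can be pointed elsewhere.

```shell
# Show the active and the available kernel clocksources.
# The default path is the usual sysfs location on Linux.
show_clocksource() {
    dir=${1:-/sys/devices/system/clocksource/clocksource0}
    echo "current: $(cat "$dir/current_clocksource")"
    echo "available: $(cat "$dir/available_clocksource")"
}
```

Example use: show_clocksource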
Zucca
Veteran
Joined: 14 Jun 2007
Posts: 1518
Location: KUUSANKOSKI, Finland

Posted: Mon Apr 10, 2017 6:28 pm

So far I haven't found any solution to this problem other than adding libata.force=3Gbps to the kernel command line.
I read somewhere that this problem may only occur if there are three or more devices attached to the SATA controller.
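
A sketch of how the workaround might be made persistent, assuming GRUB 2 is the bootloader (the thread does not say which one is in use). The kernel's parameter documentation spells the speed values as 1.5Gbps and 3.0Gbps, so that spelling is used here.

```shell
# /etc/default/grub (assumption: GRUB 2; other bootloaders use their own config)
# Append the option to any parameters already set on this line.
GRUB_CMDLINE_LINUX="libata.force=3.0Gbps"
```

After editing, the config would be regenerated with grub-mkconfig -o /boot/grub/grub.cfg. Following a reboot, /proc/cmdline should contain the option and dmesg should report the links coming up at 3.0 Gbps.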

If anyone, who stumbles here, has a better solution, please post. For now I'll use this "hack".

Changing the topic from "New MB - almost random (frequent) ata errors" to "SATA HW errors with AMD SB950 controller [workaround]".
All times are GMT