Gentoo Forums
[solved] System hanging, drive issue or SATA controller?
superjaded
l33t


Joined: 05 Jul 2002
Posts: 784

Posted: Wed Dec 11, 2019 6:40 pm    Post subject: [solved] System hanging, drive issue or SATA controller?

The past few days I've been experiencing some really weird issues that look to be related to my storage. I'll be doing whatever -- browsing the web or other things which should not be too system intensive -- and my system will sporadically hang for what seems like a minute or two. Sometimes it hangs indefinitely: it's responsive enough that I can drop to a tty, but then it just sits there after I enter my username. The first time I noticed this was when I first touched my computer in the morning, which suggests it started hanging when it wasn't doing much more than running background processes.

I've seen these errors in both kernel versions 5.4.2-gentoo and 5.3.15-gentoo. I started seeing them when I was already primarily using 5.4.2, so I reinstalled 5.3.15, since I initially thought there might be an issue with ZFS and 5.4.x, but I still saw the errors in 5.3.15.

I've also started noticing the following errors pop up in my logs:
Code:
Dec 09 16:13:31 [kernel] [31268.294640] ata5.00: irq_stat 0x00400040, connection status changed
Dec 09 16:13:31 [kernel] [31268.294641] ata5: SError: { HostInt PHYRdyChg 10B8B DevExch }
Dec 09 16:13:31 [kernel] [31268.294642] ata5.00: failed command: FLUSH CACHE EXT
Dec 09 16:13:31 [kernel] [31268.294645] ata5.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 20
Dec 09 16:13:31 [kernel] [31268.294645]          res 40/00:28:40:81:29/00:00:03:00:00/40 Emask 0x50 (ATA bus error)
Dec 09 16:13:31 [kernel] [31268.294645] ata5.00: status: { DRDY }
Dec 09 16:13:31 [kernel] [31268.294648] ata5: hard resetting link
Dec 09 16:13:37 [kernel] [31274.057658] ata5: link is slow to respond, please be patient (ready=0)
Dec 09 16:13:39 [kernel] [31276.294660] ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Dec 09 16:13:40 [kernel] [31276.915658] ata5.00: supports DRM functions and may not be fully accessible
Dec 09 16:13:40 [kernel] [31276.916097] ata5.00: NCQ Send/Recv Log not supported
Dec 09 16:13:40 [kernel] [31276.917136] ata5.00: supports DRM functions and may not be fully accessible
Dec 09 16:13:40 [kernel] [31276.917470] ata5.00: NCQ Send/Recv Log not supported
Dec 09 16:13:40 [kernel] [31276.918269] ata5.00: configured for UDMA/133
Dec 09 16:13:40 [kernel] [31276.918273] ata5.00: retrying FLUSH 0xea Emask 0x50
Dec 09 16:13:40 [kernel] [31276.918310] ata5: EH complete


Googling the errors above generally points to kernel bugs rather than to imminent hardware failure. But those bug reports / forum threads are mostly from 2010 or earlier, so I'm not sure they're even relevant.
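
To see at a glance which ports are affected and how often, log lines like the ones above can be tallied per port. What follows is a minimal Python sketch, not a definitive tool: the function name `tally_ata_events` and the `MARKERS` list are illustrative choices, and the regex only assumes the `ataN:` / `ataN.MM:` prefixes seen in the excerpt.

```python
import re
from collections import Counter

# Matches the port prefix of a libata kernel message, e.g. "ata5.00:" or "ata5:".
ATA_LINE = re.compile(r"\b(ata\d+)(?:\.\d+)?: (.*)$")

# Substrings worth counting; picked from the log excerpt (an assumption, extend as needed).
MARKERS = ("failed command", "hard resetting link", "SError")


def tally_ata_events(lines):
    """Count marker occurrences per ATA port in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = ATA_LINE.search(line)
        if not match:
            continue  # not a libata message with a port prefix
        port, message = match.groups()
        for marker in MARKERS:
            if marker in message:
                counts[(port, marker)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Dec 09 16:13:31 [kernel] [31268.294641] ata5: SError: { HostInt PHYRdyChg 10B8B DevExch }",
        "Dec 09 16:13:31 [kernel] [31268.294642] ata5.00: failed command: FLUSH CACHE EXT",
        "Dec 09 16:13:31 [kernel] [31268.294648] ata5: hard resetting link",
    ]
    for (port, marker), n in sorted(tally_ata_events(sample).items()):
        print(port, marker, n)
```

Feeding it the lines of the system log (the file location depends on the logger setup) would show whether resets cluster on one port or are spread across several.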

I've noticed these kinds of errors for both "ata5.00" and "ata6.00," which, if I'm understanding the relation to the actual hard drives correctly --
ata5 is a Samsung 850 EVO 1TB drive I bought in Aug of 2018
ata6 is a Samsung 850 Evo 250GB that I bought in Aug of 2015
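
One way to double-check the ataN-to-drive relation is to resolve the `/sys/block/sdX` symlinks, whose targets on typical AHCI setups contain the `ataN` port as a path component. A hedged Python sketch follows; the helper names are hypothetical, and drives behind an HBA such as the SAS9211 would be expected not to show an `ataN` component at all.

```python
import os
import re

# An "ataN" directory component inside a resolved sysfs device path.
ATA_IN_PATH = re.compile(r"/(ata\d+)/")


def ata_port_from_sysfs_path(path):
    """Return the 'ataN' component of a resolved sysfs device path, or None."""
    match = ATA_IN_PATH.search(path)
    return match.group(1) if match else None


def map_block_devices(sys_block="/sys/block"):
    """Map each sdX entry under sys_block to its ATA port, where one is present."""
    mapping = {}
    if not os.path.isdir(sys_block):
        return mapping
    for name in sorted(os.listdir(sys_block)):
        if not name.startswith("sd"):
            continue
        # Entries under /sys/block are symlinks into the device tree.
        resolved = os.path.realpath(os.path.join(sys_block, name))
        port = ata_port_from_sysfs_path(resolved)
        if port:
            mapping[name] = port
    return mapping


if __name__ == "__main__":
    for dev, port in map_block_devices().items():
        print(f"{port} -> /dev/{dev}")
```

Combined with the model and serial number from `smartctl -i /dev/sdX`, this ties each `ataN` in the log to a physical drive.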

The 250GB drive is my system drive (i.e. boot, root) for Gentoo, and the 1TB drive is part of a striped zpool with a newer 1TB Samsung 860 Evo.

Probably also worth mentioning that all the drives in the system are encrypted via luks.

Of course, there is definitely something going on. I booted back into 5.4.2 this morning and launched a VM which was the last "configuration" I was in before I really started noticing this issue. Not long after I booted I saw several write errors to the 1TB drive. It looks like ZFS noticed the issues so now the fastdata zpool is in a degraded state. That was the first time I saw an error like that.

What I'm not really sure about is what those ata errors are actually pointing to -- is it just a simple case of both of those SSDs going bad at the same time, or is it possible there is also an issue with the SATA controller that those drives are connected to?

Probably worth mentioning that in total I have 11 SSDs or HDs in my system. 3 SSDs using my motherboard's SATA controller, and 8 spinning rust HDs which are connected to an LSI SAS9211 expansion card. As far as I can tell, I'm not (yet) having any problems with the third SSD (which is the other 1TB drive in the aforementioned fastdata zpool) or the 8 mechanical HDs. Those drives I bought within the last few months, while the two SSDs I seem to be having issues with are getting up there in age, especially the 250GB one.

I'm also not seeing any SMART errors on any drive. I've also tried the "short" test on all the drives and they all passed.
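
A clean SMART health status and a passing short test do not by themselves rule out trouble on the link between drive and controller. One more thing that can be checked is SMART attribute 199, the interface CRC error counter (smartctl typically labels it `UDMA_CRC_Error_Count` or `CRC_Error_Count`); a rising raw value tends to point at cabling or connectors rather than the drive itself. Below is a small Python sketch, assuming the usual ten-column attribute table printed by `smartctl -A`. The function names are made up for illustration.

```python
import subprocess


def crc_error_count(smartctl_output):
    """Return the raw value of SMART attribute 199 from `smartctl -A` text, or None."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have ten columns: the first is the ID, the tenth the raw value.
        if len(fields) >= 10 and fields[0] == "199":
            try:
                return int(fields[9])
            except ValueError:
                return None
    return None


def read_crc_errors(device):
    """Run `smartctl -A` on device (needs root and smartmontools) and parse the result."""
    result = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True, check=False
    )
    return crc_error_count(result.stdout)
```

Calling `read_crc_errors("/dev/sdX")` before and after a period with ata errors, and comparing the two numbers, is more telling than a single reading, since the counter does not reset once a cable is fixed.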

Any ideas would be helpful or if any other info might be needed. I'm assuming the SSDs are going bad, but the errors I'm seeing have me confused whether it's JUST an issue with the drives or if the SATA controller on my motherboard is going kaput too.

In case it helps, here is some general info about my system:
i7 6700k (stock speed presently)
ASUS Z170-E motherboard
32GB DDR4 memory
MSI Radeon RX580 8GB
Corsair RM850 PSU
SSDs using motherboard SATA controller:
250GB Samsung 850 EVO
1TB Samsung 850 EVO
1TB Samsung 860 EVO
HDs using LSI SAS9211 expansion card:
8x 10TB Seagate Ironwolf drives in raidz2
SB Audigy
Fresco Logic FL1100 USB 3.0 (usb expansion card)


Last edited by superjaded on Thu Dec 12, 2019 1:38 am; edited 1 time in total
apiaio
Apprentice


Joined: 04 Dec 2008
Posts: 218

Posted: Wed Dec 11, 2019 7:49 pm

Based on the hardware you described, I assume you are using a desktop PC. I had a similar problem in the past.

Please check the temperatures of the motherboard and processor. If anything looks odd, open the case and clean it out with a blower or vacuum, mainly the coolers.
NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 44921
Location: 56N 3W

Posted: Wed Dec 11, 2019 8:24 pm

superjaded,

Please post
Code:
smartctl -a /dev/<whole_device>

You will need smartmontools.

Code:
ATA bus error
may be external to the drive. Poor quality SATA data cables are a common cause.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
Anon-E-moose
Advocate


Joined: 23 May 2008
Posts: 4399
Location: Dallas area

Posted: Wed Dec 11, 2019 8:46 pm

NeddySeagoon wrote:
superjaded,

Please post
Code:
smartctl -a /dev/<whole_device>

You will need smartmontools.

Code:
ATA bus error
may be external to the drive. Poor quality SATA data cables are a common cause.


I've had the sata cables become slightly loose and cause ata errors.
_________________
Asus m5a99fx, FX 8320 - nouveau, oss4, rx550 for qemu passthrough
Acer laptop E5-575, i3-7100u - i965, alsa
---both---
5.0.13 zen kernel, profile 17.1 (no-pie & modified) amd64-no-multilib
gcc 8.2.0, eudev, openrc, openbox, palemoon
superjaded
l33t


Joined: 05 Jul 2002
Posts: 784

Posted: Wed Dec 11, 2019 9:51 pm

NeddySeagoon wrote:
superjaded,

Please post
Code:
smartctl -a /dev/<whole_device>

You will need smartmontools.



  • 850 Evo 250GB (ata5 in the previous logs)
  • 850 Evo 1TB (ata6 in the previous logs): Seen the most ata errors, and this was the drive where I saw actual write errors
  • 860 Evo: I haven't seen any errors for this one yet, but figured why not.



Quote:
Code:
ATA bus error
may be external to the drive. Poor quality SATA data cables are a common cause.


Shoulda tried that already; thanks for the ideas. I should have some extra SATA cables, so at the very least I can try swapping those out and/or making sure they didn't get knocked loose somehow.

apiaio wrote:
Based on the hardware you described, I assume you are using a desktop PC. I had a similar problem in the past.

Please check the temperatures of the motherboard and processor. If anything looks odd, open the case and clean it out with a blower or vacuum, mainly the coolers.


Didn't notice this one earlier, but I don't think it's necessarily a temperature issue. My temps do get a little higher than I'd like during compiles, when my utilization is pegged at 100% (though usually not significantly above 80°C for long), but most of the time my CPU temps don't go much above 30°C. Mobo temps are usually at least 5-10°C lower than that.

My case/fans are new enough that they haven't had time to build up too much dust yet. ;) It's going to be fun trying to dust this case out when that does become necessary.

EDIT: Slight change of plans -- since I haven't had any errors for my 860 EVO yet, I took both of the 1TB drives out and hooked my 250GB SSD up to the cables that the 860 was connected to. Presumably if it's as simple as a cable/port issue, I shouldn't get any errors on my OS drive. If I go a while without errors, I'll start experimenting adding the other SSDs back.
Ant P.
Watchman


Joined: 18 Apr 2009
Posts: 6342

Posted: Wed Dec 11, 2019 10:45 pm

That it's happening to multiple different drives suggests that the fault isn't in those, so don't worry about your data too much.

Start with the cheapest part first and work up from there. Try different SATA ports in the mobo too, but beware of drive renumbering - if you use PARTUUIDs everywhere this won't be a hassle.

You say this is a new-ish PC, is the power supply strong enough for it? Web browsing is GPU-heavy nowadays but it won't show that in things like htop.
superjaded
l33t


Joined: 05 Jul 2002
Posts: 784

Posted: Thu Dec 12, 2019 1:37 am

I think this issue is resolved. I had my system running with just my 250GB SSD for at least two hours with no errors or system hangs. I then plugged my other two SSDs back in using different ports and SATA cables. My system has been up for around an hour so far without any of the errors I was getting before. It's nice when it's a simple fix and it was just me making things more complicated than they needed to be.

I did end up having to recreate my fastdata pool to clear up the "degraded" status due to the aforementioned write errors and since the pool was created without any redundancy anyway. That's not really a big deal since the pool just had VM images and Steam games on it, and I was kind of looking for an excuse to create a new gentoo testing VM. ;)

Ant P. wrote:
Start with the cheapest part first and work up from there. Try different SATA ports in the mobo too, but beware of drive renumbering - if you use PARTUUIDs everywhere this won't be a hassle.


Yep, I knew there was a reason why using UUIDs or /dev/disk/by-id/ where possible was a good choice. ;)
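
To see which persistent names point at which kernel device after moving drives between ports, the symlinks under `/dev/disk/by-id/` can be listed. A minimal Python sketch, with a hypothetical helper name `stable_names`:

```python
import os


def stable_names(by_id_dir="/dev/disk/by-id"):
    """Map each persistent name in by_id_dir to the kernel device it points at."""
    names = {}
    if not os.path.isdir(by_id_dir):
        return names
    for entry in sorted(os.listdir(by_id_dir)):
        link = os.path.join(by_id_dir, entry)
        # Only symlinks are of interest; udev creates one per persistent name.
        if os.path.islink(link):
            names[entry] = os.path.realpath(link)
    return names


if __name__ == "__main__":
    for name, device in stable_names().items():
        print(f"{name} -> {device}")
```

The names on the left stay the same when a drive moves to another port, while the `/dev/sdX` targets on the right may change, which is why pools and fstab entries built on the former survive a cable shuffle.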

Quote:
You say this is a new-ish PC,..


I suppose it's new in the grand scheme of things. The system was originally built at the end of 2016, but I've replaced/upgraded parts over the years as it made sense. Think at this point the CPU is one of the few things that was part of that original build.