Gentoo Forums
Isolating memory failures
ExecutorElassus
Veteran


Joined: 11 Mar 2004
Posts: 1182
Location: Stuttgart, Germany

PostPosted: Thu Mar 14, 2019 12:40 pm    Post subject: Isolating memory failures Reply with quote

I've had a randomly recurring problem with system freezes, usually when doing Gaming Things (shoutout to wine's excellent progress with supporting modern games). I ran furmark without issue, and ran 'stress' on the CPU, also without issue. However, running memtest86+ caused the computer to switch off and restart, usually about 10% into the test. I pulled out all the RAM sticks, and put them in one by one, getting the same problem (at different points, lasting longer if I told memtest to use SMP) on every stick in any slot.

Someone elsewhere suggested this is a fault with the mobo's RAM bus. Is there a way to isolate this better? The system freezes occur more frequently when it's hot out, which suggests to me that some component on the mobo is burned, but I'd like to know more before I plunk down another €500 for a new CPU/mobo combo.

Cheers,

EE
eccerr0r
Watchman


Joined: 01 Jul 2004
Posts: 7159
Location: almost Mile High in the USA

PostPosted: Thu Mar 14, 2019 3:31 pm    Post subject: Reply with quote

If the computer powers down while running a RAM test, I'd look into power problems.
Since you mention that it fails more often during hot weather, you should check that the cooling system is working. Ensure everything is clean of dust.

Your RAM is probably fine, but the motherboard and the power supplies (including the ones on the motherboard) are suspect.
_________________
Intel Core i7 2700K@ 4.1GHz/HD3000 graphics/8GB DDR3/180GB SSD
What am I supposed watching?
ExecutorElassus
Veteran


Joined: 11 Mar 2004
Posts: 1182
Location: Stuttgart, Germany

PostPosted: Thu Mar 14, 2019 5:23 pm    Post subject: Reply with quote

CPU and GPU are on their own water loop with an external radiator. The case is vertically oriented (that is, the ports are on the top of the case, not the back, so airflow moves bottom-to-top over the cards).

I was told that PSUs only last "5-8 years," and this one's over a decade old. Is there any way to test that the PSU is faulty besides swapping it out?

Cheers,

EE
(this would be, honestly, a preferable diagnosis to a bad mobo, because the PSU is cheap to replace and I'd have to replace the entire CPU/RAM/mobo set otherwise)
Jaglover
Watchman


Joined: 29 May 2005
Posts: 7128
Location: Saint Amant, Acadiana

PostPosted: Thu Mar 14, 2019 5:49 pm    Post subject: Reply with quote

You could measure the voltages on the ATX connector. Don't unplug it; it is important to measure under load. However, this test would not reveal whether some rail provides "dirty" power. But then again, if it is that old, why not replace it? There are PSU testers, but the cheap ones are useless because they do not put any load on the PSU.
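
For anyone taking those multimeter readings, here is a minimal sketch of how they could be compared against the rail tolerances. The ±5 % (positive rails) and ±10 % (-12 V) limits are the commonly cited ATX figures and are an assumption here, not something stated in this thread; the readings in the example are made up.

```python
# Nominal ATX rail voltages and allowed tolerance (fraction of nominal).
# Assumption: the commonly cited ATX limits; verify against your PSU's data sheet.
ATX_RAILS = {
    "+3.3V": (3.3, 0.05),
    "+5V": (5.0, 0.05),
    "+12V": (12.0, 0.05),
    "-12V": (-12.0, 0.10),
    "+5VSB": (5.0, 0.05),
}


def check_rails(measured):
    """Return {rail: (low, high, ok)} for each measured rail voltage."""
    report = {}
    for rail, volts in measured.items():
        nominal, tol = ATX_RAILS[rail]
        margin = abs(nominal) * tol
        low, high = nominal - margin, nominal + margin
        report[rail] = (round(low, 3), round(high, 3), low <= volts <= high)
    return report


if __name__ == "__main__":
    # Hypothetical readings taken while the machine is under load.
    readings = {"+3.3V": 3.28, "+5V": 5.07, "+12V": 11.31}
    for rail, (low, high, ok) in check_rails(readings).items():
        status = "OK" if ok else "OUT OF SPEC"
        print(f"{rail}: {readings[rail]:.2f} V (allowed {low} to {high} V) {status}")
```

A static reading inside the window says nothing about ripple or transient response, so a pass here does not clear the PSU.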
_________________
Please learn how to denote units correctly!
NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 43368
Location: 56N 3W

PostPosted: Thu Mar 14, 2019 6:52 pm    Post subject: Reply with quote

ExecutorElassus,

If it's the PSU, you need to test the dynamic regulation. That's hard.
The problem is that the voltages need to stay within spec when the CPU goes from almost nothing to full power in one CPU clock.
With a 3 GHz CPU clock that's not very long (about 333 ps).
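
As a quick check of the arithmetic, the length of one clock cycle is simply 1/f. A minimal sketch (the function name is only illustrative):

```python
def clock_period_ps(freq_hz):
    """Length of one clock cycle in picoseconds."""
    return 1e12 / freq_hz


# One cycle at 3 GHz: roughly a third of a nanosecond.
print(f"{clock_period_ps(3e9):.0f} ps")
```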

It gets worse. The CPU and memory subsystem have their own PSU on the motherboard. This takes 12 V out of the tin can PSU and turns it into the voltages required by the CPU and memory.
This bit gets a very hard life and, as a result, fails more often than the PSU you are thinking of replacing.
At over 10 years old, if the rest of the system is of an age with the PSU, failures here can often be spotted with the Mk1 eyeball.
Look at the capacitors around the CPU. Be sure that they are not leaking, bulging, or tipped over. Those are all signs of failure.
Replacing these parts (they must all be done together) is a job requiring intermediate soldering skills.

So far, you have identified a systems problem that is probably not RAM.
Test the RAM with memtest86+ in another system.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
ExecutorElassus
Veteran


Joined: 11 Mar 2004
Posts: 1182
Location: Stuttgart, Germany

PostPosted: Thu Mar 14, 2019 7:03 pm    Post subject: Reply with quote

Hi Neddy!

the mobo/CPU/RAM were all from 2016; the GPU from 2009, the PSU from around 2008.

The intermittent problem I'm having is that the system will freeze, requiring hard reboot. It happens more often during warm weather, and more often when doing memory-intensive things (or, at least I assume that's the case since the game I play only uses one core really and doesn't do a lot of HDD writes).

As I said, another contact suggested the RAM bus is failing. But maybe it's the PSU on the mobo?

I'll have to ask around to see if I can find anybody who even has a desktop.

Is there any other way to test besides finding another machine?

Cheers,

EE
NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 43368
Location: 56N 3W

PostPosted: Thu Mar 14, 2019 7:17 pm    Post subject: Reply with quote

ExecutorElassus,

In 2016, the memory controller was built into the CPU. The bus is the tracking and terminating resistors at each end.
These resistors are on the RAM sticks at the RAM stick end and on the motherboard at the CPU end.
Assuming you have 4 or 6 memory sockets, then 4 or 6 parts of the RAM bus would have to have failed.
That's unlikely.

Can you post images of the region of the motherboard around the CPU?
Don't remove the CPU heatsink yet, let's see what we can see first.
ExecutorElassus
Veteran


Joined: 11 Mar 2004
Posts: 1182
Location: Stuttgart, Germany

PostPosted: Thu Mar 14, 2019 7:55 pm    Post subject: Reply with quote

Hi Neddy,

the best I could do without cracking the case is this photo. Sorry for all the tubing (and yes, the green coolant suggests to me that the GPU could probably do with a replacement; I'm gonna try that by year's end).

See anything useful there?

Cheers,

EE
NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 43368
Location: 56N 3W

PostPosted: Thu Mar 14, 2019 8:27 pm    Post subject: Reply with quote

ExecutorElassus,

It's difficult to see in that image.

There are 11 tubular silver things at the top and top right side of the water block.
There are more above the finned black and red heatsink. Those are the bits we need to see.

The finned heatsink carries the switching transistors for the CPU power supply.

The black D on the tops indicates polarity. That's important if/when you come to replace the parts.

The connector in the top right of the image, with the black and yellow wires, is the 12 V input to the CPU PSU.
My connector is charred and has most of the plastic missing from the PSU cable. Every now and again it goes high resistance and I have to clean it.
Yours looks OK. It's worth pulling it apart to inspect.

Better images would be useful.
ExecutorElassus
Veteran


Joined: 11 Mar 2004
Posts: 1182
Location: Stuttgart, Germany

PostPosted: Tue Mar 19, 2019 9:22 am    Post subject: Reply with quote

Hi Neddy,

I finally cracked the case open and got some better photos. I don't see any obvious damage here, and the cable appears to be properly seated (though there's a second cable, lower down and smaller, going into the mobo that I didn't check). Can you see anything here that looks like an obvious culprit?

Thanks for the help,

EE
NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 43368
Location: 56N 3W

PostPosted: Tue Mar 19, 2019 10:21 am    Post subject: Reply with quote

ExecutorElassus,

Both photos look good.

-- edit --

All look good. - Missed one.
ExecutorElassus
Veteran


Joined: 11 Mar 2004
Posts: 1182
Location: Stuttgart, Germany

PostPosted: Wed Mar 20, 2019 6:12 pm    Post subject: Reply with quote

all right. So should I go back to swapping out the PSU, or is there some other check I can make to try to narrow it down?

Cheers,

EE
NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 43368
Location: 56N 3W

PostPosted: Wed Mar 20, 2019 6:24 pm    Post subject: Reply with quote

ExecutorElassus,

Testing by substitution is indeed the next step.
Order doesn't matter but just one thing at a time. If you have a PSU to hand, swap it.
Likewise, try parts from this system elsewhere.
artbody
Guru


Joined: 15 Sep 2006
Posts: 431
Location: LB

PostPosted: Thu Mar 21, 2019 3:50 pm    Post subject: Reply with quote

I don't know a lot about PSUs,
but I've always had (on my PC)

sys-apps/lm_sensors

and GKrellM for visualisation
installed and configured,
so I can always see what temperature the GPU and CPU have.
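
For a scriptable view of the same data, here is a minimal sketch that reads temperatures straight from the kernel's hwmon sysfs files, which report millidegrees Celsius. Which chips and labels show up depends on the drivers loaded on your machine, so the names in any output are not guaranteed.

```python
from pathlib import Path


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {"chip/label": degrees_C} for every temp*_input file under root."""
    temps = {}
    for chip in sorted(Path(root).glob("hwmon*")):
        name_file = chip / "name"
        chip_name = name_file.read_text().strip() if name_file.exists() else chip.name
        for sensor in sorted(chip.glob("temp*_input")):
            # Use the matching tempN_label if the driver provides one.
            label_file = chip / sensor.name.replace("_input", "_label")
            if label_file.exists():
                label = label_file.read_text().strip()
            else:
                label = sensor.name[: -len("_input")]
            try:
                millideg = int(sensor.read_text().strip())
            except (OSError, ValueError):
                continue  # some sensors return an error when read
            temps[f"{chip_name}/{label}"] = millideg / 1000.0
    return temps
```

Calling `read_hwmon_temps()` with no arguments reads the live system; passing another directory is only useful for testing.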

the other thing i would suggest is a memtest
_________________
Never give up
WM : E16 layman -o https://raw.githubusercontent.com/mcclung/e16-overlay/master/repositories.xml -f -a e16-overlay
achim
ExecutorElassus
Veteran


Joined: 11 Mar 2004
Posts: 1182
Location: Stuttgart, Germany

PostPosted: Fri Mar 22, 2019 5:54 am    Post subject: Reply with quote

hi artbody,

I have gkrellm running. Neither CPU nor GPU ever get near redline (CPU maxes out around 55°C under 100% load, occasionally spiking to a bit over 60°C; GPU never gets above about 45°C).

As I said upthread, running memtest causes the machine to switch off and restart.

Unfortunately, I don't have a spare machine, and I don't know anybody who has a PC. I could maybe ask at the University if the lab has a spare PSU they could lend, and try that out. I'll let you know.

Cheers,

EE
ExecutorElassus
Veteran


Joined: 11 Mar 2004
Posts: 1182
Location: Stuttgart, Germany

PostPosted: Thu Apr 04, 2019 11:42 am    Post subject: Reply with quote

Update: I installed a new PSU. Same problem with memtest. I don't know yet if the machine freezes (it does it randomly). However, I also dusted off all the fans, and now both CPU and GPU temps are much lower, even when gaming.

I wonder if the problem might be heat-related. With both the CPU and GPU water-cooled, they don't register very high temperature, and since the case fans are motherboard-controlled, maybe there's not enough airflow in the case and some other component is overheating? It seems to freeze more when it's hot out, and less after I clean the fan filters. That still doesn't explain the switch-off running memtest.
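
One way to test the heat theory would be to log sensor readings to disk so that the last values before a freeze survive the hard reboot. The following is only a sketch under that assumption: the function names and the log format are made up for illustration, and the readings have to come from whatever sensor source is available.

```python
import os
import time


def append_durable(path, line):
    """Append one line and force it to disk so it survives a hard freeze."""
    with open(path, "a") as f:
        f.write(line.rstrip("\n") + "\n")
        f.flush()
        os.fsync(f.fileno())


def log_reading(path, readings, now=None):
    """Write 'epoch_seconds name=value ...' for a dict of sensor readings."""
    stamp = int(time.time() if now is None else now)
    fields = " ".join(f"{name}={value:.1f}" for name, value in sorted(readings.items()))
    append_durable(path, f"{stamp} {fields}")
```

Called every few seconds from a loop, `log_reading("/var/log/temps.log", {"cpu": 55.0})` leaves a trail whose final line shows how warm things were just before a lockup. The path is a placeholder.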

But in any case, with a new PSU I still don't know what the problem is, because it evidently isn't solved completely yet. Neddy, any idea what to try next? I was going to see if I could borrow a vid card from the University lab, and see if maybe that might be the problem. The GPU is now the oldest component (it's almost 9 years old), and I heard that vid card problems can affect memtest.

Any other suggestions?

Cheers,

EE
NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 43368
Location: 56N 3W

PostPosted: Thu Apr 04, 2019 8:14 pm    Post subject: Reply with quote

ExecutorElassus,

Try turning off Message Signalled Interrupts. Add
Code:
pci=nomsi
to your kernel line in grub.conf.
You get a small performance penalty for that.
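
After rebooting, it is worth confirming that the option actually reached the kernel. A minimal sketch that parses the text of /proc/cmdline; the helper names are illustrative, and it also handles the comma-separated form such as pci=noaer,nomsi:

```python
def pci_options(cmdline):
    """Collect every option given via pci=... on a kernel command line."""
    options = set()
    for token in cmdline.split():
        if token.startswith("pci="):
            options.update(opt for opt in token[len("pci="):].split(",") if opt)
    return options


def msi_disabled(cmdline):
    """True if the kernel was booted with pci=nomsi."""
    return "nomsi" in pci_options(cmdline)
```

On a running system, `msi_disabled(open("/proc/cmdline").read())` gives the answer for the current boot.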

Normal IRQs and MSI work quite differently.
In the old way, the address of the interrupt service routine is stored in a look up table. If the IRQ is shared, the service routine has to query every device in the list until it finds the device that raised the IRQ.
With MSI, the device is programmed with the address of the IRQ when the interrupt service routine is installed.
When the IRQ is acknowledged, the device puts this address on the bus and the CPU jumps to it.
It's more complex and has tighter timing constraints than the old way.
Sometimes, turning MSI off fixes hard-to-track-down lockups. When MSIs fail, the CPU can jump anywhere.

Test. When the system locks up, is the CPU halted?
In the halt state, pressing the reset button will not restart the CPU. Only the power button will bring it out of halt.

If the CPU is halted, it's got itself in a big mess, like it would if it jumped to something that was not code in response to an interrupt.

GPUs generate a lot of IRQs ...

Look at /proc/interrupts
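
To see which devices raise the most interrupts, and whether they use MSI, the file can be summarised with a short script. This is a sketch that assumes the usual layout: a header row of CPU columns, then one row per IRQ with a count per CPU followed by a free-form description.

```python
def parse_interrupts(text):
    """Parse /proc/interrupts text into a list of (irq, total_count, description)."""
    lines = text.strip("\n").splitlines()
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    rows = []
    for line in lines[1:]:
        irq, _, rest = line.partition(":")
        fields = rest.split()
        counts = []
        # Per-CPU counters come first; some rows (e.g. ERR) have fewer columns.
        while fields and len(counts) < ncpu and fields[0].isdigit():
            counts.append(int(fields.pop(0)))
        rows.append((irq.strip(), sum(counts), " ".join(fields)))
    return rows


def busiest(text, n=5):
    """Return the n rows with the highest total interrupt count."""
    return sorted(parse_interrupts(text), key=lambda row: row[1], reverse=True)[:n]
```

Running `busiest(open("/proc/interrupts").read())` before and after adding pci=nomsi shows whether the busy devices moved off rows whose description mentions MSI.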
ExecutorElassus
Veteran


Joined: 11 Mar 2004
Posts: 1182
Location: Stuttgart, Germany

PostPosted: Thu Apr 04, 2019 8:33 pm    Post subject: Reply with quote

Hi Neddy,

I'll try disabling MSI on next reboot. Would that also affect memtest?

When the system locks up (which so far has only once happened outside of gaming, and then immediately after closing the game), the reset button restarts the machine.

So far, though, after dusting off my fan filters, I haven't had any lockups. But since it's quite random, I don't know if that means anything.

I'm going to try borrowing a vid card and see if that solves the memtest issue.

Stay tuned,

EE
NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 43368
Location: 56N 3W

PostPosted: Fri Apr 05, 2019 8:29 pm    Post subject: Reply with quote

ExecutorElassus,

Turning off MSI fixes lots of marginal timing things.
C5ace
Apprentice


Joined: 23 Dec 2013
Posts: 282
Location: Brisbane, Australia

PostPosted: Sat Apr 06, 2019 4:58 pm    Post subject: Reply with quote

Had the same problem during our summer with my 9 year old 24/7 system. Fixed it by replacing the dried out thermal paste between the CPU and heatsink with fresh thermal paste.
_________________
Observation after 30 years working with computers:
All software has known and unknown bugs and vulnerabilities. Especially software written in complex, unstable and object oriented languages such as python, perl, C++, C#, Rust and the likes.
ExecutorElassus
Veteran


Joined: 11 Mar 2004
Posts: 1182
Location: Stuttgart, Germany

PostPosted: Sat Apr 06, 2019 5:31 pm    Post subject: Reply with quote

yup, that too. I've since learned that good thermal paste only lasts maybe 6 months, so now I replace it regularly. I also (again) discovered that I need to give the whole case, and especially the fan filters, a thorough vacuuming at least a few times a year. Drops the running temp down a good 30°C.

But since the freezes I was having happened apparently randomly (and only really when gaming) I have no way to know whether I've resolved the problem, except by inference and probability. The longer it goes without freezing, the more likely it looks that my problem, perhaps unrelated to memtest, was simply a problem with something in the case overheating.

I'll keep y'all posted.

Cheers,

EE
NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 43368
Location: 56N 3W

PostPosted: Sat Apr 06, 2019 5:37 pm    Post subject: Reply with quote

ExecutorElassus,

It really helps if you can generate a simple test case.

Keep in mind too that absence of evidence is not evidence of absence.
So you can't prove the problem is not there any more.


Last edited by NeddySeagoon on Sat Apr 06, 2019 8:27 pm; edited 1 time in total
ExecutorElassus
Veteran


Joined: 11 Mar 2004
Posts: 1182
Location: Stuttgart, Germany

PostPosted: Sat Apr 06, 2019 7:18 pm    Post subject: Reply with quote

Neddy, you are exactly right. I may have worded it poorly, but that's what I was getting at: all I really know so far is that it hasn't frozen again yet. And my main problem has been, from the beginning, that I can't isolate what's causing the system to freeze in the first place. I just know that it happens when gaming, and seems to happen less when the case/fans/radiator have been dusted. But the proximate cause remains unknown.

So, I guess, I'll just keep trying to get it to freeze, and keep trying different tests (should be able to borrow a vid card soon), and keep you posted.

Cheers,

EE
NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 43368
Location: 56N 3W

PostPosted: Sat Apr 06, 2019 8:29 pm    Post subject: Reply with quote

ExecutorElassus,

I think that phrase was attributed to Carl Sagan.
I'm sure I heard him use it with regards to SETI and SETI@home.
eccerr0r
Watchman


Joined: 01 Jul 2004
Posts: 7159
Location: almost Mile High in the USA

PostPosted: Sat Apr 06, 2019 10:29 pm    Post subject: Reply with quote

I kind of doubt a video card could cause memtest failures especially in newer machines where busses are mostly separated from each other. But it could be a first especially if the video card for whatever reason causes overload.

On older machines with a separate northbridge, I have had instances of the northbridge overheating or failing and causing memory failures. This shouldn't be the case for modern machines, however, unless the CPU has gone bad. BTW, do you get memory failures on cacheable vs uncacheable tests?

When I see overheating problems on my PC, it's usually due to dust clogging heatsink fins -- which shouldn't be a problem with water blocks -- but the heatsink compound is a common denominator... IMHO if heat sink compound is applied properly (versus blathered all over the place), I think it should last longer between applications. Then again I really try hard not to need to remove the heatsink so I don't need to reapply heatsink compound, and I have machines that have yet to replace the heatsink compound since initial assembly when new.