Gentoo Forums
How should I find out which dimm is bad?
tholin
Apprentice


Joined: 04 Oct 2008
Posts: 168

Posted: Wed Oct 15, 2014 1:17 pm    Post subject: How should I find out which dimm is bad?

The long story is, I got segfaults when compiling chromium, so I did some memtest86+ runs. They didn't show any problems at first, so I looked for a software problem for a long time until I finally got memtest to report an error. That was after running it for 10h+. I masked out the broken addresses with the memmap kernel parameter and the segfaults went away.

To find the bad dimm I removed them all and tested them one at a time with memtest86+ for 24h each (15 passes). I didn't get any errors from any of them. When I put all the dimms back in I got the same error again. Even though I put all the dimms in different slots, I still got an error at the same address as before. To me this suggested the problem was in the motherboard. One of the sata connectors on the motherboard stopped working at the same time, so I was pretty sure it was the motherboard.

I bought a new motherboard and a cpu to go with it, because I couldn't find any motherboard with the same socket as my old cpu. I reused my old ram on the assumption that it was good. Now I get segfaults again...

Memtest86+ shows an error at the same address as before, even on this new system. How am I supposed to find out which dimm is bad? Testing them for 24h each was apparently not enough. And why do I always get an error at the same address? Perhaps the bios maps the dimms sorted by serial number or something? I tested several versions of memtest86+ and they all give the same error. I've wasted 550€ on this problem already and don't want to buy more stuff before I know for sure what is wrong.
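
For reference, the masking described in this post relies on the kernel's memmap=nn$ss boot parameter, which marks nn bytes starting at address ss as reserved so the kernel does not use them. A minimal sketch of building such a parameter around a failing address might look like this; the example address, the 1 MiB padding, and the helper name memmap_param are illustrative assumptions, not values taken from the post.

```python
def memmap_param(bad_addr: int, pad: int = 1 << 20, align: int = 4096) -> str:
    """Build a memmap=SIZE$START string reserving memory around bad_addr.

    pad and align are assumptions chosen for illustration: reserve 1 MiB on
    each side of the failing address and keep the region page-aligned.
    """
    start = max(bad_addr - pad, 0)
    start -= start % align            # round the start down to a page boundary
    end = bad_addr + pad
    end += (-end) % align             # round the end up to a page boundary
    return f"memmap={end - start:#x}${start:#x}"


# Hypothetical failing address, for illustration only.
print(memmap_param(0x1EF000000))
```

The resulting string goes on the kernel command line. Depending on the bootloader, the `$` may need escaping in its configuration file, so it is worth checking /proc/cmdline after booting to confirm the parameter arrived intact.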
RazielFMX
l33t


Joined: 23 Apr 2005
Posts: 835
Location: NY, USA

Posted: Wed Oct 15, 2014 3:45 pm

I would start by checking for a BIOS update and inspecting the settings in the BIOS. It is possible you had a bad motherboard (which sounds exactly correct given your other component failures) and bad RAM, but let's not assume anything. Also, are all your RAM sticks from the same vendor, same size, and same speed?
tholin
Apprentice


Joined: 04 Oct 2008
Posts: 168

Posted: Thu Oct 16, 2014 11:16 am

Both the new and the old motherboard had the latest bios. The testing was done with the bios "optimized defaults", which ran the ram in JEDEC compatible mode (1333 MHz at 1.5 V). I also tested with the XMP profile (1600 MHz at 1.65 V) and both failed.

All dimm sticks are identical: 4 Corsair CMX8GX3M2A1600C9, each 4G.
RazielFMX
l33t


Joined: 23 Apr 2005
Posts: 835
Location: NY, USA

Posted: Thu Oct 16, 2014 12:19 pm

What was your testing procedure? All sticks in the same slot? Did you remove one stick at a time and run but not change slots?

Also what is your motherboard?

The reason I ask is as follows.

Test Scenario 1: Run each dimm in a known good slot

- This test attempts to isolate the bad dimm.

Test Scenario 2: Put all the RAM back in, then remove sticks one at a time until the test comes back clean

- This test attempts to isolate the bad slot.

Test Scenario 3: Test RAM in another machine

- With the i5 and i7 and with AMD chips, the memory controller is integrated into the CPU (you could do this test with a different CPU but the same RAM/motherboard).
- On older Intel chips, the memory controller is integrated into the northbridge (motherboard).
- If the RAM comes back clean, you have a problem with your memory controller. If it is on the motherboard, this is either a fundamental design flaw or something that may be fixed with a firmware/BIOS update. If it is in the CPU, I really have no idea where to go from there.
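
The reasoning behind the scenarios above can be sketched as a small decision helper. This is only a rough illustration: the function name likely_culprit, the DIMM labels, and the "failing only in combination" outcome are assumptions added here, not part of the procedure as posted, and scenario 2 appears only as the fallback suggestion.

```python
from typing import Dict, Optional


def likely_culprit(single_dimm_errors: Dict[str, bool],
                   other_machine_errors: Optional[bool] = None) -> str:
    """Map test outcomes to a suspect component.

    single_dimm_errors: DIMM label -> True if it failed alone in a known good slot
                        (scenario 1).
    other_machine_errors: True/False for the full set tested in another machine
                          (scenario 3), or None if that test was not run.
    """
    bad = sorted(d for d, failed in single_dimm_errors.items() if failed)
    if bad:
        # Scenario 1 isolated at least one stick that fails on its own.
        return "bad dimm: " + ", ".join(bad)
    if other_machine_errors is False:
        # RAM is clean elsewhere, so suspicion moves to the memory controller.
        return "memory controller (CPU or motherboard)"
    if other_machine_errors is True:
        # Assumption: clean alone but failing as a set on two machines.
        return "RAM set, failing only in combination"
    return "inconclusive: run scenario 2 or 3"


# All four sticks pass alone, but the full set also fails in another machine.
print(likely_culprit({"A": False, "B": False, "C": False, "D": False}, True))
```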
krinn
Watchman


Joined: 02 May 2003
Posts: 7071

Posted: Thu Oct 16, 2014 1:00 pm

Ram is just a pain; you assume your ram has a problem when it might have none.

Sometimes the mb is buggy with some ram models, sometimes a bios update fixes that, sometimes it bugs out when you put them in the wrong slots (not your case, as you fill all slots), sometimes the mb has a problem with all slots filled, and sometimes you get the problem only with all slots filled and certain ram models...

And sometimes it's just the heat. You know the story: my ram works but fails sometimes, and it took me a long time to see memtest fail with it (long run, heavy load -> heat too high -> failure).

You should give your mb model/version so other users can help you with that (an "I have that mb and that ram and all is fine" is a help too).
NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 43178
Location: 56N 3W

Posted: Thu Oct 16, 2014 7:36 pm

... and sometimes it's contact resistance between the DIMM and the slot.

You remove the RAM, look at it, and put it back in the same place. Then it's all good for a year or so and you need to repeat the process.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
tholin
Apprentice


Joined: 04 Oct 2008
Posts: 168

Posted: Fri Oct 17, 2014 11:27 am

RazielFMX wrote:
What was your testing procedure? All sticks in the same slot? Did you remove one stick at a time and run but not change slots?

I tested all the dimms in the same slot. The one recommended by the motherboard manual for a single dimm.

RazielFMX wrote:
Also what is your motherboard?

The old board was an Asus P8P67 Pro and the new one is a Gigabyte Z97X-Gaming G1.

I should also point out that I used the old system from early 2011 to May this year without any problems. My syslog tells me that's when I started getting segfaults. It wasn't until August that I realized it was a hardware problem. I didn't change any hardware configuration, so something broke by itself at that time.

The errors I get from memtest are also odd. Usually only test #7 fails. I have also seen failures of test #6, but they are very rare. To speed up the test I've tried to run only test #7 in a loop, but that never results in any errors by itself. I have to run all the tests, usually for at least 5 passes, before I get an error, but after I get the first error I can run test #7 in a loop and get more errors that way. Perhaps memtest slowly warms up the dimms, but it takes several hours before the heat results in an error. That could explain why I didn't get any errors when I tested the dimms individually.

Dual channel operation complicates things. I don't know exactly how it works, but I assume it interleaves a pair of dimms in a raid-like fashion. When I rearranged the dimms I might have put the bad dimm in the other dual channel slot. Then a read of the same 128-bit data would fail, resulting in an error at about the same address. But would that result in an error at the exact same address?

Since the error is always at the same address, perhaps the problem is caused by bad dma? The only dma capable device I reused in the new system is the graphics card. The error is always at physical address 7919.8M, so that could explain why I didn't get an error when testing only a single dimm. They are only 4G each, so if the spurious dma is written to 7919.8M it would end up outside of that. Spurious dma would not explain why only test #7 fails, though. I would expect all tests to fail in that case. The easiest way to test that would be to remove the graphics card, but I can't reach the pcie latch to free the card without taking out the entire cpu heat sink first...
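
The address arithmetic in this post can be checked with a short sketch. The first helper only confirms that 7919.8M lies beyond a single 4G dimm. The second is a deliberately simplified toy model of two-channel interleaving: the 64-byte granule, the single-dimm-per-channel layout, and the names fits_single_dimm and phys_addr are all assumptions for illustration. Real memory controllers use mappings that are chipset-specific and considerably more involved, so this is not a description of either board in the thread.

```python
GRANULE = 64      # assumed interleave granule in bytes (hypothetical)
CHANNELS = 2      # dual channel


def fits_single_dimm(addr_mib: float, dimm_mib: int = 4096) -> bool:
    """True if the address (in MiB) falls inside one dimm of the given size."""
    return addr_mib < dimm_mib


def phys_addr(local_offset: int, channel: int) -> int:
    """Toy model: consecutive granules alternate between the channels.

    local_offset is a byte offset inside the dimm sitting on `channel`.
    """
    granule_index, within = divmod(local_offset, GRANULE)
    return (granule_index * CHANNELS + channel) * GRANULE + within


# The reported error address cannot exist with only one 4G dimm installed.
print(fits_single_dimm(7919.8))

# Under the toy model, moving a dimm with one bad cell to the other channel
# shifts the failing physical address by one granule.
offset = 0x1234_5678  # hypothetical dimm-local offset of a bad cell
print(phys_addr(offset, 1) - phys_addr(offset, 0))
```

Under these assumptions, a bad cell that moved to the other channel would show up close to the old address but not at exactly the same one, which is the distinction the question about "the exact same address" turns on.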
hermanhedning
n00b


Joined: 17 Oct 2014
Posts: 2
Location: Stockholm

Posted: Fri Oct 17, 2014 2:28 pm

From what I can see, you have not swapped the HDD. That's something I always do when I want to localize a hardware issue.

Apart from that I got no clue :/
_________________
Specialist in Everything which borders to nothing.
All times are GMT