Gentoo Forums
Hardware errors - Memory stick faulty?
Zucca
Veteran


Joined: 14 Jun 2007
Posts: 1520
Location: KUUSANKOSKI, Finland

Posted: Tue Dec 19, 2017 12:30 pm    Post subject: Hardware errors - Memory stick faulty?

So... my server's syslogger spat these out as wall messages:
Code:
Message from syslogd@zelan at Tue Dec 19 14:57:22 2017 ...
zelan kernel: [11120784.801393] [Hardware Error]: Corrected error, no action required.

Message from syslogd@zelan at Tue Dec 19 14:57:22 2017 ...
zelan kernel: [11120784.801726] [Hardware Error]: MC4 Error (node 0): L3 data cache ECC error.

Message from syslogd@zelan at Tue Dec 19 14:57:22 2017 ...
zelan kernel: [11120784.801485] [Hardware Error]: CPU:0 (15:2:0) MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x9d18c480001c011b

Message from syslogd@zelan at Tue Dec 19 14:57:22 2017 ...
zelan kernel: [11120784.801806] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD

Message from syslogd@zelan at Tue Dec 19 14:57:22 2017 ...
zelan kernel: [11120784.801647] [Hardware Error]: Error Addr: 0x00000002d901ee14


If it's the RAM, it's no problem. If it's the CPU, then I'll have a hard time finding another 65W TDP AM3 CPU...

Does anyone have experience in "decoding" these messages?
_________________
..: Zucca :..

Code:
ERROR: '--failure' is not an option. Aborting...
NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 43195
Location: 56N 3W

Posted: Tue Dec 19, 2017 1:15 pm    Post subject:

Zucca,

It's a Level 3 cache problem. That's the CPU.
"Corrected error, no action required" means that it was a single-bit error and the ECC corrected it.
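
For anyone who wants to check that reading by hand, here is a minimal Python sketch that splits the MC4_STATUS value from the log into its fields. The flag positions (bits 63 down to 57) are the architectural MCi_STATUS bits. The ECC bits at 46:45, the 5-bit extended error code at 20:16 and the name tables are assumptions based on how the Linux EDAC decoder prints them for this CPU family, so treat them as illustrative rather than authoritative. All function and table names are made up for this example.

```python
# Sketch only: field layout partly assumed, see the note above.
STATUS = 0x9d18c480001c011b  # MC4_STATUS value from the log

# Architectural MCi_STATUS flag bits.
FLAGS = {"Val": 63, "Over": 62, "UC": 61, "En": 60,
         "MiscV": 59, "AddrV": 58, "PCC": 57}

# Name tables for the low 16-bit error code (assumed, mirroring kernel output).
LL = ["RESV", "L1", "L2", "L3/GEN"]          # cache level
TT = ["INSN", "DATA", "GEN", "RESV"]         # transaction type
RRRR = ["GEN", "RD", "WR", "DRD", "DWR", "IRD", "PRF", "EV", "SNP"]


def decode_status(status):
    """Return a dict with the flags and error-code fields of a status word."""
    out = {name: bool((status >> bit) & 1) for name, bit in FLAGS.items()}

    # Bits 46:45 (assumed AMD-specific): 2 means corrected ECC error,
    # any other non-zero value means uncorrected ECC error.
    ecc = (status >> 45) & 0x3
    out["ecc"] = None if ecc == 0 else ("CECC" if ecc == 2 else "UECC")

    out["xec"] = (status >> 16) & 0x1f       # extended error code (assumed width)
    ec = status & 0xffff
    out["ec"] = ec

    # Error codes of the form 0000 0001 RRRR TTLL describe a cache/memory error.
    if (ec & 0xff00) == 0x0100:
        r = (ec >> 4) & 0xf
        out["mem_tx"] = RRRR[r] if r < len(RRRR) else "RESV"
        out["tx"] = TT[(ec >> 2) & 0x3]
        out["level"] = LL[ec & 0x3]
    return out


fields = decode_status(STATUS)
print(fields)
```

For the value above this gives cache level L3/GEN, tx GEN, mem-tx RD and CECC with UC clear, which agrees with what the kernel printed.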

If it's a one-time thing, it may be a passing cosmic ray. If it's repeatable, it's a failing CPU.

Do you have ECC RAM in that server?
If not, you can't get ECC errors due to RAM.

The ECC can detect and correct all single-bit errors. It can detect but not correct two-bit errors.
A detected two-bit error should shut the system down with a kernel panic.

With three or more bits in error, anything might happen.
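
To make the one-bit, two-bit and three-bit cases concrete, here is a small Python sketch of a textbook extended Hamming (72,64) code. The bit layout (check bits at power-of-two positions, overall parity at position 0) is the classroom construction and is an assumption for illustration; the actual layout inside the CPU is not described in this thread. All names are hypothetical.

```python
# Textbook SECDED sketch: 64 data bits, 7 Hamming check bits, 1 overall parity bit.
PARITY_POS = [1, 2, 4, 8, 16, 32, 64]
DATA_POS = [p for p in range(1, 72) if p not in PARITY_POS]  # 64 positions


def flip(word, *positions):
    """Return a copy of the codeword with the given bit positions inverted."""
    out = list(word)
    for p in positions:
        out[p] ^= 1
    return out


def encode(data):
    """Encode a 64-bit integer as a list of 72 bits (index = bit position)."""
    word = [0] * 72
    for i, pos in enumerate(DATA_POS):
        word[pos] = (data >> i) & 1
    for p in PARITY_POS:
        # Check bit p covers every position whose index has bit p set.
        word[p] = sum(word[q] for q in range(1, 72) if q & p) % 2
    word[0] = sum(word[1:]) % 2  # overall parity makes the total even
    return word


def decode(word):
    """Return (status, data); data is None when the error is uncorrectable."""
    syndrome = 0
    for pos in range(1, 72):
        if word[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid codeword
    odd_parity = sum(word) % 2 == 1
    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity and syndrome < 72:
        # Odd number of flips: assume exactly one and repair it.
        # Syndrome 0 here means the overall parity bit itself flipped.
        word = flip(word, syndrome)
        status = "corrected"
    else:
        return "uncorrectable", None  # looks like a double-bit error
    data = sum(word[pos] << i for i, pos in enumerate(DATA_POS))
    return status, data


data = 0x0123456789abcdef
cw = encode(data)
print(decode(cw))                   # clean word
print(decode(flip(cw, 37)))         # one bit flipped
print(decode(flip(cw, 37, 50)))     # two bits flipped
print(decode(flip(cw, 3, 5, 6)))    # three bits flipped
```

The single flip is repaired and the double flip is reported as uncorrectable. The triple flip at positions 3, 5 and 6 is the interesting one: the decoder reports "corrected" but hands back wrong data, which is the "anything might happen" case.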
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
Zucca
Veteran


Joined: 14 Jun 2007
Posts: 1520
Location: KUUSANKOSKI, Finland

Posted: Tue Dec 19, 2017 1:29 pm    Post subject:

Hi Neddy.
It's a consumer motherboard with 16 GB of regular (non-ECC) DDR3. The CPU is an Opteron 3380.
I'm confused as to how it corrected the error...

This is the only time it's happened. I'll be keeping my eye on the logs...
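
A rough way to keep an eye on the logs is to count how often a corrected error shows up. The Python sketch below assumes the kernel lines look like the ones quoted earlier in this thread and that each event starts with a "Corrected error" line; both are assumptions taken from this single sample, and the function names are hypothetical.

```python
import re

# Matches "[<kernel timestamp>] [Hardware Error]: <message>" anywhere in a line.
HW_ERR = re.compile(r"\[\s*(\d+\.\d+)\] \[Hardware Error\]: (.*)")

SAMPLE = [
    "zelan kernel: [11120784.801393] [Hardware Error]: Corrected error, no action required.",
    "zelan kernel: [11120784.801726] [Hardware Error]: MC4 Error (node 0): L3 data cache ECC error.",
    "zelan kernel: [11120784.801647] [Hardware Error]: Error Addr: 0x00000002d901ee14",
    "zelan kernel: [11120790.000000] some unrelated message",
]


def hardware_errors(lines):
    """Yield (kernel_timestamp, message) for every [Hardware Error] line."""
    for line in lines:
        match = HW_ERR.search(line)
        if match:
            yield float(match.group(1)), match.group(2).strip()


def count_corrected(lines):
    """Count events, assuming each one begins with a 'Corrected error' line."""
    return sum(1 for _, msg in hardware_errors(lines)
               if msg.startswith("Corrected error"))


print(count_corrected(SAMPLE))
```

Feeding it the saved kernel log instead of SAMPLE gives a running count; a number that keeps climbing would point at failing hardware rather than a one-off.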
NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 43195
Location: 56N 3W

Posted: Tue Dec 19, 2017 1:47 pm    Post subject:

Zucca,

The L3 cache inside the CPU stores 64 bits of data and 8 bits of 'parity'.
In the bad old days, that would be one parity bit per byte, which allows single-bit errors to be detected but not corrected.

However, 8 bits of redundancy are enough for a Hamming code over 64 data bits: 7 check bits for single-bit correction plus 1 overall parity bit for two-bit detection.
That's a much better use of the redundancy.
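
A short Python sketch of why the same 8 bits go further as a Hamming code. It shows that per-byte parity can tell which byte is damaged but not which bit, and it computes how many check bits single-bit correction needs for 64 data bits. The function names and the sample value are made up for illustration.

```python
def byte_parity(word64):
    """One even-parity bit per byte: 8 check bits for 64 data bits."""
    return [bin((word64 >> (8 * i)) & 0xff).count("1") % 2 for i in range(8)]


def hamming_check_bits(data_bits):
    """Smallest r with 2**r >= data_bits + r + 1 (enough to locate one bad bit)."""
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


stored = 0xdeadbeefcafef00d          # arbitrary 64-bit example value
damaged = stored ^ (1 << 13)         # flip one bit inside byte 1

before = byte_parity(stored)
after = byte_parity(damaged)
bad_bytes = [i for i in range(8) if before[i] != after[i]]
print("bytes with a parity mismatch:", bad_bytes)

# Every single-bit flip inside byte 1 produces the same parity pattern,
# so parity alone cannot say which of the 8 bits to repair.
patterns = {tuple(byte_parity(stored ^ (1 << bit))) for bit in range(8, 16)}
print("distinct parity patterns for byte 1 flips:", len(patterns))

# Check bits needed for single-bit correction; one more overall parity bit
# on top of this adds two-bit detection.
print("Hamming check bits for 64 data bits:", hamming_check_bits(64))
```

Parity flags byte 1 and nothing more, while the Hamming count comes out at 7, leaving the eighth bit free for overall parity.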
Zucca
Veteran


Joined: 14 Jun 2007
Posts: 1520
Location: KUUSANKOSKI, Finland

Posted: Tue Dec 19, 2017 2:14 pm    Post subject:

NeddySeagoon wrote:
The L3 cache inside the CPU stores 64 bits of data and 8 bits of 'parity'.
In the bad old days, that would be one parity bit per byte, which allows single-bit errors to be detected but not corrected.
Nice. That's progress I like. :)
So far it looks like just a cosmic ray flipping one single bit...