Gentoo Forums
Stable System now Panic/Halt -> Powerdown
Hobbes2100
n00b


Joined: 24 Apr 2002
Posts: 50

Posted: Sat Oct 10, 2015 3:01 am    Post subject: Stable System now Panic/Halt -> Powerdown

Hi folks,

I'm dealing with a real *suck* issue. My workstation (circa 2013 build) has been rocking along nicely. However, in the last couple of days, I upgraded my kernel to 4.0.5 and plugged in some old (SATA) drives to check out their content. Today, I'm dealing with a semi-random (but reproducible) total system crash and power off. Here's what happens:

1. boot system
2. log in
3. for example, emerge -v chromium
4. wait a bit
--> one time, I caught a system kernel panic type screen that might have said hardware failure ... it said "rebooting in 30 second" or some such but in about 2 seconds the system powered off .... ugh
5. system state is powered off and I'm grumpy

I attempted a netconsole/nc connection to catch system messages. The netconsole itself was working (I was able to receive standard dmesg-type information), but the system crash was not logged in any way. There is nothing of interest in /var/log/kernel (written by, I think, metalog).
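
For anyone repeating the netconsole attempt: netconsole sends kernel messages as UDP datagrams, so the receiving side can be any UDP listener. Below is a minimal sketch of one in Python that syncs every datagram to disk as soon as it arrives, so whatever did reach the network before the power-off is preserved. The function names and the log file name are hypothetical, and port 6666 is an assumption that should be checked against the target port in your own netconsole= parameters. It cannot recover messages the crashing machine never managed to send.

```python
import os
import socket
import time


def open_listener(bind_addr, port):
    """Create a UDP socket bound to bind_addr:port (port 0 picks a free port)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((bind_addr, port))
    return sock


def capture(sock, log_path, max_packets=None, timeout=None):
    """Append each received datagram to log_path, fsyncing after every write.

    Stops after max_packets datagrams, or when no datagram arrives within
    timeout seconds. With both left as None it runs until interrupted.
    Returns the number of datagrams written.
    """
    sock.settimeout(timeout)
    received = 0
    try:
        with open(log_path, "ab") as log:
            while max_packets is None or received < max_packets:
                try:
                    data, (src_ip, _src_port) = sock.recvfrom(65535)
                except socket.timeout:
                    break
                # Timestamp is the receiver's local time, not the sender's.
                stamp = time.strftime("%Y-%m-%d %H:%M:%S").encode()
                line = b"[" + stamp + b"] " + src_ip.encode() + b" " + data.rstrip(b"\n") + b"\n"
                log.write(line)
                log.flush()
                os.fsync(log.fileno())  # force the line to disk immediately
                received += 1
    finally:
        sock.close()
    return received


# Example use on the logging machine (assumed port, hypothetical file name):
#   capture(open_listener("0.0.0.0", 6666), "netconsole.log")
```

Running the listener on a second machine and leaving it up across several crash attempts gives one timestamped file to compare against the time of each power-off.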

I'm currently running memtest86+ (which will be an overnight thing).

The problem shows up under both 3.15/16.x and 4.0.5/6 (I can't check the exact minor versions while the machine is running memtest). The 3-series kernel had been stable for over a year.

Motherboard is an MSI Z77A-GD65, and I'm looking into running motherboard-level diagnostics on it (once the memtest finishes). I can't see a POST screen; not sure if it needs to be enabled in the BIOS. I may invest in a power supply tester to rule the PSU out.

My fear is that some rogue static borked something in the hardware. I'm hoping the memtest shows a problem, because at least I'll know where the error is and can just purchase some new memory. If memtest is clean -- and there are no better ideas -- I'll do the crap-tastic task of removing all hardware, plugging each item back in, and trying to elicit the failure.


My questions:

(1) Other things to try before going the one-hardware-item at a time route? The failure time is relatively long and this process would *suck*.
(2) Other ways to get more kernel panic/failure information?
(3) Current diagnostic-based bootable images?

Thanks,
Mark
russK
l33t


Joined: 27 Jun 2006
Posts: 630

Posted: Sat Oct 10, 2015 3:57 am

Did you remove the old SATA drives that you were checking out?
Roman_Gruber
Advocate


Joined: 03 Oct 2006
Posts: 3806
Location: Austro Bavaria

Posted: Sat Oct 10, 2015 10:26 am

The MSI Z77A-GD65 looks like an overclocker's mainboard.

Are you running safe settings?

Have you already swapped your power supply? (For a proven stable, good PSU; what counts as a good power supply for desktop boxes is a topic of its own.)

If you believe it is the software, boot from a live ISO like sysrescuecd.

If you are running anything overclocked => BIOS failsafe settings, back to base clocks.

Changing the PSU sometimes helps ... when this is an option.

(3) sysrescuecd, as it comes with gparted and basic tools.
Hobbes2100
n00b


Joined: 24 Apr 2002
Posts: 50

Posted: Sat Oct 10, 2015 11:17 am

Thanks for the thoughts,

1. The old SATA drives are removed.

2. The system is not overclocked at all.

3. I don't have an alternate power supply on hand, but I will at least test the current one ... and possibly get a new one to test.

4. memtest86+ passed

I'm also going to try to boot from a gentoo LiveCD and do a hard compile to make sure none of the software stack has been corrupted.

Other thoughts?

Best,
Mark
Roman_Gruber
Advocate


Joined: 03 Oct 2006
Posts: 3806
Location: Austro Bavaria

Posted: Sat Oct 10, 2015 4:54 pm

Did you visually check your mainboard for blown or nearly dead capacitors? The electrolytic ones tend to bulge or blow, and then they fail. It seems you have an expensive mainboard, but it is still worth a visual check.

Something like this: https://en.wikipedia.org/wiki/Capacitor_plague

It happens slowly over time, especially with capacitors that have lower temperature ratings or sit in poor airflow.
Hobbes2100
n00b


Joined: 24 Apr 2002
Posts: 50

Posted: Fri Oct 16, 2015 2:24 am

Amazingly,

As of today, my system (all the same components, except two raid stacks still not plugged in) ... did completely stable compiles of gcc, chromium, and two runs of emerge -e system. Been running stable for two days now under normal use. I'm assuming that when I moved hardware around, I kicked up some dust (I vacuumed later) and/or some stray current tweaked something ... but not too critically.

*sigh*

Hardware. That's why I'm a software guy.

Best,
Mark
All times are GMT