Gentoo Forums
Help with system recovery
justin_brody
Apprentice

Joined: 26 Jan 2005
Posts: 224

Posted: Mon Jul 09, 2018 4:22 pm    Post subject: Help with system recovery

Hello,
My system has been acting quite strangely since last Friday. Some highlights:

  • A process I was running kept dying with the message "killed". It had run fine several times before that. For what it's worth, it was hot here and the process was a CUDA process running on a Titan X.
  • Relatedly, there's this error in my log:
    Code:

    Jul  6 11:03:39 [hostname] kernel: BUG: unable to handle kernel NULL pointer dereference at 00000000000004b9
    Jul  6 11:03:39 [hostname] kernel: IP: [<ffffffff81188bca>] __delete_from_page_cache+0x7a/0x3b0
    Jul  6 11:03:39 [hostname] kernel: PGD 0
    Jul  6 11:03:39 [hostname] kernel:
    Jul  6 11:03:39 [hostname] kernel: Oops: 0000 [#1] SMP
    Jul  6 11:03:39 [hostname] kernel: Modules linked in: nvidia_uvm(PO) nvidia_drm(PO) nvidia_modeset(PO) nvidia(PO)
    Jul  6 11:03:39 [hostname] kernel: CPU: 14 PID: 117 Comm: kswapd0 Tainted: P           O    4.9.95-gentoo #1
    Jul  6 11:03:39 [hostname] kernel: Hardware name: EVGA INTERNATIONAL CO.,LTD Default string/142-SX-E297, BIOS 1.11 03/14/2018
    Jul  6 11:03:39 [hostname] kernel: task: ffff88105e2313c0 task.stack: ffffc90007054000
    Jul  6 11:03:39 [hostname] kernel: RIP: 0010:[<ffffffff81188bca>]  [<ffffffff81188bca>] __delete_from_page_cache+0x7a/0x3b0
    Jul  6 11:03:39 [hostname] kernel: RSP: 0018:ffffc90007057a58  EFLAGS: 00010006
    Jul  6 11:03:39 [hostname] kernel: RAX: 0000000000000001 RBX: ffffea000c9265c0 RCX: 0000000000000000
    Jul  6 11:03:39 [hostname] kernel: RDX: 0000000000000000 RSI: 0000183a2c000102 RDI: ffffea000c9265c0
    Jul  6 11:03:39 [hostname] kernel: RBP: ffffc90007057a98 R08: 000000000001c0c8 R09: ffff88100a0c1778
    Jul  6 11:03:39 [hostname] kernel: R10: 0000000000000020 R11: 0000000000000000 R12: 0000183a2c000102
    Jul  6 11:03:39 [hostname] kernel: R13: ffff88100a0c1790 R14: ffff88100a0c1778 R15: ffff88100a0c1778
    Jul  6 11:03:39 [hostname] kernel: FS:  0000000000000000(0000) GS:ffff88105ed80000(0000) knlGS:0000000000000000
    Jul  6 11:03:39 [hostname] kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Jul  6 11:03:39 [hostname] kernel: CR2: 00000000000004b9 CR3: 0000000004608000 CR4: 0000000000360670
    Jul  6 11:03:39 [hostname] kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    Jul  6 11:03:39 [hostname] kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Jul  6 11:03:39 [hostname] kernel: Stack:
    Jul  6 11:03:39 [hostname] kernel:  ffffc90000000001 ffffea000c9265e0 ffffc90007057c58 ffffea000c9265c0
    Jul  6 11:03:39 [hostname] kernel:  0000000000000001 ffff88100a0c1790 ffff88100a0c1778 0000000000000246
    Jul  6 11:03:39 [hostname] kernel:  ffffc90007057ad8 ffffffff8119a348 0000000000000000 ffffea000c9265c0
    Jul  6 11:03:39 [hostname] kernel: Call Trace:
    Jul  6 11:03:39 [hostname] kernel:  [<ffffffff8119a348>] __remove_mapping+0xf8/0x190
    Jul  6 11:03:39 [hostname] kernel:  [<ffffffff8119c5e0>] shrink_page_list+0x550/0x940
    Jul  6 11:03:39 [hostname] kernel:  [<ffffffff8119d15b>] shrink_inactive_list+0x1eb/0x4a0
    Jul  6 11:03:39 [hostname] kernel:  [<ffffffff8119dcfb>] shrink_node_memcg+0x59b/0x740
    Jul  6 11:03:39 [hostname] kernel:  [<ffffffff81f52eb0>] ? __switch_to_asm+0x40/0x70
    Jul  6 11:03:39 [hostname] kernel:  [<ffffffff81f52eb0>] ? __switch_to_asm+0x40/0x70
    Jul  6 11:03:39 [hostname] kernel:  [<ffffffff8119df64>] shrink_node+0xc4/0x2e0
    Jul  6 11:03:39 [hostname] kernel:  [<ffffffff8119eba2>] kswapd+0x2b2/0x640
    Jul  6 11:03:39 [hostname] kernel:  [<ffffffff8119e8f0>] ? mem_cgroup_shrink_node+0x90/0x90
    Jul  6 11:03:39 [hostname] kernel:  [<ffffffff810f02e2>] kthread+0xd2/0xf0
    Jul  6 11:03:39 [hostname] kernel:  [<ffffffff81f52ea4>] ? __switch_to_asm+0x34/0x70
    Jul  6 11:03:39 [hostname] kernel:  [<ffffffff810f0210>] ? kthread_park+0x60/0x60
    Jul  6 11:03:39 [hostname] kernel:  [<ffffffff81f52f37>] ret_from_fork+0x57/0x70
    Jul  6 11:03:39 [hostname] kernel: Code: 01 00 00 48 8b 57 20 48 8d 42 ff 83 e2 01 48 0f 44 c7 48 8b 00 a9 00 00 01 00 0f 84 88 01 00 00 48 8b 47 08 48 8b 00 48 8b 40 28 <8b> 90 b8 04 00 00 85 d2 78 05 e8 a7 5a 07 00 48 89 df e8 df d4
    Jul  6 11:03:39 [hostname] kernel: RIP  [<ffffffff81188bca>] __delete_from_page_cache+0x7a/0x3b0
    Jul  6 11:03:39 [hostname] kernel:  RSP <ffffc90007057a58>
    Jul  6 11:03:39 [hostname] kernel: CR2: 00000000000004b9
    Jul  6 11:03:39 [hostname] kernel: ---[ end trace e984d167a3afafc8 ]---
    Jul  6 11:03:39 [hostname] kernel: BUG: unable to handle kernel NULL pointer dereference at 00000000000000d1
    Jul  6 11:03:39 [hostname] kernel: IP: [<ffffffff812f4312>] ext4_mpage_readpages+0x82/0x8f0
    Jul  6 11:03:39 [hostname] kernel: PGD 8000000fd0e1d067
    Jul  6 11:03:39 [hostname] kernel: PUD 0
    Jul  6 11:03:39 [hostname] kernel:
    Jul  6 11:03:39 [hostname] kernel: Oops: 0000 [#2] SMP
    Jul  6 11:03:39 [hostname] kernel: Modules linked in: nvidia_uvm(PO) nvidia_drm(PO) nvidia_modeset(PO) nvidia(PO)
    Jul  6 11:03:39 [hostname] kernel: CPU: 0 PID: 6622 Comm: python3 Tainted: P      D    O    4.9.95-gentoo #1
    Jul  6 11:03:39 [hostname] kernel: Hardware name: EVGA INTERNATIONAL CO.,LTD Default string/142-SX-E297, BIOS 1.11 03/14/2018
    Jul  6 11:03:39 [hostname] kernel: task: ffff881059ca93c0 task.stack: ffffc90009558000
    Jul  6 11:03:39 [hostname] kernel: RIP: 0010:[<ffffffff812f4312>]  [<ffffffff812f4312>] ext4_mpage_readpages+0x82/0x8f0
    Jul  6 11:03:39 [hostname] kernel: RSP: 0000:ffffc9000955baf8  EFLAGS: 00010202
    Jul  6 11:03:39 [hostname] kernel: RAX: 0000000000000001 RBX: 000000000000000c RCX: 000000000000000c
    Jul  6 11:03:39 [hostname] kernel: RDX: 0000000000000001 RSI: 0000000000000020 RDI: ffff88100a0c1600
    Jul  6 11:03:39 [hostname] kernel: RBP: ffffc9000955bbe0 R08: 000000000001c070 R09: 0000000000000002
    Jul  6 11:03:39 [hostname] kernel: R10: ffff88109fff8000 R11: 0000000000000000 R12: 0000000000000000
    Jul  6 11:03:39 [hostname] kernel: R13: 0000000000000020 R14: 0000000000000020 R15: 000000000095ff40
    Jul  6 11:03:39 [hostname] kernel: FS:  00007ffb93e905c0(0000) GS:ffff88105ea00000(0000) knlGS:0000000000000000
    Jul  6 11:03:39 [hostname] kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Jul  6 11:03:39 [hostname] kernel: CR2: 00000000000000d1 CR3: 0000000fd38d0000 CR4: 0000000000360670
    Jul  6 11:03:39 [hostname] kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    Jul  6 11:03:39 [hostname] kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Jul  6 11:03:39 [hostname] kernel: Stack:
    Jul  6 11:03:39 [hostname] kernel:  ffffffff81f52ea4 ffffc9000955bb58 ffffffff810698b7 ffff88109fff9700
    Jul  6 11:03:39 [hostname] kernel:  0000000000000000 0000000c00001000 ffff88100a0c1778 0000000000000020
    Jul  6 11:03:39 [hostname] kernel:  ffffc9000955bb58 ffffc9000955bc40 ffffea000c950d80 ffff88100a0c1600
    Jul  6 11:03:39 [hostname] kernel: Call Trace:
    Jul  6 11:03:39 [hostname] kernel:  [<ffffffff81f52ea4>] ? __switch_to_asm+0x34/0x70
    Jul  6 11:03:39 [hostname] kernel:  [<ffffffff810698b7>] ? __switch_to+0x2b7/0x4a0
    Jul  6 11:03:39 [hostname] kernel:  [<ffffffff812b7760>] ext4_readpages+0x30/0x40
    Jul  6 11:03:39 [hostname] kernel:  [<ffffffff811967ae>] __do_page_cache_readahead+0x16e/0x230
    Jul  6 11:03:39 [hostname] kernel:  [<ffffffff811969a6>] ondemand_readahead+0x136/0x230
    Jul  6 11:03:39 [hostname] kernel:  [<ffffffff8156a403>] ? radix_tree_lookup_slot+0x13/0x30
    Jul  6 11:03:39 [hostname] kernel:  [<ffffffff81196b06>] page_cache_async_readahead+0x66/0x70
    Jul  6 11:03:39 [hostname] kernel:  [<ffffffff8118a321>] filemap_fault+0x211/0x4e0
    Jul  6 11:03:39 [hostname] kernel:  [<ffffffff812bfdd1>] ext4_filemap_fault+0x31/0x50
    Jul  6 11:03:39 [hostname] kernel:  [<ffffffff811b357c>] __do_fault+0x6c/0x130
    Jul  6 11:03:39 [hostname] kernel:  [<ffffffff811b979e>] handle_mm_fault+0xe3e/0x1420
    Jul  6 11:03:39 [hostname] kernel:  [<ffffffff81f52eb0>] ? __switch_to_asm+0x40/0x70
    Jul  6 11:03:39 [hostname] kernel:  [<ffffffff81f52ea4>] ? __switch_to_asm+0x34/0x70
    Jul  6 11:03:39 [hostname] kernel:  [<ffffffff8109a078>] __do_page_fault+0x218/0x450
    Jul  6 11:03:39 [hostname] kernel:  [<ffffffff8109a302>] do_page_fault+0x22/0x30
    Jul  6 11:03:39 [hostname] kernel:  [<ffffffff81f54768>] page_fault+0x28/0x30
    Jul  6 11:03:39 [hostname] kernel: Code: 00 00 00 00 c7 45 90 00 00 00 00 89 c3 89 85 44 ff ff ff b8 00 10 00 00 89 d9 d3 e2 48 d3 e8 85 f6 89 95 40 ff ff ff 48 8b 57 28 <48> 8b ba d0 00 00 00 48 89 bd 20 ff ff ff 0f 84 9d 06 00 00 41
    Jul  6 11:03:39 [hostname] kernel: RIP  [<ffffffff812f4312>] ext4_mpage_readpages+0x82/0x8f0
    Jul  6 11:03:39 [hostname] kernel:  RSP <ffffc9000955baf8>
    Jul  6 11:03:39 [hostname] kernel: CR2: 00000000000000d1
    Jul  6 11:03:39 [hostname] kernel: ---[ end trace e984d167a3afafc9 ]---

  • I tried to downgrade the nvidia-drivers module, but it wanted to go back to gcc 6.4 and that failed to compile.
  • Just tried reemerging nvidia-toolkit-drivers but it's saying that the hash (SHA256 I think) is off. Downloading fresh copies of the distfile and ebuild and re-emerging gzip don't seem to help.
  • I ran "memtester" (user space util) and it came back with a bunch of errors. However, running memtest86 from a USB was fine.
  • Between the issues of Friday (the "killed" processes and nvidia module crash) I tried to rebuild the kernel. At some point this process seemed to freak out.
  • All day Friday I had people probing the machine and trying to break in. To my knowledge they didn't get in but perhaps it's related?


So my guess is that at least my kernel is corrupt (I will try rebooting with an old one). Any advice on what else to try would be appreciated. I might be posting prematurely; I only figured out that I probably have a bad kernel while writing this up. But there's enough going on here that I'd appreciate any advice! Could this be a hardware issue (I'm thinking potentially overheating, but I didn't see anything in the system log)? A corrupted system? I guess that just calls for "emerge -e world"? Anything else?
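
The two oopses in the log above hit unrelated code paths (page cache reclaim and an ext4 read) within the same second. For skimming a longer log, here is a minimal Python sketch that pairs each faulting function with the process it hit. The function name `summarize_oopses` and the regular expressions are illustrative assumptions based on the 4.9-style lines shown above; other kernel versions format these lines differently.

```python
import re

# Matches the "IP:" line of a 4.9-style oops (not the later "RIP:" lines).
IP_RE = re.compile(r"kernel: IP: \[<[0-9a-f]+>\] (\w+)\+0x")
# Matches the process name on the "CPU: ... PID: ... Comm: ..." line.
COMM_RE = re.compile(r"Comm: (\S+)")


def summarize_oopses(log_text):
    """Return (faulting_function, process_name) pairs in log order.

    Assumption: every oops contributes exactly one "IP:" line and one
    "Comm:" line, so the two lists can be zipped together.
    """
    functions = IP_RE.findall(log_text)
    processes = COMM_RE.findall(log_text)
    return list(zip(functions, processes))


sample = """\
Jul  6 11:03:39 [hostname] kernel: IP: [<ffffffff81188bca>] __delete_from_page_cache+0x7a/0x3b0
Jul  6 11:03:39 [hostname] kernel: CPU: 14 PID: 117 Comm: kswapd0 Tainted: P           O    4.9.95-gentoo #1
Jul  6 11:03:39 [hostname] kernel: IP: [<ffffffff812f4312>] ext4_mpage_readpages+0x82/0x8f0
Jul  6 11:03:39 [hostname] kernel: CPU: 0 PID: 6622 Comm: python3 Tainted: P      D    O    4.9.95-gentoo #1
"""

for function, process in summarize_oopses(sample):
    print(f"{process}: faulted in {function}")
```

Faults landing in different subsystems and different processes at the same moment are more typical of a hardware or memory problem than of a single software bug, though a log summary alone cannot prove either.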

justin_brody
Apprentice

Joined: 26 Jan 2005
Posts: 224

Posted: Mon Jul 09, 2018 4:39 pm

Update: just rebooted the machine with an old kernel (from before I was having trouble). Still can't compile the kernel; failing with:
Code:

    CC      arch/x86/kernel/cpu/mcheck/therm_throt.o
arch/x86/kernel/cpu/mcheck/therm_throt.c: In function ‘intel_thermal_supported’:
arch/x86/kernel/cpu/mcheck/therm_throt.c:460:1: internal compiler error: Illegal instruction
 }
 ^
Please submit a full bug report,
with preprocessed source if appropriate.
See <https://bugs.gentoo.org/> for instructions.
make[4]: *** [scripts/Makefile.build:294: arch/x86/kernel/cpu/mcheck/therm_throt.o] Error 1
make[3]: *** [scripts/Makefile.build:544: arch/x86/kernel/cpu/mcheck] Error 2
make[2]: *** [scripts/Makefile.build:544: arch/x86/kernel/cpu] Error 2
make[1]: *** [scripts/Makefile.build:544: arch/x86/kernel] Error 2
make: *** [Makefile:1006: arch/x86] Error 2


I ran make clean and tried again; it got past that point but failed later:
Code:


CC      arch/x86/kvm/../../../virt/kvm/async_pf.o
  CC      arch/x86/kvm/x86.o
arch/x86/kvm/x86.c: In function ‘kvm_vm_ioctl_get_dirty_log’:
arch/x86/kvm/x86.c:8611:1: internal compiler error: Segmentation fault
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_avic_incomplete_ipi);
 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
Please submit a full bug report,
with preprocessed source if appropriate.
See <https://bugs.gentoo.org/> for instructions.
make[2]: *** [scripts/Makefile.build:294: arch/x86/kvm/x86.o] Error 1
make[1]: *** [scripts/Makefile.build:544: arch/x86/kvm] Error 2
make: *** [Makefile:1006: arch/x86] Error 2

This is starting to feel strongly like a motherboard issue. The motherboard was super-flaky when I got it but seemed to stabilize when I upgraded the BIOS.

Are there tools I can use to be more definitive? If I call the motherboard manufacturer, what do I tell them? Is there a test I can point to that would indicate the failure is in the motherboard? For what it's worth, running MemTest86 (PassMark version) from a USB stick came up fine.

NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 43184
Location: 56N 3W

Posted: Mon Jul 09, 2018 7:37 pm

justin_brody,

Code:
internal compiler error: Illegal instruction
That says that your gcc contains instructions that your CPU cannot execute.
If it's real, your CFLAGS are set to allow gcc to produce code that will not run on your system.
It may be CPU_FLAGS_X86 too.

If that's not the cause, it's probably hardware. dmesg (all of it) might show more, once the error has occurred.

There is a third possibility that I only mention for completeness: gcc itself may actually be faulty, but then you would not be the only one with this issue.

Code:
internal compiler error: Segmentation fault
means that gcc tried to access memory that it didn't own.

Taken together, it points to hardware issues but CFLAGS/CPU_FLAGS_X86 changes are not yet ruled out.
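
One way to test the CFLAGS theory is to compare the instruction-set flags the CPU advertises with the ones the build settings rely on. The following is a minimal Python sketch; the helper names and the list of required flags are illustrative assumptions, not an authoritative list of what any particular `-march` value needs.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_flags(cpuinfo_text, required):
    """Return the required flags that the CPU does not advertise, sorted."""
    return sorted(set(required) - cpu_flags(cpuinfo_text))


# Shortened, made-up cpuinfo excerpt for demonstration only.
sample = "processor\t: 0\nflags\t\t: fpu sse sse2 avx avx2 fma bmi2\n"

# Hypothetical requirement list; adjust it to the features you actually enabled.
print(missing_flags(sample, ["avx2", "fma", "avx512f"]))  # → ['avx512f']
```

On a real system the text would come from reading `/proc/cpuinfo`. An empty result means the CPU advertises every flag in the list, which would point away from a CFLAGS mismatch and toward hardware.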


memtest86 is only useful if you boot into it. The errors that it returns don't always indicate RAM problems either.
If you get a hard error, that is, the same issue at the same address with the same error pattern on every cycle, it's probably RAM.
You test that theory by swapping RAM sticks around and looking for changes in the address or error pattern.

If the errors are not repeatable, it may not be RAM. The memory controller, CPU, RAM, PSU and on-board PSU are all involved.
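
The repeatability rule above can be written down as a small check. This is a minimal Python sketch under stated assumptions: the record format `(pass_number, address, failing_bits)` and the function name are hypothetical, and real memory testers report errors in their own formats.

```python
from collections import defaultdict


def classify_errors(errors, passes):
    """Classify memory-test errors as repeatable or not.

    errors: iterable of (pass_number, address, failing_bits) tuples.
    passes: total number of test passes that were run.
    """
    # Record in which passes each (address, failing_bits) pair was seen.
    seen = defaultdict(set)
    for pass_no, address, bits in errors:
        seen[(address, bits)].add(pass_no)
    if not seen:
        return "no errors"
    # A hard error appears at the same address with the same pattern in every pass.
    if any(len(pass_set) == passes for pass_set in seen.values()):
        return "repeatable: suspect the RAM stick"
    return "not repeatable: RAM, memory controller, CPU or PSU"


# Made-up error records for demonstration only.
hard = [(1, 0x1F3A2C000, 0x04), (2, 0x1F3A2C000, 0x04), (3, 0x1F3A2C000, 0x04)]
soft = [(1, 0x1F3A2C000, 0x04), (3, 0x0A7710008, 0x80)]

print(classify_errors(hard, passes=3))
print(classify_errors(soft, passes=3))
```

If swapping sticks moves a repeatable error to a different address, that supports the RAM theory; errors that wander between runs leave the other components in play.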
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.

LuxJux
Guru

Joined: 01 Mar 2016
Posts: 300

Posted: Mon Jul 09, 2018 8:42 pm

justin_brody wrote:
Are there tools I can use to be more definitive?

Take a look here and update your make.conf

https://wiki.gentoo.org/wiki/CPU_FLAGS_X86


Maybe it helps, maybe not.

justin_brody
Apprentice

Joined: 26 Jan 2005
Posts: 224

Posted: Mon Jul 09, 2018 10:40 pm

Many thanks to both of you for your insights. The machine is offline at my work right now so I can't check the make.conf immediately, but I don't think it should be the GCC flags. The machine has a 7820X processor and I'm pretty sure my flags were "-O2 -march=skylake". I was running a lot of numerical computing code (scipy + tensorflow) that was compiled for the machine, and I would guess it was using some of the more advanced CPU features. No issues on that score until last Friday. I will definitely try to verify tomorrow morning though...

In terms of the RAM, when I boot into memtest86 I get no errors, and the errors I was getting were not consistent. So I'm guessing it's not the RAM.

Thanks again for the help!!

justin_brody
Apprentice

Joined: 26 Jan 2005
Posts: 224

Posted: Tue Jul 10, 2018 6:49 pm

O.k., so the saga continues. I downloaded "the ultimate boot cd" and ran some of the stress tests on it. Everything came back fine. So, wondering if my installation was just corrupt, I booted a "System Rescue CD" and started the installation fresh. Running emerge-webrsync gave me "xzcat: /var/tmp/portage/webrsync-C)8LrX/portage-20180709.xz: Compressed data is corrupt."

Reminds me of the problems from yesterday, but now I'm in what should be a really safe environment :(

Very frustrated because I only seem to be getting these issues when I'm in a running OS, not in any of the testing environments.

What can I do at this point? Should I ask for a new motherboard? Is that the most likely issue? I think I can rule out the power supply, because I used the same power supply in a previous build which ran perfectly. Any advice appreciated!!
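
A corrupt archive in a freshly booted rescue environment raises the question of whether the machine gives the same answer twice for the same data. Below is a minimal Python sketch that hashes one file several times and reports whether the digests agree. The function names are illustrative, and the demo uses a throwaway temporary file rather than a real download. Note the assumption that repeated reads are probably served from the page cache, so this mostly exercises RAM and CPU rather than the disk.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives do not need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def distinct_digests(path, rounds=5):
    """Hash the same file several times and return the set of digests seen."""
    return {sha256_of(path) for _ in range(rounds)}


# Demo on a throwaway file; on a real system, point this at the suspect archive.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(4 * 1024 * 1024))
    demo_path = tmp.name
try:
    digests = distinct_digests(demo_path)
    if len(digests) == 1:
        print("stable: every round produced the same digest")
    else:
        print("UNSTABLE: %d different digests for the same file" % len(digests))
finally:
    os.remove(demo_path)
```

More than one digest for an unchanged file cannot be explained by a bad download, so it would point at the hardware doing the reading and hashing.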

ali3nx
l33t

Joined: 21 Sep 2003
Posts: 612
Location: Winnipeg, Canada

Posted: Tue Jul 10, 2018 7:06 pm

justin_brody wrote:
O.k., so the saga continues. I downloaded "the ultimate boot cd" and ran some of the stress tests on it. Everything came back fine. So, wondering if my installation was just corrupt, I booted a "System Rescue CD" and started the installation fresh. Running emerge-webrsync gave me "xzcat: /var/tmp/portage/webrsync-C)8LrX/portage-20180709.xz: Compressed data is corrupt."

Reminds me of the problems from yesterday, but now I'm in what should be a really safe environment :(

Very frustrated because I only seem to be getting these issues when I'm in a running OS, not in any of the testing environments.

What can I do at this point? Should I ask for a new motherboard? Is that the most likely issue? I think I can rule out the power supply, because I used the same power supply in a previous build which ran perfectly. Any advice appreciated!!


Were you running an SSD on this system, are you running systemd or openrc, and did you have fstrim configured?

One consideration with regard to systemd builds is that systemd runs a lot of the core system in tmpfs. I haven't used openrc-based builds in years because systemd, in my experience, performs better due to heavy use of tmpfs and service start parallelism.

With that mentioned, maybe it's a hard disk issue if you do have an SSD and aren't using tmpfs for portage.

https://wiki.gentoo.org/wiki/Portage_TMPDIR_on_tmpfs

Also would be good to know if you have or did configure tmpfs for portage.

Corrupt data inside the chroot failing to match checksums leads me to believe there may be an issue with the hard disk or ssd.

If you do use an ssd you must configure fstrim to run weekly or you will have problems.

Now if you are using systemd or openrc and getting compile failures while compiling in tmpfs that may point towards a hardware issue unrelated to a hard disk failure or an ssd that hasn't been trimmed long enough to cause data write failures.
_________________
Compiling Gentoo since version 1.4
Thousands of Gentoo Installs Completed
Emerged on every continent but Antarctica
Compile long and Prosper!

NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 43184
Location: 56N 3W

Posted: Tue Jul 10, 2018 8:02 pm

justin_brody,

emerge smartmontools and run
smartctl -a /dev/sd....
Do that for each drive in your system.

Post the output. It will tell us what the HDD thinks is happening.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.

justin_brody
Apprentice

Joined: 26 Jan 2005
Posts: 224

Posted: Tue Jul 10, 2018 8:52 pm

Dear ali3nx and NeddySeagoon,

Many, many thanks to both of you for these suggestions. I was starting to wonder about hard drive failure as well. Some data:

  • The drive is not an SSD and I run openrc rather than systemd.
  • I have not configured tmpfs.
  • I've only been using a single drive. It's actually attached to my old motherboard right now because I'm trying to get a workable machine up and running. The output from smartctl:
    Code:

     ~ # smartctl -a /dev/sdb
    smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.14.52-gentoo] (local build)
    Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

    === START OF INFORMATION SECTION ===
    Model Family:     Seagate Barracuda 3.5
    Device Model:     ST2000DM006-2DM164
    Serial Number:    Z4ZB8BKQ
    LU WWN Device Id: 5 000c50 0b12b3ddf
    Firmware Version: CC26
    User Capacity:    2,000,398,934,016 bytes [2.00 TB]
    Sector Sizes:     512 bytes logical, 4096 bytes physical
    Rotation Rate:    7200 rpm
    Form Factor:      3.5 inches
    Device is:        In smartctl database [for details use: -P show]
    ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
    SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
    Local Time is:    Tue Jul 10 16:46:33 2018 EDT
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled

    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED

    General SMART Values:
    Offline data collection status:  (0x00) Offline data collection activity
                                            was never started.
                                            Auto Offline Data Collection: Disabled.
    Self-test execution status:      (   0) The previous self-test routine completed
                                            without error or no self-test has ever
                                            been run.
    Total time to complete Offline
    data collection:                (   80) seconds.
    Offline data collection
    capabilities:                    (0x73) SMART execute Offline immediate.
                                            Auto Offline data collection on/off support.
                                            Suspend Offline collection upon new
                                            command.
                                            No Offline surface scan supported.
                                            Self-test supported.
                                            Conveyance Self-test supported.
                                            Selective Self-test supported.
    SMART capabilities:            (0x0003) Saves SMART data before entering
                                            power-saving mode.
                                            Supports SMART auto save timer.
    Error logging capability:        (0x01) Error logging supported.
                                            General Purpose Logging supported.
    Short self-test routine
    recommended polling time:        (   1) minutes.
    Extended self-test routine
    recommended polling time:        ( 210) minutes.
    Conveyance self-test routine
    recommended polling time:        (   2) minutes.
    SCT capabilities:              (0x1085) SCT Status supported.

    SMART Attributes Data Structure revision number: 10
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x000f   114   099   006    Pre-fail  Always       -       67533200
      3 Spin_Up_Time            0x0003   096   096   000    Pre-fail  Always       -       0
      4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       87
      5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
      7 Seek_Error_Rate         0x000f   064   060   030    Pre-fail  Always       -       4298079004
      9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       621
     10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
     12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       87
    183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
    184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
    187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
    188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
    189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
    190 Airflow_Temperature_Cel 0x0022   067   062   045    Old_age   Always       -       33 (Min/Max 28/35)
    191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
    192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       49
    193 Load_Cycle_Count        0x0032   098   098   000    Old_age   Always       -       4441
    194 Temperature_Celsius     0x0022   033   040   000    Old_age   Always       -       33 (0 23 0 0 0)
    197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
    198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
    199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
    240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       358h+15m+45.270s
    241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       2053864309
    242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       22112340581

    SMART Error Log Version: 1
    No Errors Logged

    SMART Self-test log structure revision number 1
    No self-tests have been logged.  [To run self-tests, use: smartctl -t]

    SMART Selective self-test log data structure revision number 1
     SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
        1        0        0  Not_testing
        2        0        0  Not_testing
        3        0        0  Not_testing
        4        0        0  Not_testing
        5        0        0  Not_testing
    Selective self-test flags (0x0):
      After scanning selected spans, do NOT read-scan remainder of disk.
    If Selective self-test is pending on power-up, resume after 0 minute delay.



I'll put the drive back in the machine that's giving me trouble and post the results from that. I suppose errors there would point to the hard drive controller?

Thanks again for the help troubleshooting this -- I'm way out of my depth here!

justin_brody
Apprentice

Joined: 26 Jan 2005
Posts: 224

Posted: Tue Jul 10, 2018 9:00 pm

O.k., from what I can tell it's giving the same output when run in the other machine:
Code:

 % smartctl -a /dev/sda 
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.14.32-std522-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 3.5
Device Model:     ST2000DM006-2DM164
Serial Number:    Z4ZB8BKQ
LU WWN Device Id: 5 000c50 0b12b3ddf
Firmware Version: CC26
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Jul 10 20:56:21 2018 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)   Offline data collection activity
               was never started.
               Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)   The previous self-test routine completed
               without error or no self-test has ever
               been run.
Total time to complete Offline
data collection:       (   80) seconds.
Offline data collection
capabilities:           (0x73) SMART execute Offline immediate.
               Auto Offline data collection on/off support.
               Suspend Offline collection upon new
               command.
               No Offline surface scan supported.
               Self-test supported.
               Conveyance Self-test supported.
               Selective Self-test supported.
SMART capabilities:            (0x0003)   Saves SMART data before entering
               power-saving mode.
               Supports SMART auto save timer.
Error logging capability:        (0x01)   Error logging supported.
               General Purpose Logging supported.
Short self-test routine
recommended polling time:     (   1) minutes.
Extended self-test routine
recommended polling time:     ( 210) minutes.
Conveyance self-test routine
recommended polling time:     (   2) minutes.
SCT capabilities:           (0x1085)   SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   114   099   006    Pre-fail  Always       -       67739800
  3 Spin_Up_Time            0x0003   096   096   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       88
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   064   060   030    Pre-fail  Always       -       4298079163
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       621
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       88
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   069   062   045    Old_age   Always       -       31 (Min/Max 31/31)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       49
193 Load_Cycle_Count        0x0032   098   098   000    Old_age   Always       -       4443
194 Temperature_Celsius     0x0022   031   040   000    Old_age   Always       -       31 (0 23 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       358h+17m+15.791s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       2053864821
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       22112371425

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.



As I'm typing this I'm remembering that the live USB distros have seemed somewhat flaky on this machine as well. For example, Firefox tabs were crashing the last time I booted into the one I'm using right now...

NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 43184
Location: 56N 3W

Posted: Tue Jul 10, 2018 9:12 pm

justin_brody,

The values under
Code:
VALUE WORST THRESH
are normalised.
If VALUE or WORST <= THRESH the parameter has failed.

Some key ones are
Code:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       621
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
...
SMART Error Log Version: 1
No Errors Logged
So that drive is healthy.

Don't get excited over huge numbers in RAW_VALUE. These values are often packed bit fields and are vendor specific.
This data comes from the drive itself; unless something changes inside the drive, it will be the same wherever you read it.
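
The VALUE/WORST/THRESH rule above is easy to apply mechanically. Here is a minimal Python sketch that scans attribute rows like those printed by `smartctl -A` and lists any attribute whose normalised VALUE or WORST is at or below THRESH. The function name and the column positions are assumptions based on the table layout shown in this thread.

```python
def failed_attributes(table_text):
    """Return names of SMART attributes whose VALUE or WORST is <= THRESH."""
    failed = []
    for line in table_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # this skips the header and any surrounding text.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name = fields[1]
        value, worst, thresh = (int(f) for f in fields[3:6])
        if value <= thresh or worst <= thresh:
            failed.append(name)
    return failed


sample = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       621
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
"""

print(failed_attributes(sample))  # → []
```

The sketch deliberately ignores RAW_VALUE, for the reason given above: raw values are vendor specific and cannot be compared against a common threshold.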
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.

justin_brody
Apprentice

Joined: 26 Jan 2005
Posts: 224

Posted: Tue Jul 10, 2018 9:27 pm

O.k. -- thanks for the information. I'm very happy to report that I just got some negative results on a Mersenne Prime Stress Test. Never been so happy to know I had a genuine hardware problem :)))

The program said it saved info in a file I can't find, but I feel like at this point I can put it down to either the motherboard or the CPU.

This is completely not a Gentoo issue at this point, but just in case: any idea how to tell whether it's the CPU or the motherboard? (Sadly I don't have an extra of either lying around to test with).
Back to top
View user's profile Send private message
NeddySeagoon
Administrator
Administrator


Joined: 05 Jul 2003
Posts: 43184
Location: 56N 3W

PostPosted: Wed Jul 11, 2018 4:43 pm    Post subject: Reply with quote

justin_brody,

Quick Recap.
Memtest86 works the RAM hard but does very little IO
smartctl says your HDD is OK.
The Mersenne Prime Stress Test stresses your CPU but does very little IO.

All that's left is IO, motherboard or PSU.

One straw that's quick and easy to grasp at is the SATA data cable. Testing by replacement is best.
Unplugging it and reconnecting it (both ends) has been known to work for a few months.

Another is the 12 V supply to the motherboard that provides all the CPU and RAM power. It's a 4/6/8-way connector with only black and yellow wires.
Mine is badly charred from overheating. It's worth 'wiping' the contacts. Just unplug it and reconnect it.

After that, it's guesswork.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
All times are GMT