Gentoo Forums
Advice? G4 spontaneous reboots, Oops on bootup!
Forum: Gentoo on PPC
yther
Apprentice
Joined: 25 Oct 2002
Posts: 151
Location: Charlotte, NC (USA)

Posted: Mon Mar 06, 2006 3:17 am    Post subject: Advice? G4 spontaneous reboots, Oops on bootup!

Hi! :D (I'm determined to stay in a good mood here, since getting mad doesn't help me think...)

I've been having some occasional problems with this G4 (SMP @450 MHz, made in 2000), which is running a hardened profile. The one I'd made absolutely no headway on was that it would sometimes reboot by itself. The reboot always happened while I was away from the machine, and the only common factor I can find is that I was compiling something and X was running. I find nothing in the logs that makes me say "Aha!": just a lot of compile-time processes running, and then suddenly I'm looking at boot-up entries.

Now, today I noticed that kernel 2.6.14-hardened-r5 had been marked stable, so I decided to upgrade (from 2.6.11-hardened-r15). Upon reboot, I got an Oops as hotplug was running. (Grsec helpfully spews out a message on the console every time a new process runs, until something hides it part-way through the boot procedure.) The machine rebooted by itself before I could write down the entire message, but I do know it was a signal 4, and I think there was a message about CPU communication.

Note to self: Have digital camera ready next time you reboot, so you can take a picture of the screen!

Naturally, I assumed it was something to do with my brand-new kernel. (Kernel upgrades do not tend to go as smoothly for me on this box as they do on my i686 box.) So, I tried to reboot with my previous kernel. This resulted in another Oops, which I stupidly did not record.

At this point, I was baffled and assumed that something about my current configuration made 2.6.11 unusable. I then booted to a 2005.1 LiveCD, chrooted, corrected the time (the machine suddenly thought it was 1903), and compiled the kernel without the EHCI USB controller enabled. (I could tell this box doesn't have one, so I figured it was worth a shot.) Upon rebooting, things went fairly smoothly: all hard drives were checked, since it had apparently been over 12,000 days since the last check, and I thought I had solved the problem.
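
One possible reading of the 1903 date, offered as an assumption rather than a diagnosis: PowerPC Macs count real-time-clock time in seconds from 1 January 1904 (the classic Mac OS epoch). If that clock is reset to zero and the zero is treated as UTC, then viewing it from a time zone west of UTC lands on 31 December 1903. The sketch below only does that arithmetic; the helper name `rtc_to_datetime` is hypothetical, and UTC-5 is simply US Eastern Standard Time, which matches the poster's location in early March.

```python
from datetime import datetime, timedelta, timezone

# Seconds from the Mac epoch (1904-01-01 00:00 UTC) to the Unix epoch
# (1970-01-01 00:00 UTC): 66 years, 17 of which are leap years.
MAC_TO_UNIX_OFFSET = (66 * 365 + 17) * 86400  # 2082844800

def rtc_to_datetime(rtc_seconds, utc_offset_hours):
    """Hypothetical helper: read a Mac-epoch RTC value as UTC, show it in a fixed zone."""
    unix_epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    # Adding a timedelta avoids platform limits on negative Unix timestamps.
    moment = unix_epoch + timedelta(seconds=rtc_seconds - MAC_TO_UNIX_OFFSET)
    return moment.astimezone(timezone(timedelta(hours=utc_offset_hours)))

# A clock reset to zero, displayed in US Eastern Standard Time (UTC-5):
print(rtc_to_datetime(0, -5))  # → 1903-12-31 19:00:00-05:00
```

This only shows why a zeroed clock could surface as 1903 rather than 1970; it says nothing about why the clock was reset in the first place.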

I decided to reboot once more, to see if the system time would stay set (thinking the 6-year-old battery in the motherboard might need replacing). Well, this time it started up, and I saw the usual flurry of messages from grsec, ending with:
Code:
smp_call_function in cpu 0: other cpus not responding (0)
Oops: kernel access of bad area, sig: 11 [#1]

Now, when I see SIG11 popping up, I immediately think of bad memory. The further fact that this oops was different from the other ones makes it seem fairly random. I turned the power off, waited a few seconds, and started up again, doing nothing differently from the last time. To my surprise, I was able to boot successfully!

So. Random oopses while trying to boot. They seem to be independent of kernel version, and also seem to have to do with communication between the CPUs. Occasional restarts if I compile things while X is running. (Not yet tested with new kernel; I'm going to avoid compiling stuff for the time being.)

Do y'all think it's a memory problem, or something else? Any suggestions on how to narrow it down?

Many thanks!
JoseJX
Retired Dev
Joined: 28 Apr 2002
Posts: 2774

Posted: Mon Mar 06, 2006 3:29 am

Try emerging memtester and see if you have any bad memory.
_________________
Gentoo PPC FAQ: http://www.gentoo.org/doc/en/gentoo-ppc-faq.xml
yther
Apprentice
Joined: 25 Oct 2002
Posts: 151
Location: Charlotte, NC (USA)

Posted: Tue Mar 07, 2006 3:34 am    Post subject: memtester!

Thanks! I knew there must be a PPC memory tester, but for some reason I could never find this utility. (Don't ask me why; all I needed to do was "esearch -Sc memory".) :oops:

I will definitely run this for an hour or two and see what happens!

EDIT: Note that memtester will not run under a PaX kernel; I'll have to boot to a LiveCD so it can grab lots of memory without being killed!
yther
Apprentice
Joined: 25 Oct 2002
Posts: 151
Location: Charlotte, NC (USA)

Posted: Fri Mar 17, 2006 5:03 am

OK, so the memtester results are, so far, inconclusive (no failures, but I wasn't able to grab all physical memory from the LiveCD). Last night the system locked up hard (no reboot); KDE's clock was showing 08:50, but the latest log entry was 06:18. (The last entry was "NETDEV WATCHDOG: eth0: transmit timed out".) Three reboots were required to get past random Oopses, and my mobo battery is definitely dead[1], since I lost the time after powering off for ten seconds. Grrr!
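
memtester can only test the memory it manages to allocate and lock, so whatever the LiveCD itself is using stays untested. As a rough sketch, one can size the run from the MemFree line in /proc/meminfo. The 32 MB headroom and the helper names here are arbitrary assumptions, not anything memtester prescribes; memtester itself takes the amount to test as its first argument (megabytes by default, or with a B/K/M/G suffix) and an optional loop count.

```python
import os
import re

def memfree_kb(meminfo_text):
    """Return the MemFree figure (in kB) from /proc/meminfo-style text."""
    match = re.search(r"^MemFree:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("no MemFree line found")
    return int(match.group(1))

def memtester_size_mb(meminfo_text, headroom_mb=32):
    """Hypothetical helper: megabytes to give memtester, minus some headroom."""
    free_mb = memfree_kb(meminfo_text) // 1024
    return max(free_mb - headroom_mb, 0)

if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as f:
        size_mb = memtester_size_mb(f.read())
    # memtester's first argument is the amount to test; the second is a loop count.
    print(f"memtester {size_mb}M 1")
```

No userspace tester can reach the pages the running kernel occupies, so a clean run narrows the problem down without fully ruling out bad RAM.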

I just rebuilt the kernel, and the only option I changed was disabling preemption of the Big Kernel Lock (CONFIG_PREEMPT_BKL), which had previously defaulted to enabled even though all other preemption was off. I shall run with this for a while and see if there are any improvements.

[1] Edit: Not as definitely as I thought; see below.
yther
Apprentice
Joined: 25 Oct 2002
Posts: 151
Location: Charlotte, NC (USA)

Posted: Mon Mar 27, 2006 10:33 pm    Post subject: Curiouser and Curiouser...

I was just out of town for six days, and I shut this box down for that time. This morning I started it up, and it booted without problems on the first try, including the fsck stage! The date and time had not been lost.

So I suppose I can save the $5 and the time it would take to replace the battery, as it still seems to be OK. My guess now is that either the failure to shut down properly or the lockups during boot are what cause the system time to be reset.

FYI: I had run this configuration for a few days without a hard freeze, but that (as we've seen) doesn't necessarily mean anything. I'm not going to post my whole kernel config unless someone wants to see it, but here is the first part (up to "Advanced setup") of the config for the kernel that's running now.

Does anyone see something there that could cause problems with stability?
Code:
zcat /proc/config.gz

#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.14-hardened-r5
# Thu Mar 16 22:41:37 2006
#
CONFIG_MMU=y
CONFIG_GENERIC_HARDIRQS=y
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_PPC=y
CONFIG_PPC32=y
CONFIG_GENERIC_NVRAM=y
CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y

#
# Code maturity level options
#
CONFIG_EXPERIMENTAL=y
CONFIG_CLEAN_COMPILE=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32

#
# General setup
#
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_POSIX_MQUEUE=y
# CONFIG_BSD_PROCESS_ACCT is not set
CONFIG_SYSCTL=y
# CONFIG_AUDIT is not set
CONFIG_HOTPLUG=y
CONFIG_KOBJECT_UEVENT=y
CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
# CONFIG_CPUSETS is not set
CONFIG_INITRAMFS_SOURCE=""
# CONFIG_EMBEDDED is not set
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_EPOLL=y
CONFIG_SHMEM=y
CONFIG_CC_ALIGN_FUNCTIONS=0
CONFIG_CC_ALIGN_LABELS=0
CONFIG_CC_ALIGN_LOOPS=0
CONFIG_CC_ALIGN_JUMPS=0
# CONFIG_TINY_SHMEM is not set
CONFIG_BASE_SMALL=0

#
# Loadable module support
#
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
CONFIG_MODULE_FORCE_UNLOAD=y
CONFIG_OBSOLETE_MODPARM=y
CONFIG_MODVERSIONS=y
# CONFIG_MODULE_SRCVERSION_ALL is not set
CONFIG_KMOD=y
CONFIG_STOP_MACHINE=y

#
# Processor
#
CONFIG_6xx=y
# CONFIG_40x is not set
# CONFIG_44x is not set
# CONFIG_POWER3 is not set
# CONFIG_POWER4 is not set
# CONFIG_8xx is not set
# CONFIG_E200 is not set
# CONFIG_E500 is not set
CONFIG_PPC_FPU=y
CONFIG_ALTIVEC=y
CONFIG_TAU=y
# CONFIG_TAU_INT is not set
# CONFIG_TAU_AVERAGE is not set
# CONFIG_KEXEC is not set
# CONFIG_CPU_FREQ is not set
# CONFIG_PPC601_SYNC_FIX is not set
# CONFIG_HOTPLUG_CPU is not set
# CONFIG_WANT_EARLY_SERIAL is not set
CONFIG_PPC_STD_MMU=y

#
# Platform options
#
CONFIG_PPC_MULTIPLATFORM=y
# CONFIG_APUS is not set
# CONFIG_KATANA is not set
# CONFIG_WILLOW is not set
# CONFIG_CPCI690 is not set
# CONFIG_POWERPMC250 is not set
# CONFIG_CHESTNUT is not set
# CONFIG_SPRUCE is not set
# CONFIG_HDPU is not set
# CONFIG_EV64260 is not set
# CONFIG_LOPEC is not set
# CONFIG_MVME5100 is not set
# CONFIG_PPLUS is not set
# CONFIG_PRPMC750 is not set
# CONFIG_PRPMC800 is not set
# CONFIG_SANDPOINT is not set
# CONFIG_RADSTONE_PPC7D is not set
# CONFIG_PAL4 is not set
# CONFIG_GEMINI is not set
# CONFIG_EST8260 is not set
# CONFIG_SBC82xx is not set
# CONFIG_SBS8260 is not set
# CONFIG_RPX8260 is not set
# CONFIG_TQM8260 is not set
# CONFIG_ADS8272 is not set
# CONFIG_PQ2FADS is not set
# CONFIG_LITE5200 is not set
# CONFIG_MPC834x_SYS is not set
# CONFIG_EV64360 is not set
CONFIG_PPC_CHRP=y
CONFIG_PPC_PMAC=y
CONFIG_PPC_PREP=y
CONFIG_PPC_OF=y
CONFIG_PPCBUG_NVRAM=y
CONFIG_SMP=y
CONFIG_IRQ_ALL_CPUS=y
CONFIG_NR_CPUS=2
# CONFIG_HIGHMEM is not set
# CONFIG_HZ_100 is not set
CONFIG_HZ_250=y
# CONFIG_HZ_1000 is not set
CONFIG_HZ=250
CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
# CONFIG_PREEMPT_BKL is not set
CONFIG_SELECT_MEMORY_MODEL=y
CONFIG_FLATMEM_MANUAL=y
# CONFIG_DISCONTIGMEM_MANUAL is not set
# CONFIG_SPARSEMEM_MANUAL is not set
CONFIG_FLATMEM=y
CONFIG_FLAT_NODE_MEM_MAP=y
# CONFIG_SPARSEMEM_STATIC is not set
CONFIG_BINFMT_ELF=y
CONFIG_BINFMT_MISC=m
CONFIG_PROC_DEVICETREE=y
CONFIG_PREP_RESIDUAL=y
CONFIG_PROC_PREPRESIDUAL=y
CONFIG_CMDLINE_BOOL=y
CONFIG_CMDLINE="console=ttyS0,9600 console=tty0"
# CONFIG_PM is not set
CONFIG_SECCOMP=y
CONFIG_ISA_DMA_API=y

#
# Bus options
#
# CONFIG_ISA is not set
CONFIG_GENERIC_ISA_DMA=y
CONFIG_PCI=y
CONFIG_PCI_DOMAINS=y
# CONFIG_PCI_LEGACY_PROC is not set

#
# PCCARD (PCMCIA/CardBus) support
#
# CONFIG_PCCARD is not set

#
# Advanced setup
#
# CONFIG_ADVANCED_OPTIONS is not set
...
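
Since this kernel has CONFIG_IKCONFIG_PROC=y, its configuration can be read back from the gzip-compressed /proc/config.gz. Below is a minimal sketch (the helper names are hypothetical) that pulls out just the preemption options discussed in this thread; it assumes the usual .config convention that disabled options appear as "# CONFIG_FOO is not set".

```python
import gzip
import os
import re

SET_RE = re.compile(r"^(CONFIG_\w+)=(.*)$")
UNSET_RE = re.compile(r"^# (CONFIG_\w+) is not set$")

def parse_kconfig(text):
    """Map each CONFIG_ symbol to its value (surrounding quotes removed); unset ones map to 'n'."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        m = SET_RE.match(line)
        if m:
            options[m.group(1)] = m.group(2).strip('"')
            continue
        m = UNSET_RE.match(line)
        if m:
            options[m.group(1)] = "n"
    return options

def preempt_options(options):
    """Hypothetical helper: only the CONFIG_PREEMPT* entries, in sorted order."""
    return {k: v for k, v in sorted(options.items()) if k.startswith("CONFIG_PREEMPT")}

if __name__ == "__main__" and os.path.exists("/proc/config.gz"):
    with gzip.open("/proc/config.gz", "rt") as f:
        for name, value in preempt_options(parse_kconfig(f.read())).items():
            print(f"{name}={value}")
```

For the options visible in the listing above, this would report CONFIG_PREEMPT_NONE=y and CONFIG_PREEMPT, CONFIG_PREEMPT_BKL and CONFIG_PREEMPT_VOLUNTARY as n.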
yther
Apprentice
Joined: 25 Oct 2002
Posts: 151
Location: Charlotte, NC (USA)

Posted: Thu May 04, 2006 11:58 pm

Update for anyone who cares:

New kernel with all PREEMPT settings turned off seems stable while running; I have been using it for several weeks now with zero lockups or spontaneous reboots.

Boot problems remain; usually takes two to three tries before I can get through the startup sequence.
fb
l33t
Joined: 08 Dec 2003
Posts: 636
Location: New Zealand

Posted: Fri May 05, 2006 12:20 am

yther wrote:
Update for anyone who cares:

New kernel with all PREEMPT settings turned off seems stable while running; I have been using it for several weeks now with zero lockups or spontaneous reboots.

Boot problems remain; usually takes two to three tries before I can get through the startup sequence.


I have some problems when I cold boot, i.e. when I shut down my iMac G4 for the night and turn it on in the morning. I vaguely remember that another thread mentioned a component that had to warm up before you could actually boot. The suggestion was to add a delay to the boot sequence (by passing some parameter to the kernel) to give it time to warm up enough.
I usually take the opportunity to boot OS X and check whether there are any updates for that side, which I hardly ever use anymore, and then reboot into Linux once the computer is warm enough.
All times are GMT