LinuxQuestions.org


JordanH 10-26-2003 11:40 PM

Kernel bug or hardware problem? What do you think?
 
uh-oh.

My RH 9 box has frozen twice in the past two days. That's two more times than Linux has ever frozen on me before, and I don't even know where to start solving this problem - please help me track it down.

Firstly, it froze last night at almost 1am, then it froze again today sometime between 3pm and 1am.

Secondly, the messages in /var/log/messages before the crash indicated that there was a kernel bug - I'll post the text in another message in this thread.

Thirdly, I tried compiling another version of my current kernel 2.4.20-8 but had errors so it did not complete successfully. (Something about devlist.h not found but it was needed by names.o - I haven't had time to research that one yet)

Lastly, after rebooting yet again, I notice my memory check only runs up to ~383MB when this machine has 512MB.

Now here are some possibilities...
1. I screwed something up when compiling a new kernel, which made my existing kernel unstable.
2. My RAM is starting to burn out wreaking havoc in my system.
3. syslogd bailed and took out the whole box (errrr.... long shot)
4. Alien hackers used their freeze-death-ray on my poor linux router leaving no trace behind them.

*AH* Where do I start to fix this problem? H/W? Kernel??
Any help is appreciated,
J.

(text from logs to follow)

JordanH 10-26-2003 11:45 PM

Last and first messages from reboot today...

Oct 26 14:39:14 Alpha syslogd 1.4.1: restart.
Oct 27 01:14:26 Alpha syslogd 1.4.1: restart.
Oct 27 01:14:26 Alpha syslog: syslogd startup succeeded

JordanH 10-26-2003 11:47 PM

Log from last night...

Oct 26 00:21:59 Alpha kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000074
Oct 26 00:21:59 Alpha kernel: printing eip:
Oct 26 00:21:59 Alpha kernel: c0140d9b
Oct 26 00:21:59 Alpha kernel: *pde = 00000000
Oct 26 00:21:59 Alpha kernel: Oops: 0000
Oct 26 00:21:59 Alpha kernel: udf ipt_limit sg emu10k1 ac97_codec sound soundcore printer sr_mod agpgart nvidia parport_pc lp parport autofs ipt_MASQUERADE ipt_state ipt_LOG iptable_mangle
Oct 26 00:21:59 Alpha kernel: CPU: 0
Oct 26 00:21:59 Alpha kernel: EIP: 0060:[<c0140d9b>] Tainted: P
Oct 26 00:21:59 Alpha kernel: EFLAGS: 00210202
Oct 26 00:21:59 Alpha kernel:
Oct 26 00:21:59 Alpha kernel: EIP is at page_referenced [kernel] 0x227 (2.4.20-8)
Oct 26 00:21:59 Alpha kernel: eax: c1000030 ebx: 00000001 ecx: 00000000 edx: 00000001
Oct 26 00:21:59 Alpha kernel: esi: 0000000d edi: dc893a40 ebp: 00000001 esp: dffb3f84
Oct 26 00:21:59 Alpha kernel: ds: 0068 es: 0068 ss: 0068
Oct 26 00:21:59 Alpha kernel: Process kscand/Normal (pid: 7, stackpage=dffb3000)
Oct 26 00:21:59 Alpha kernel: Stack: dc893680 00000000 00000000 dffb3fb4 c1532650 c1532650 c0303a0c c11116dc
Oct 26 00:21:59 Alpha kernel: 00000003 c0139ade dffb2000 c0124b2c 00000001 00000003 dffb2000 c0303900
Oct 26 00:21:59 Alpha kernel: dffb2000 c013a924 c0303900 00000003 00000001 c025618c 000009c4 c013a868
Oct 26 00:21:59 Alpha kernel: Call Trace: [<c0139ade>] scan_active_list [kernel] 0x36 (0xdffb3fa8))
Oct 26 00:21:59 Alpha kernel: [<c0124b2c>] process_timeout [kernel] 0x0 (0xdffb3fb0))
Oct 26 00:21:59 Alpha kernel: [<c013a924>] kscand [kernel] 0xbc (0xdffb3fc8))
Oct 26 00:21:59 Alpha kernel: [<c013a868>] kscand [kernel] 0x0 (0xdffb3fe0))
Oct 26 00:21:59 Alpha kernel: [<c0107389>] kernel_thread_helper [kernel] 0x5 (0xdffb3ff0))
Oct 26 00:21:59 Alpha kernel:
Oct 26 00:21:59 Alpha kernel:
Oct 26 00:21:59 Alpha kernel: Code: 8b 41 74 39 41 60 0f 43 54 24 04 45 4e 89 54 24 04 0f 89 3e
Oct 26 00:31:49 Alpha modprobe: modprobe: Can't locate module sound-slot-1
Oct 26 00:31:49 Alpha modprobe: modprobe: Can't locate module sound-service-1-0
Oct 26 00:31:49 Alpha modprobe: modprobe: Can't locate module sound-slot-1
Oct 26 00:31:49 Alpha modprobe: modprobe: Can't locate module sound-service-1-0
Oct 26 00:47:10 Alpha kernel: ------------[ cut here ]------------
Oct 26 00:47:10 Alpha kernel: kernel BUG at page_alloc.c:139!
Oct 26 00:47:10 Alpha kernel: invalid operand: 0000
Oct 26 00:47:10 Alpha kernel: udf ipt_limit sg emu10k1 ac97_codec sound soundcore printer sr_mod agpgart nvidia parport_pc lp parport autofs ipt_MASQUERADE ipt_state ipt_LOG iptable_mangle
Oct 26 00:47:10 Alpha kernel: CPU: 0
Oct 26 00:47:10 Alpha kernel: EIP: 0060:[<c013b57d>] Tainted: P
Oct 26 00:47:10 Alpha kernel: EFLAGS: 00210282
Oct 26 00:47:10 Alpha kernel:
Oct 26 00:47:10 Alpha kernel: EIP is at __free_pages_ok [kernel] 0xdd (2.4.20-8)
Oct 26 00:47:10 Alpha kernel: eax: 01000018 ebx: c1532650 ecx: c1000030 edx: dc893a40
Oct 26 00:47:10 Alpha kernel: esi: 00000000 edi: 00000000 ebp: 00000000 esp: d43bddec
Oct 26 00:47:10 Alpha kernel: ds: 0068 es: 0068 ss: 0068
Oct 26 00:47:10 Alpha kernel: Process wish (pid: 4180, stackpage=d43bd000)
Oct 26 00:47:10 Alpha kernel: Stack: 000075ff 00200296 c0303b84 00200296 c0303900 c1038030 c0303b0c cf806374
Oct 26 00:47:10 Alpha kernel: cf806374 00100000 c1532650 cf806374 00100000 17c1c045 c012c6c8 c1532650
Oct 26 00:47:10 Alpha kernel: 00093000 c012eab7 c4afc0c0 08893000 cf806374 c0118ce7 00000094 08c00000
Oct 26 00:47:10 Alpha kernel: Call Trace: [<c012c6c8>] __free_pte [kernel] 0x4c (0xd43bde24))
Oct 26 00:47:10 Alpha kernel: [<c012eab7>] zap_pte_range [kernel] 0x12f (0xd43bde30))
Oct 26 00:47:10 Alpha kernel: [<c0118ce7>] sys_sched_yield [kernel] 0x73 (0xd43bde40))
Oct 26 00:47:10 Alpha kernel: [<c012cd1b>] zap_page_range [kernel] 0xc7 (0xd43bde58))
Oct 26 00:47:10 Alpha kernel: [<c012ffcf>] exit_mmap [kernel] 0xb3 (0xd43bde98))
Oct 26 00:47:10 Alpha kernel: [<c01196bb>] mmput [kernel] 0x47 (0xd43bdebc))
Oct 26 00:47:10 Alpha kernel: [<c011e991>] do_exit [kernel] 0xf1 (0xd43bdecc))
Oct 26 00:47:10 Alpha kernel: [<c011ec08>] do_group_exit [kernel] 0x50 (0xd43bdee8))
Oct 26 00:47:10 Alpha kernel: [<c012674d>] get_signal_to_deliver [kernel] 0x19d (0xd43bdef8))
Oct 26 00:47:10 Alpha kernel: [<c0109184>] do_signal [kernel] 0x68 (0xd43bdf20))
Oct 26 00:47:11 Alpha kernel: [<e081d03d>] ext3_file_write [ext3] 0x39 (0xd43bdf78))
Oct 26 00:47:11 Alpha kernel: [<c01268f8>] sys_rt_sigprocmask [kernel] 0xc8 (0xd43bdf94))
Oct 26 00:47:11 Alpha kernel: [<c01093ec>] signal_return [kernel] 0x14 (0xd43bdfc0))
Oct 26 00:47:11 Alpha kernel:
Oct 26 00:47:11 Alpha kernel:
Oct 26 00:47:11 Alpha kernel: Code: 0f 0b 8b 00 9b 61 25 c0 8b 43 18 89 f9 89 de 83 e0 eb 89 43
Oct 26 00:47:16 Alpha gdm(pam_unix)[2445]: session closed for user <name removed>
Oct 26 00:47:18 Alpha su(pam_unix)[6145]: session closed for user <name removed>
Oct 26 00:47:18 Alpha su(pam_unix)[14803]: session closed for user <name removed>
Oct 26 00:47:20 Alpha gdm[2445]: gdm_slave_xioerror_handler: Fatal X error - Restarting :0
Oct 26 00:49:16 Alpha gconfd (<name removed>-5848): GConf server is not in use, shutting down.
Oct 26 00:49:16 Alpha gconfd (<name removed>-5848): Exiting
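A log this long is easier to triage if the kernel fault markers are pulled out first. The following is only a minimal sketch: the function name `kernel_faults` and the list of patterns are illustrative assumptions, chosen from the strings that appear in the excerpt above, and are not a complete list of what a kernel can print.

```python
import re

# Marker strings copied from the log excerpt above; this is an
# illustrative selection, not an exhaustive list of kernel messages.
KERNEL_FAULT = re.compile(
    r"kernel: (Unable to handle kernel|Oops:|kernel BUG at|invalid operand:)"
)

def kernel_faults(lines):
    """Return the syslog lines that look like kernel fault markers."""
    return [line.rstrip("\n") for line in lines if KERNEL_FAULT.search(line)]

# A few lines taken from the log above, used here as sample input.
sample = [
    "Oct 26 00:21:59 Alpha kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000074",
    "Oct 26 00:21:59 Alpha kernel: Oops: 0000",
    "Oct 26 00:31:49 Alpha modprobe: modprobe: Can't locate module sound-slot-1",
    "Oct 26 00:47:10 Alpha kernel: kernel BUG at page_alloc.c:139!",
    "Oct 26 00:47:10 Alpha kernel: invalid operand: 0000",
]

for hit in kernel_faults(sample):
    print(hit)
```

To run it against a real log, pass an open file object (for example `/var/log/messages`, which normally needs root to read) instead of `sample`.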

tgflynn 10-27-2003 10:28 AM

If you only did make bzImage (saw this from your post on the USB thread) it wouldn't have affected your running kernel. (If you had done make install, it might have).

My guess would be bad RAM.

Where are you seeing the memory size message that changed? Try running the command free and checking whether the total on the Mem: line agrees with the amount of installed memory.
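The same check can be scripted by reading the MemTotal line of /proc/meminfo, which is also where free gets its total. This is a rough sketch rather than a diagnostic tool: the helper names `mem_total_kb` and `missing_mb` are made up for illustration, and the sample text is a stand-in for real /proc/meminfo output.

```python
import re

def mem_total_kb(meminfo_text):
    """Return the MemTotal value (in kB) from /proc/meminfo-style text."""
    match = re.search(r"^MemTotal:\s+(\d+)\s*kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("no MemTotal line found")
    return int(match.group(1))

def missing_mb(meminfo_text, installed_mb):
    """Return how many MB of the installed RAM the kernel does not report."""
    return installed_mb - mem_total_kb(meminfo_text) / 1024

# Stand-in for the contents of /proc/meminfo (hypothetical values).
sample = "MemTotal:       513024 kB\nMemFree:         10240 kB\n"

# 512 is the amount of RAM believed to be installed, supplied by the user.
print(missing_mb(sample, 512), "MB not reported")
```

On a live system, replace `sample` with `open("/proc/meminfo").read()`. A gap of a few MB is normal, because the kernel keeps some memory for itself and does not count it in the total; a gap the size of a whole DIMM is the thing to worry about.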

There's a program called memtest that runs thorough tests on your memory. I can't find a homepage for it, but here's the Freshmeat URL:

http://freshmeat.net/projects/memtest/?topic_id=136

If memory serves, you need to install it on a floppy and then boot the floppy. I think the tarball contains detailed instructions.

If it does turn out to be a RAM problem, you might want to try reseating the DIMMs before buying new memory.

Tim

JordanH 10-27-2003 12:57 PM

Thanks for the reply.

Currently, RAM is my best guess too since the memory check at bootup only sees 393,###kb (~383MB) which should be kernel independent.
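As a quick sanity check on that conversion, here is a small sketch of the arithmetic. The last three digits of the boot-time figure are not known, so the code only brackets the range 393,000 to 393,999 KB instead of guessing a value; `kb_to_mb` is a name invented for this example.

```python
KB_PER_MB = 1024  # the BIOS memory count is in binary kilobytes

def kb_to_mb(kb):
    """Convert kilobytes to megabytes (1 MB = 1024 KB)."""
    return kb / KB_PER_MB

# Bracket the unknown trailing digits of the 393,### KB figure.
low_kb, high_kb = 393_000, 393_999
expected_mb = 512  # what the machine is supposed to have

for kb in (low_kb, high_kb):
    seen = kb_to_mb(kb)
    print(f"{kb} KB = {seen:.1f} MB, short by {expected_mb - seen:.1f} MB")
```

Anywhere in that range, the shortfall comes out at roughly 128 MB, which is why it looks like a whole block of memory going missing and not a rounding difference.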

However, the timing of the memory, kernel BUG and kernel compilation is too close for comfort...

Looking forward, is there anything special I need to do to the kernel if I decide to add or remove RAM? If I'm forced to run 3x128MB instead of the expected 4x128MB, is there anything I need to change or recompile? (arg, I can't believe I have to ask this question. I feel 'new' all over again.)

tgflynn 10-27-2003 01:28 PM

The kernel bug may very well be a symptom of bad memory.

Again if all you did was compile (no install) that really shouldn't have affected anything.

Did you check the memory size given by free?

You don't have to do anything to the kernel if you change RAM DIMMs. It's purely a hardware matter.

Tim

JordanH 10-27-2003 01:31 PM

Sorry, I'm still at work and won't be able to check the memory or memory free until I am home this evening. I'll let you know ASAP my results. I'll also be pulling, pushing, prodding, shoving, nudging, reseating and swearing at the RAM and will post those results as well (perhaps, not the swearing ;) )

Thanks for clearing up the RAM question... I figured as much but wasn't sure if that was a possible cause for the Kernel Bug.

I will keep you posted on my trials and tribulations.
J.

JordanH 10-27-2003 07:14 PM

Oh wow... totally pooched.

It turns out I have 2x256MB of RAM... How only 128MB didn't register is a new one to me. However, after my poking and prodding, both sticks registered correctly and the memtest showed correctly.

Now it gets fun...

I ran free but the total RAM was about 501MB... ok, so maybe 11MB is off hiding someplace; either way, I downloaded and tried to run that memtest from Freshmeat (as per above)... BIG MISTAKE. I installed and ran it from /tmp/memtest/ and it corrupted the whole /tmp tree! *AH* I have no idea what else it has corrupted but fsck just went NUTS when running in maintenance mode. The errors were too numerous to list here...

Several reboots and different attempts later, I can't boot into X. I've had at least one Kernel Panic and right now, it's just blinking the nVidia splash screen as if it is trying to reload itself every time it crashes. Oye.

Another couple tries and then a re-install... I hope I didn't lose any /etc configs or /home data... o_O *eek*

JordanH 10-27-2003 08:23 PM

edit: removed my comments. The problem has been narrowed down to a partially working stick of RAM.

tgflynn 10-27-2003 08:53 PM

I'm very sorry about memtest. It turns out I pointed you to the wrong program.

The program I was talking about is now called memtest86 (I think it used to be called just memtest). memtest86 doesn't even run under Linux. It's a standalone program that you boot from a floppy and that does nothing but run RAM tests.

I should have read the Freshmeat description more carefully, but it never occurred to me there would be an entirely different program with such a similar name.

Tim

moeminhtun 10-27-2003 09:24 PM

I'm also having some problems with Red Hat 9.0 Personal, which I'm using as a server. It has already hung 2 times.
I've found that its memory usage gradually increases. I'm only running the default applications and servers that come with Red Hat 9. I don't know which application or server has the memory leak. Still finding out.

JordanH 10-27-2003 10:42 PM

Don't be sorry 'bout the memtest, I should have read about it before just blasting it at my system.

I was able to save my /home & /etc directories, however, /tmp was beyond repair (how that happened, I don't understand). There was something wrong with the /home tree as well and I had to cp the directories to a new directory before I could get a successful tarball. What a PITA.

Time to blow away the machine and start again... maybe I'll try Fedora Core or United. *sigh*

Robert0380 10-28-2003 12:55 AM

GENTOO!!!!!

tgflynn 10-28-2003 06:42 AM

Quote:

Originally posted by JordanH


I was able to save my /home & /etc directories, however, /tmp was beyond repair (how that happened, I don't understand).


Well, that memtest program is designed to stress test the kernel's memory management system. Doing this with bad RAM is probably a good recipe for making the kernel misbehave badly, and file system corruption is certainly a possibility. It's the last thing you'd want to be running in such a situation.

I think I'll try to contact the maintainer of memtest to see if he'd be willing to put a big visible warning in the README about this not being memtest86 and not for testing RAM. Maybe that would help keep this kind of thing from happening to someone else.

Tim


All times are GMT -5.