Linux keeps crashing under high load..help
Well I have had this Linux box built for ~6 months now and the longest I was able to keep it up was ~ 15 days in a row.
I am running an XP 2100 on a Shuttle AK35GT motherboard with 1.25 GB of RAM and a 400 W PSU, which is supposed to be one of the best, and I have 3 HDs, 1 floppy and 1 CD-RW. Whenever I copy large amounts of data from one drive to the next, or bzip 20 files in a row, or anything like this, I get a crash and a reboot. Any ideas how to debug this? Thanks |
Are there any clues in /var/log/messages?
|
It crashed 3 more times. I removed 1 memory module to see if that helps. It doesn't.
I opened up my case to let more airflow in. It didn't help. I did see one useful message (not): Segmentation fault, and then the whole system freezes. /var/log/messages has nothing in it to help. I have now turned my FSB down from 133 to 130. It didn't help. Any other suggestions? |
Have you tried running any other OS's on it? Maybe a different kernel? Maybe you could try running it on a live-cd like knoppix for awhile.
|
I do need a BIOS flash, because I upgraded to an XP processor and the BIOS doesn't recognize it. Do you think this may be the problem?
|
maybe this will work
I would try this:
1. Shut down all non-essential programs and run in console mode with no extras on (drop to a lower runlevel).
2. Remove all non-essential hardware from your system. This sounds very much like a hardware thing to me. (I guess that is why it is correctly placed in the hardware forum ;-)
3. Turn down anything that you can in the BIOS. Can you reduce the RAM speed? Is there a compatibility setting (a "more stable" versus "more performance" switch)?
4. Try another distro. I can especially recommend Knoppix, since you can boot it off a CD and it will not change or harm your config in any way. Then torture the machine a little: run heavy stuff on it and see if it crashes again.
5. Play a game on it if you can, since games usually use a lot of juice from the system. Does that crash too?

What distro are you using? Is dmesg (the kernel's boot messages) saying anything strange while recognizing your hardware during bootup?

Cheers
Markus |
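For the "torture the machine a little" step, something along these lines could work from any environment that has Python, including a live CD. It is only a rough sketch, not a proper burn-in tool: the function names and the work_dir argument are made up for illustration, and it simply imitates the workload that triggers the crashes in this thread (big copies plus bzip2). It writes random data, copies it, and checks that the checksums still agree, then pushes the data through a bzip2 round trip.

```python
import bz2
import hashlib
import os
import shutil


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stress_pass(work_dir, size_mb=64):
    """Run one write/copy/compress pass in work_dir (hypothetical scratch dir).

    Returns True only if the copy matches the original and the bzip2
    round trip returns the same bytes.
    """
    source = os.path.join(work_dir, "stress_src.bin")
    copy = os.path.join(work_dir, "stress_copy.bin")

    # Write size_mb MiB of random data, hashing it before it touches the disk.
    expected = hashlib.sha256()
    with open(source, "wb") as handle:
        for _ in range(size_mb):
            block = os.urandom(1 << 20)
            expected.update(block)
            handle.write(block)

    # Disk-heavy part: copy the file and re-read both copies.
    shutil.copyfile(source, copy)
    copy_ok = sha256_of(source) == sha256_of(copy) == expected.hexdigest()

    # CPU- and RAM-heavy part: compress and decompress the whole file.
    with open(copy, "rb") as handle:
        data = handle.read()
    roundtrip_ok = bz2.decompress(bz2.compress(data)) == data

    os.remove(source)
    os.remove(copy)
    return copy_ok and roundtrip_ok
```

Calling stress_pass in a loop on a directory that lives on one of the hard disks, with a larger size_mb, keeps the disks and the CPU busy at the same time. A False result, or a crash in the middle of a pass, on a setup whose software is otherwise known to be good would be one more hint toward hardware, though it would not prove it.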
funny
Two people had the Knoppix idea at the same time ;-)
|
thanks for the reply.
I am using RH 8.0. What can I do in Knoppix to strain the system? Also, here is my dmesg:

Linux version 2.4.20-18.8 (compile@daffy.perf.redhat.com) (gcc version 3.2 20020903 (Red Hat Linux 8.0 3.2 -7)) #1 Thu May 29 07:20:39 EDT 2003
BIOS-provided physical RAM map:
BIOS-e820: 0000000000000000 - 00000000000a0000 (usable)
BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 000000003fff0000 (usable)
BIOS-e820: 000000003fff0000 - 000000003fff3000 (ACPI NVS)
BIOS-e820: 000000003fff3000 - 0000000040000000 (ACPI data)
BIOS-e820: 00000000ffff0000 - 0000000100000000 (reserved)
127MB HIGHMEM available.
896MB LOWMEM available.
On node 0 totalpages: 262128
zone(0): 4096 pages.
zone(1): 225280 pages.
zone(2): 32752 pages.
Kernel command line: auto BOOT_IMAGE=linux ro BOOT_FILE=/boot/vmlinuz-2.4.20-18.8 hdd=ide-scsi root=LABEL=/
ide_setup: hdd=ide-scsi
Initializing CPU#0
Detected 1696.415 MHz processor.
Console: colour VGA+ 80x25
Calibrating delay loop... 3381.65 BogoMIPS
Memory: 1027612k/1048512k available (1310k kernel code, 17320k reserved, 995k data, 132k init, 131008k highmem)
Dentry cache hash table entries: 131072 (order: 8, 1048576 bytes)
Inode cache hash table entries: 65536 (order: 7, 524288 bytes)
Mount cache hash table entries: 512 (order: 0, 4096 bytes)
Buffer-cache hash table entries: 65536 (order: 6, 262144 bytes)
Page-cache hash table entries: 262144 (order: 8, 1048576 bytes)
CPU: CLK_CTL MSR was 6003d22f. Reprogramming to 2003d22f
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 256K (64 bytes/line)
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
CPU: After generic, caps: 0383f9ff c1c3f9ff 00000000 00000000
CPU: Common caps: 0383f9ff c1c3f9ff 00000000 00000000
CPU: AMD Unknown CPU Type stepping 01
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Checking 'hlt' instruction... OK.
POSIX conformance testing by UNIFIX
mtrr: v1.40 (20010327) Richard Gooch (rgooch@atnf.csiro.au)
mtrr: detected mtrr type: Intel
PCI: PCI BIOS revision 2.10 entry at 0xfb440, last bus=1
PCI: Using configuration type 1
PCI: Probing PCI hardware
PCI: Using IRQ router default [1106/3099] at 00:00.0
isapnp: Scanning for PnP cards...
isapnp: No Plug & Play device found
Linux NET4.0 for Linux 2.4
Based upon Swansea University Computer Society NET3.039
Initializing RT netlink socket
apm: BIOS version 1.2 Flags 0x07 (Driver version 1.16)
Starting kswapd
allocated 32 pages and 32 bhs reserved for the highmem bounces
VFS: Disk quotas vdquot_6.5.1
Detected PS/2 Mouse Port.
pty: 2048 Unix98 ptys configured
Serial driver version 5.05c (2001-07-08) with MANY_PORTS MULTIPORT SHARE_IRQ SERIAL_PCI ISAPNP enabled
ttyS0 at 0x03f8 (irq = 4) is a 16550A
ttyS1 at 0x02f8 (irq = 3) is a 16550A
Real Time Clock Driver v1.10e
Floppy drive(s): fd0 is 1.44M
FDC 0 is a post-1991 82077
NET4: Frame Diverter 0.46
RAMDISK driver initialized: 16 RAM disks of 4096K size 1024 blocksize
Uniform Multi-Platform E-IDE driver Revision: 7.00beta3-.2.4
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
VP_IDE: IDE controller at PCI slot 00:11.1
VP_IDE: chipset revision 6
VP_IDE: not 100% native mode: will probe irqs later
VP_IDE: VIA vt8233a (rev 00) IDE UDMA133 controller on pci00:11.1
ide0: BM-DMA at 0xe000-0xe007, BIOS settings: hda:DMA, hdb:DMA
ide1: BM-DMA at 0xe008-0xe00f, BIOS settings: hdc:DMA, hdd:DMA
hda: MAXTOR 6L080J4, ATA DISK drive
hdb: Maxtor 33073H3, ATA DISK drive
blk: queue c039eba0, I/O limit 4095Mb (mask 0xffffffff)
blk: queue c039ece4, I/O limit 4095Mb (mask 0xffffffff)
hdc: Maxtor 92041U4, ATA DISK drive
hdd: IDE DVD-ROM 16X, ATAPI CD/DVD-ROM drive
blk: queue c039f004, I/O limit 4095Mb (mask 0xffffffff)
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
ide1 at 0x170-0x177,0x376 on irq 15
hda: attached ide-disk driver.
hda: host protected area => 1
hda: 156355584 sectors (80054 MB) w/1819KiB Cache, CHS=9732/255/63, UDMA(133)
hdb: attached ide-disk driver.
hdb: host protected area => 1
hdb: 60032448 sectors (30737 MB) w/2048KiB Cache, CHS=59556/16/63, UDMA(100)
hdc: attached ide-disk driver.
hdc: host protected area => 1
hdc: 40020624 sectors (20491 MB) w/512KiB Cache, CHS=39703/16/63, UDMA(66)
ide-floppy driver 0.99.newide
Partition check:
hda: hda1 hda2 hda3 hda4 < hda5 hda6 hda7 hda8 hda9 >
hdb: hdb1 hdb2 hdb3 hdb4
hdc: [PTBL] [2491/255/63] hdc1 hdc2 < hdc5 hdc6 hdc7 hdc8 hdc9 >
ide-floppy driver 0.99.newide
md: md driver 0.90.0 MAX_MD_DEVS=256, MD_SB_DISKS=27
md: Autodetecting RAID arrays.
md: autorun ...
md: ... autorun DONE.
NET4: Linux TCP/IP 1.0 for NET4.0
IP Protocols: ICMP, UDP, TCP, IGMP
IP: routing cache hash table of 8192 buckets, 64Kbytes
TCP: Hash tables configured (established 262144 bind 65536)
Linux IP multicast router 0.06 plus PIM-SM
NET4: Unix domain sockets 1.0/SMP for Linux NET4.0.
RAMDISK: Compressed image found at block 0
Freeing initrd memory: 125k freed
VFS: Mounted root (ext2 filesystem).
Journalled Block Device driver loaded
EXT3-fs: INFO: recovery required on readonly filesystem.
EXT3-fs: write access will be enabled during recovery.
kjournald starting. Commit interval 5 seconds
EXT3-fs: recovery complete.
EXT3-fs: mounted filesystem with ordered data mode.
Freeing unused kernel memory: 132k freed
usb.c: registered new driver usbdevfs
usb.c: registered new driver hub
usb-uhci.c: $Revision: 1.275 $ time 07:35:30 May 29 2003
usb-uhci.c: High bandwidth mode enabled
usb-uhci.c: USB UHCI at I/O 0xe400, IRQ 12
usb-uhci.c: Detected 2 ports
usb.c: new USB bus registered, assigned bus number 1
hub.c: USB hub found
hub.c: 2 ports detected
usb-uhci.c: USB UHCI at I/O 0xe800, IRQ 12
usb-uhci.c: Detected 2 ports
usb.c: new USB bus registered, assigned bus number 2
hub.c: USB hub found
hub.c: 2 ports detected
usb-uhci.c: v1.275:USB Universal Host Controller Interface driver
usb.c: registered new driver hiddev
usb.c: registered new driver hid
hid-core.c: v1.8.1 Andreas Gal, Vojtech Pavlik <vojtech@suse.cz>
hid-core.c: USB HID support drivers
mice: PS/2 mouse device common for all mice
EXT3 FS 2.4-0.9.19, 19 August 2002 on ide0(3,3), internal journal
Adding Swap: 875500k swap-space (priority -1)
kjournald starting. Commit interval 5 seconds
EXT3 FS 2.4-0.9.19, 19 August 2002 on ide0(3,1), internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald starting. Commit interval 5 seconds
EXT3 FS 2.4-0.9.19, 19 August 2002 on ide0(3,2), internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald starting. Commit interval 5 seconds
EXT3 FS 2.4-0.9.19, 19 August 2002 on ide0(3,8), internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald starting. Commit interval 5 seconds
EXT3 FS 2.4-0.9.19, 19 August 2002 on ide0(3,7), internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald starting. Commit interval 5 seconds
EXT3 FS 2.4-0.9.19, 19 August 2002 on ide0(3,6), internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald starting. Commit interval 5 seconds
EXT3 FS 2.4-0.9.19, 19 August 2002 on ide0(3,5), internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald starting. Commit interval 5 seconds
EXT3 FS 2.4-0.9.19, 19 August 2002 on ide1(22,1), internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald starting. Commit interval 5 seconds
EXT3 FS 2.4-0.9.19, 19 August 2002 on ide1(22,9), internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald starting. Commit interval 5 seconds
EXT3 FS 2.4-0.9.19, 19 August 2002 on ide1(22,8), internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald starting. Commit interval 5 seconds
EXT3 FS 2.4-0.9.19, 19 August 2002 on ide1(22,6), internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald starting. Commit interval 5 seconds
EXT3 FS 2.4-0.9.19, 19 August 2002 on ide1(22,7), internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald starting. Commit interval 5 seconds
EXT3 FS 2.4-0.9.19, 19 August 2002 on ide0(3,65), internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald starting. Commit interval 5 seconds
EXT3 FS 2.4-0.9.19, 19 August 2002 on ide0(3,66), internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald starting. Commit interval 5 seconds
EXT3 FS 2.4-0.9.19, 19 August 2002 on ide0(3,67), internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald starting. Commit interval 5 seconds
EXT3 FS 2.4-0.9.19, 19 August 2002 on ide0(3,68), internal journal
EXT3-fs: mounted filesystem with ordered data mode.
SCSI subsystem driver Revision: 1.00
hdd: attached ide-scsi driver.
scsi0 : SCSI host adapter emulation for IDE ATAPI devices
Vendor: IDE Model: DVD-ROM 16X Rev: 3.10
Type: CD-ROM ANSI SCSI revision: 02
parport0: PC-style at 0x378 [PCSPP,TRISTATE]
ip_tables: (C) 2000-2002 Netfilter core team
Linux Tulip driver version 0.9.15-pre12 (Aug 9, 2002)
tulip0: Transceiver selection forced to 100baseTx.
divert: allocating divert_blk for eth0
eth0: ADMtek Comet rev 17 at 0xf890f000, , IRQ 5.
divert: allocating divert_blk for eth1
eth1: ADMtek Comet rev 17 at 0xf8911000, , IRQ 12.
ip_conntrack version 2.1 (8191 buckets, 65528 max) - 292 bytes per conntrack
|
Sounds like a possible PSU issue. I know you said in the first message that the PSU is 'supposed to be one of the best', but even so a faulty or underpowered PSU would certainly cause stability problems, especially with all of the hardware you have. As I have learned just recently (props to Geoff_f and Rav) your specific CPU will normally pull power from the 12v rail of your PSU. Your HDs also pull from the 12v rail, so it is possible that when you are doing a HD intensive operation (such as copying large files between drives) that it is stealing enough juice to underpower the CPU to the point of failure.
This is my best guess (but still just a guess mind you). Can you post the specs of your PSU? Specifically the 3.3v, 5v, and 12v ratings along with the TCO? slight |
Finally some info...
I flashed my BIOS and restarted the mkisofs command of my 4 Gig iso. I got these kernel messages and then a system halt:
Jun 5 22:14:53 bruce -- root[1476]: ROOT LOGIN ON tty1
Jun 5 22:18:47 bruce kernel: Unable to handle kernel paging request at virtual address 005cd044
Jun 5 22:18:47 bruce kernel: printing eip:
Jun 5 22:18:47 bruce kernel: c0136692
Jun 5 22:18:47 bruce kernel: *pde = 00000000
Jun 5 22:18:47 bruce kernel: Oops: 0000
Jun 5 22:18:47 bruce kernel: autofs ipt_TOS ipt_REJECT ipt_LOG ipt_limit ipt_MASQUERADE ipt_state iptable_mangle iptable_nat ip_conntrack tulip iptable_filter ip_tables ide-scsi scsi_mod
Jun 5 22:18:47 bruce kernel: CPU: 0
Jun 5 22:18:47 bruce kernel: EIP: 0010:[<c0136692>] Not tainted
Jun 5 22:18:47 bruce kernel: EFLAGS: 00010246
Jun 5 22:18:47 bruce kernel:
Jun 5 22:18:47 bruce kernel: EIP is at scan_active_list [kernel] 0x42 (2.4.20-18.8)
Jun 5 22:18:47 bruce kernel: eax: 005cd044 ebx: c15ccff0 ecx: 00000000 edx: c36bffac
Jun 5 22:18:47 bruce kernel: esi: c03027f4 edi: 005cd044 ebp: 00000000 esp: c36bffa4
Jun 5 22:19:03 bruce kernel: ds: 0018 es: 0018 ss: 0018
Jun 5 22:19:34 bruce kernel: Process kscand (pid: 6, stackpage=c36bf000)
Jun 5 22:19:48 bruce kernel: Stack: c01174c0 00000000 00000001 c36be000 00000000 c0302680 00000000 c01374f4
Jun 5 22:19:54 bruce kernel: c0302680 00000000 00000000 c36be000 00000001 00000000 000001f4 00010f00
Jun 5 22:19:54 bruce kernel: c1e17fa0 c0105000 0008e000 c010726e 00000000 c01373f0 c1e16000
Jun 5 22:19:54 bruce kernel: Call Trace: [<c01174c0>] process_timeout [kernel] 0x0 (0xc36bffa4))
Jun 5 22:19:54 bruce kernel: [<c01374f4>] kscand [kernel] 0x104 (0xc36bffc0))
Jun 5 22:19:54 bruce kernel: [<c0105000>] stext [kernel] 0x0 (0xc36bffe8))
Jun 5 22:19:54 bruce kernel: [<c010726e>] arch_kernel_thread [kernel] 0x2e (0xc36bfff0))
Jun 5 22:19:54 bruce kernel: [<c01373f0>] kscand [kernel] 0x0 (0xc36bfff8))
Jun 5 22:19:54 bruce kernel:
Jun 5 22:19:54 bruce kernel:
Jun 5 22:19:54 bruce kernel: Code: 8b 3f 39 f0 75 de 83 c4 0c 31 c0 5b 5e 5f 5d c3 89 1c 24 89
Jun 5 22:19:54 bruce kernel: <1>Unable to handle kernel paging request at virtual address 2035c04c
Jun 5 22:19:54 bruce kernel: printing eip:
Jun 5 22:19:54 bruce kernel: c01341ea
Jun 5 22:19:54 bruce kernel: *pde = 00000000
Jun 5 22:19:54 bruce kernel: Oops: 0000
Jun 5 22:19:54 bruce kernel: autofs ipt_TOS ipt_REJECT ipt_LOG ipt_limit ipt_MASQUERADE ipt_state iptable_mangle iptable_nat ip_conntrack tulip iptable_filter ip_tables ide-scsi scsi_mod
Jun 5 22:19:54 bruce kernel: CPU: 0
Jun 5 22:19:54 bruce kernel: EIP: 0010:[<c01341ea>] Not tainted
Jun 5 22:19:54 bruce kernel: EFLAGS: 00010002
Jun 5 22:19:54 bruce kernel:
Jun 5 22:19:54 bruce kernel: EIP is at __kmem_cache_alloc [kernel] 0x4a (2.4.20-18.8)
Jun 5 22:19:54 bruce kernel: eax: 9500000d ebx: 9500000d ecx: cc35c000 edx: ef402980
Jun 5 22:19:54 bruce kernel: esi: c36b3e40 edi: 00000246 ebp: 000000f0 esp: f7027dfc
Jun 5 22:19:54 bruce kernel: ds: 0018 es: 0018 ss: 0018
Jun 5 22:19:54 bruce kernel: Process mkisofs (pid: 1525, stackpage=f7027000)
Jun 5 22:19:54 bruce kernel: Stack: 0028d2b8 c3664dc0 00000000 00000000 00001000 00000001 c0142bbf c36b3e40
Jun 5 22:19:54 bruce kernel: 000000f0 c0142c58 00000001 00000000 0028d2b8 00000342 c1579f40 c1f2a960
Jun 5 22:19:54 bruce kernel: ebe979c0 c0142ed5 c1579f40 00001000 00000001 0000000c 00000000 c0143806
Jun 5 22:19:54 bruce kernel: Call Trace: [<c0142bbf>] get_unused_buffer_head [kernel] 0x3f (0xf7027e14))
Jun 5 22:19:54 bruce kernel: [<c0142c58>] create_buffers [kernel] 0x28 (0xf7027e20))
Jun 5 22:19:54 bruce kernel: [<c0142ed5>] create_empty_buffers [kernel] 0x25 (0xf7027e40))
Jun 5 22:19:54 bruce kernel: [<c0143806>] block_read_full_page [kernel] 0x2a6 (0xf7027e58))
Jun 5 22:19:54 bruce kernel: [<c01352fd>] lru_cache_add [kernel] 0x11d (0xf7027e8c))
Jun 5 22:19:54 bruce kernel: [<c012d870>] page_cache_read [kernel] 0xc0 (0xf7027eb4))
Jun 5 22:19:54 bruce kernel: [<f881f540>] ext3_get_block [ext3] 0x0 (0xf7027ebc))
Jun 5 22:19:54 bruce kernel: [<c012e00b>] generic_file_readahead [kernel] 0xdb (0xf7027edc))
Jun 5 22:19:54 bruce kernel: [<c012e49f>] do_generic_file_read [kernel] 0x36f (0xf7027f0c))
Jun 5 22:19:54 bruce kernel: [<c012e870>] file_read_actor [kernel] 0x0 (0xf7027f38))
Jun 5 22:19:54 bruce kernel: [<c012e9f0>] generic_file_read [kernel] 0xb0 (0xf7027f58))
Jun 5 22:19:54 bruce kernel: [<c012e870>] file_read_actor [kernel] 0x0 (0xf7027f68))
Jun 5 22:19:54 bruce kernel: [<c01407e3>] sys_read [kernel] 0xa3 (0xf7027f94))
Jun 5 22:19:54 bruce kernel: [<c0108dff>] system_call [kernel] 0x33 (0xf7027fc0))
Jun 5 22:19:54 bruce kernel:

SYSTEM HALTED. I had to do a manual reboot with the case switch. It actually happened twice during the process: once about halfway through, and the second time about 80% through. Then it segmentation faulted. Any ideas? It looks like this is a memory issue, no? |
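When several of these reports pile up in /var/log/messages, it can help to boil each one down to the faulting address, the function named on the "EIP is at" line, and the process. Here is a small sketch of that, keyed only to the message wording visible in the log above; the function name summarize_oopses is made up, and other kernels word their reports differently.

```python
import re
from typing import Dict, List

# Patterns taken from the wording of the 2.4.20 reports quoted in this thread.
FAULT_RE = re.compile(
    r"Unable to handle kernel paging request at virtual address ([0-9a-f]+)"
)
EIP_RE = re.compile(r"EIP is at (\S+) \[(\S+)\] (0x[0-9a-f]+)")
PROC_RE = re.compile(r"Process (\S+) \(pid: (\d+)")


def summarize_oopses(log_text: str) -> List[Dict[str, str]]:
    """Return one summary dict per paging-request oops found in log_text."""
    # Collapse whitespace so reports that were re-wrapped still match.
    text = " ".join(log_text.split())
    starts = [match.start() for match in FAULT_RE.finditer(text)]
    reports = []
    for index, start in enumerate(starts):
        # Each report runs until the next "Unable to handle ..." line.
        end = starts[index + 1] if index + 1 < len(starts) else len(text)
        chunk = text[start:end]
        fault = FAULT_RE.search(chunk)
        eip = EIP_RE.search(chunk)
        proc = PROC_RE.search(chunk)
        reports.append({
            "address": fault.group(1),
            "function": eip.group(1) if eip else "?",
            "offset": eip.group(3) if eip else "?",
            "process": proc.group(1) if proc else "?",
        })
    return reports
```

Fed the log above, this gives two entries: scan_active_list in kscand and __kmem_cache_alloc in mkisofs. Two faults in unrelated kernel paths, at unrelated addresses, are often read as a hint toward flaky hardware rather than one specific kernel bug, but that is a rule of thumb and not proof.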
Well, after all of this I did some searching on Google and found several people who have the same problem as I do. The common denominator was that we are all using kernel version 2.4.20.x.
So I booted with kernel 2.4.18 from March, ran mkisofs twice to create two 4 GB files, and there have been no problems. I am now tarring a 10 GB directory and piping it through bzip2; I'd say it is 40% through and so far there are no errors. So in conclusion, whatever the developers did to the kernel paging parameters from 2.4.18 to 2.4.20, they need to undo. Does anyone know where I can file a bug/request for these types of issues? Thanks |
Not so fast
Well, it worked a little better, but not 100% better.
So now I have changed the power supply from a 400 W Allied (12 A on the 12 V rail) to a 450 W Allied (18 A on the 12 V rail). So far so good, but it has only been 2 hours of heavy usage. I will keep everyone updated. If this works, I will try 2.4.20 again to see whether it is stable with that kernel or not. Thanks for the suggestions. |
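For anyone comparing the two supplies, the rail figures say more than the label wattage. A quick back-of-the-envelope check using only the numbers quoted above (P = V * I); the helper name is made up:

```python
def rail_power_watts(volts: float, amps: float) -> float:
    """Maximum power a rail can deliver at its rated current: P = V * I."""
    return volts * amps


old_rail = rail_power_watts(12, 12)  # 400 W unit: 12 A on the 12 V rail
new_rail = rail_power_watts(12, 18)  # 450 W unit: 18 A on the 12 V rail
gain_percent = (new_rail - old_rail) / old_rail * 100

print(f"old: {old_rail:.0f} W, new: {new_rail:.0f} W, gain: {gain_percent:.0f}%")
# prints "old: 144 W, new: 216 W, gain: 50%"
```

So the label only went up from 400 W to 450 W (12.5%), but the 12 V rail has 50% more capacity, and that is the rail the CPU and the drive motors are said to be pulling from earlier in the thread.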
Hope everything works OK.
In the future what you might want to do is install lmsensors and use gkrellm to monitor your core voltages, that way you can see if your core voltages drop during heavy usage. You can also monitor temps this way as well, as heat can also be a contributing factor to instability. slight |
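If gkrellm is not available, the same readings can be polled from a script. The sketch below assumes a kernel that exposes sensor chips under /sys/class/hwmon, which newer kernels do. The 2.4 kernels discussed in this thread put lm_sensors data elsewhere, so treat the default path as an assumption to adjust. The function name read_sensors is made up.

```python
import glob
import os
from typing import Dict, Optional


def _read_text(path: str) -> Optional[str]:
    """Return the stripped contents of a small text file, or None on error."""
    try:
        with open(path) as handle:
            return handle.read().strip()
    except OSError:
        return None


def read_sensors(base: str = "/sys/class/hwmon") -> Dict[str, float]:
    """Collect voltages (V) and temperatures (deg C) from hwmon-style files.

    in*_input files hold millivolts and temp*_input files hold
    millidegrees Celsius, so both are divided by 1000.
    """
    readings: Dict[str, float] = {}
    for chip_dir in sorted(glob.glob(os.path.join(base, "hwmon*"))):
        chip = _read_text(os.path.join(chip_dir, "name")) or os.path.basename(chip_dir)
        for pattern in ("in*_input", "temp*_input"):
            for path in sorted(glob.glob(os.path.join(chip_dir, pattern))):
                raw = _read_text(path)
                if raw is None or not raw.lstrip("-").isdigit():
                    continue  # skip unreadable or non-numeric entries
                label = os.path.basename(path)[: -len("_input")]
                readings[f"{chip}/{label}"] = int(raw) / 1000.0
    return readings
```

Calling read_sensors() every few seconds while a big copy or bzip2 job runs, and keeping the lowest value seen for each voltage, would show whether a rail sags under load. Which in* entry corresponds to the core voltage or the 12 V rail depends on the board, so that mapping has to be looked up per chip.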
slight..
I have lmsensors, but I am having a basic problem figuring out my current .config file for the kernel rebuild. I have Red Hat and use their automatic update, which installs the source in /usr/src, but where can I find my current config so I don't have to go through the manual xconfig and answer 100 questions? |
When you say config, do you mean the kernel parameters, or the X config? The X config for Red Hat should be /etc/X11/XF86Config, AFAIK.
I thought that lmsensors could be compiled as a module though, so I'm not sure that a kernel recompile is necessary. I haven't played around with lmsensors in a while though, so I might be wrong. slight |
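On the kernel .config question: distribution kernels usually ship the config they were built with, so the 100 questions can normally be skipped. Here is a small sketch that looks in the usual places. Red Hat kernel packages install /boot/config-<release>; the other two paths are assumptions that only exist on some setups. The function names are made up.

```python
import os
import platform
from typing import Callable, List, Optional


def candidate_config_paths(release: Optional[str] = None) -> List[str]:
    """List places where the running kernel's build config is often found."""
    release = release or platform.release()  # same string as `uname -r`
    return [
        f"/boot/config-{release}",               # installed by distro kernel packages
        "/proc/config.gz",                       # only if the kernel was built to expose it
        f"/lib/modules/{release}/build/.config", # only if a configured source tree is linked
    ]


def find_kernel_config(
    release: Optional[str] = None,
    exists: Callable[[str], bool] = os.path.exists,
) -> Optional[str]:
    """Return the first candidate path that exists, or None if none do."""
    for path in candidate_config_paths(release):
        if exists(path):
            return path
    return None
```

Whatever file this turns up can be copied into the kernel source tree as .config, after which `make oldconfig` only asks about options that are new since that config was written.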