System freeze every few minutes

pickarooney · 06-30-2007, 02:32 AM

For the last week my system has been locking up constantly, and more and more frequently. I'm at my wit's end as to what to do about it. If anyone can shed any light on the problem...
I've been testing this evening (5 crashes in an hour) and the crashes happen like so:
The cursor stops blinking (or images stop moving if I'm watching a video), everything stops, I can move the mouse once - it jumps a few cm - and then everything freezes up. On one occasion, after five minutes it came back alive again, but other times I can leave it for over an hour and nothing moves.
I don't think it's a heating problem - nothing is particularly hot in the system, and I left the PC running all day with a LiveCD and there was no crash (I'm currently logged in from a LiveCD after 6 crashes in a row, all OK so far, touch wood).
I've also been running top constantly and processes haven't been going over 60% when crashes occur.

unSpawn · 06-30-2007, 03:09 AM

Just a hunch. Try booting your current kernel with "noapic nolapic noacpi" boot args.

pickarooney · 06-30-2007, 03:48 AM

Quote:

Originally Posted by unSpawn

Just a hunch. Try booting your current kernel with "noapic nolapic noacpi" boot args.

I wanted to do that, as I get the following message when I boot:

MP-BIOS bug: 8254 timer not connected to IO-APIC

I tried adding noapic to menu.lst on the line with the kernel name, but the OS would not boot with this parameter. I think I might have put it in the wrong place. Where exactly should the three options above go in this file:

Code:

# menu.lst - See: grub(8), info grub, update-grub(8)
#            grub-install(8), grub-floppy(8),
#            grub-md5-crypt, /usr/share/doc/grub
#            and /usr/share/doc/grub-doc/.

## default num
# Set the default entry to the entry number NUM. Numbering starts from 0, and
# the entry number 0 is the default if the command is not used.
#
# You can specify 'saved' instead of a number. In this case, the default entry
# is the entry saved with the command 'savedefault'.
# WARNING: If you are using dmraid do not change this entry to 'saved' or your
# array will desync and will not let you boot your system.
default		0

## timeout sec
# Set a timeout, in SEC seconds, before automatically booting the default entry
# (normally the first entry defined).
timeout		3

## hiddenmenu
# Hides the menu by default (press ESC to see the menu)
hiddenmenu

# Pretty colours
#color cyan/blue white/blue

## password ['--md5'] passwd
# If used in the first section of a menu file, disable all interactive editing
# control (menu entry editor and command-line)  and entries protected by the
# command 'lock'
# e.g. password topsecret
#      password --md5 $1$gLhU0/$aW78kHK1QfV3P2b2znUoe/
# password topsecret

#
# examples
#
# title		Windows 95/98/NT/2000
# root		(hd0,0)
# makeactive
# chainloader	+1
#
# title		Linux
# root		(hd0,1)
# kernel	/vmlinuz root=/dev/hda2 ro
#

#
# Put static boot stanzas before and/or after AUTOMAGIC KERNEL LIST

### BEGIN AUTOMAGIC KERNELS LIST
## lines between the AUTOMAGIC KERNELS LIST markers will be modified
## by the debian update-grub script except for the default options below

## DO NOT UNCOMMENT THEM, Just edit them to your needs

## ## Start Default Options ##
## default kernel options
## default kernel options for automagic boot options
## If you want special options for specific kernels use kopt_x_y_z
## where x.y.z is kernel version. Minor versions can be omitted.
## e.g. kopt=root=/dev/hda1 ro
##      kopt_2_6_8=root=/dev/hdc1 ro
##      kopt_2_6_8_2_686=root=/dev/hdc2 ro
# kopt=root=UUID=b9348ed0-9064-4863-8f17-b4a7da2ff060 ro

## Setup crashdump menu entries
## e.g. crashdump=1
# crashdump=0

## default grub root device
## e.g. groot=(hd0,0)
# groot=(hd2,0)

## should update-grub create alternative automagic boot options
## e.g. alternative=true
##      alternative=false
# alternative=true

## should update-grub lock alternative automagic boot options
## e.g. lockalternative=true
##      lockalternative=false
# lockalternative=false

## additional options to use with the default boot option, but not with the
## alternatives
## e.g. defoptions=vga=791 resume=/dev/hda5
# defoptions=quiet splash

## should update-grub lock old automagic boot options
## e.g. lockold=false
##      lockold=true
# lockold=false

## Xen hypervisor options to use with the default Xen boot option
# xenhopt=

## Xen Linux kernel options to use with the default Xen boot option
# xenkopt=console=tty0

## altoption boot targets option
## multiple altoptions lines are allowed
## e.g. altoptions=(extra menu suffix) extra boot options
##      altoptions=(recovery) single
# altoptions=(recovery mode) single

## controls how many kernels should be put into the menu.lst
## only counts the first occurence of a kernel, not the
## alternative kernel options
## e.g. howmany=all
##      howmany=7
# howmany=all

## should update-grub create memtest86 boot option
## e.g. memtest86=true
##      memtest86=false
# memtest86=true

## should update-grub adjust the value of the default booted system
## can be true or false
# updatedefaultentry=false

## ## End Default Options ##

title		Ubuntu, kernel 2.6.20-16-generic
root		(hd0,0)
kernel		/vmlinuz-2.6.20-16-generic root=UUID=b9348ed0-9064-4863-8f17-b4a7da2ff060 ro quiet splash
initrd		/initrd.img-2.6.20-16-generic
quiet
savedefault

title		Ubuntu, kernel 2.6.20-16-generic (recovery mode)
root		(hd0,0)
kernel		/vmlinuz-2.6.20-16-generic root=UUID=b9348ed0-9064-4863-8f17-b4a7da2ff060 ro single
initrd		/initrd.img-2.6.20-16-generic

title		Ubuntu, kernel 2.6.17-11-generic
root		(hd0,0)
kernel		/vmlinuz-2.6.17-11-generic root=UUID=b9348ed0-9064-4863-8f17-b4a7da2ff060 ro quiet splash
initrd		/initrd.img-2.6.17-11-generic
quiet
savedefault

title		Ubuntu, kernel 2.6.17-11-generic (recovery mode)
root		(hd0,0)
kernel		/vmlinuz-2.6.17-11-generic root=UUID=b9348ed0-9064-4863-8f17-b4a7da2ff060 ro single
initrd		/initrd.img-2.6.17-11-generic

title		Ubuntu, kernel 2.6.17-10-generic
root		(hd0,0)
kernel		/vmlinuz-2.6.17-10-generic root=UUID=b9348ed0-9064-4863-8f17-b4a7da2ff060 ro quiet splash
initrd		/initrd.img-2.6.17-10-generic
quiet
savedefault

title		Ubuntu, kernel 2.6.17-10-generic (recovery mode)
root		(hd0,0)
kernel		/vmlinuz-2.6.17-10-generic root=UUID=b9348ed0-9064-4863-8f17-b4a7da2ff060 ro single
initrd		/initrd.img-2.6.17-10-generic

title		Ubuntu, memtest86+
root		(hd0,0)
kernel		/memtest86+.bin
quiet

### END DEBIAN AUTOMAGIC KERNELS LIST

I really appreciate the help.

tredegar · 06-30-2007, 10:03 AM

Quote:

Where exactly should the three options above go in this file:

Here:

Code:

kernel		/vmlinuz-2.6.20-16-generic root=UUID=b9348ed0-9064-4863-8f17-b4a7da2ff060 ro quiet splash noapic nolapic acpi=off

You may need to try different combinations (max 7 to try!)

Maybe also check your memory - run memtest86 from your grub kernel selection screen. Be warned: it takes ages!
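
The "max 7" above comes from each of the three options being either present or absent, minus the case where none is used: 2^3 - 1 = 7. A minimal shell sketch that lists the combinations (the function name is made up for illustration):

```shell
# Print every non-empty combination of the three boot options, one per line.
boot_option_combos() {
    for a in "" noapic; do
        for b in "" nolapic; do
            for c in "" acpi=off; do
                # Unquoted on purpose: word splitting drops the empty slots.
                combo=$(echo $a $b $c)
                [ -n "$combo" ] && echo "$combo"
            done
        done
    done
}

boot_option_combos
```

Each printed line is what you would append to the end of the kernel line. With GRUB legacy you can also try a combination for a single boot without editing menu.lst: press ESC at boot to show the hidden menu, press e on the entry, e again on the kernel line, add the options, press Enter, then b to boot.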

jiml8 · 06-30-2007, 06:09 PM

Try to ssh into the locked-up Linux box from another computer (maybe your laptop...)

You will find that in most circumstances the condition you describe isn't a crash, but something has deadlocked or else is in an infinite loop. You would encounter this problem in the event of a substantial misconfiguration of something, or if you have a corrupted filesystem, or if you have mismatched libraries. Or, sometimes, if you just are running a really badly behaved application.

So the machine might not be crashed, and if you manage to ssh in from a remote location you can both find out where the problem is AND fix it.

If there is a busy-wait (infinite loop) sucking up your processor, then you may have to wait a while for the ssh login to be processed, and responsiveness to your remote shell may suck. But if the system isn't genuinely locked up, this is the best way to go to sort out the problem.
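
Once you are in over ssh, one thing worth looking for is processes stuck in uninterruptible sleep (state D in ps), which usually means they are waiting on I/O that never completes. A small sketch, assuming the standard procps ps; the helper name is hypothetical:

```shell
# Reads `ps -eo pid,stat,comm` output on stdin, skips the header line and
# prints "PID COMMAND" for every process whose STAT starts with D.
blocked_procs() {
    awk 'NR > 1 && $2 ~ /^D/ { print $1, $3 }'
}

# On the frozen machine, from the remote shell:
#   ps -eo pid,stat,comm | blocked_procs
#   dmesg | tail -20
```

If the same few processes sit in state D for minutes, the kernel log (dmesg) is the next place to look for what they are waiting on. A busy loop would instead show up as a process in state R at the top of top.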

Larry Webb · 06-30-2007, 07:13 PM

If this has been happening gradually and memtest passes, take the side panel off and use compressed air to blow out the PSU and CPU cooling fins. In our part of the country you need to do it every 6 to 9 months.

pickarooney · 06-30-2007, 07:54 PM

OK, I tried editing the line to

Code:

kernel		/vmlinuz-2.6.20-16-generic root=UUID=b9348ed0-9064-4863-8f17-b4a7da2ff060 ro quiet splash noapic nolapic noacpi

But I can't get past the splash screen; impossible to boot. Furthermore, the PC won't switch back on again after I power down unless I remove and replace the battery or hold the power button down while unplugging and replugging the cable.

When I did manage to restart, I got stuck on the splash screen again, hit Ctrl-Alt-F1 and found the boot screen hanging on a message telling me an image could not be found. I hit Ctrl-Alt-Del and the screen flickered on and off and came back to the boot screen with thousands of messages:
hdb: drive not ready for command

I powered down and rebooted to a LiveCD.

Last time I got a lockup I let it sit for a while and was able to move the mouse about once every five minutes by a couple of pixels, so yes, I think it is some kind of deadlock.
I have no other PCs nor access to any means of connecting remotely to this one, unfortunately.

I think I'm going to have to just guess at what's most likely to be the problem and

1 - run memtest all night
and if that gives no clues
2 - reformat the hard drive
and if that doesn't work
3 - buy a new HD
and if that doesn't work
4 - buy a new motherboard
and if that doesn't work
5 - ???

ivotron · 06-30-2007, 08:01 PM

Try with acpi=off noapic nolapic

pickarooney · 07-01-2007, 06:15 AM

OK, memtest ran for 10 hours with no errors.
Rebooted and got stuck on the first splashscreen (the PC manufacturer's name with options to edit BIOS settings) for about a minute, then on the Kubuntu splashscreen for another two minutes.
Hit Ctrl-Alt-F1: No resume image: doing normal boot. It hung here until I hit Ctrl-Alt-Del.

Next I get:

fdisk died with exit status 8

and land on the root@machinename prompt with no mounted drives.

Ctrl-D and the OS finally boots.

I try re-editing menu.lst with acpi=off noapic nolapic and reboot. Exactly the same procedure again (at least it booted after I forced it).

This kind of points to a hard drive problem, but would that explain the lengthy hang on the initial splashscreen?

pickarooney · 07-01-2007, 06:43 AM

The plot thickens...

I shut down, 'flushed' the CMOS battery and restarted. Manufacturer's splashscreen came and went as normal, GRUB went OK, Kubuntu splashscreen appeared and loaded to half-way, then bombed out to a black screen. New message:

hdb1 contains a file system with errors, check forced.

Nothing further happens until I hit Ctrl-Alt-Del and booting resumes, but fails on login with a small window marked

Could not start kstartupconfig. Check your installation.

I try a few times to login, but nothing doing.

Back on line via LiveCD...

/dev/hdb1 is a HD with only data files on it - images, music etc. No reason why a corruption on it should stop the OS from loading, right?
This new error about kstartupconfig - where's it coming from and what's my best next step?

I guess I need to see about fixing the file system on hdb1 but I also need to resolve the impossibility of logging in...

Also, I don't have an install CD of the latest version of Kubuntu - this was installed via update manager. The ISO of the latest CD is on the hard drive with the OS. If I could get in, I could burn it...

Am I going round in circles or just circling the drain at this stage?

tredegar · 07-01-2007, 07:49 AM

If you can boot from a live CD, you should be able to run fsck on /dev/hdb1 (which needs to be unmounted while fsck checks it).
man fsck for all the details, options, and warnings.

If /dev/hdb is not necessary for you to boot your system, does it boot any better if you physically remove it?
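
A cautious way to do the check from the live CD is to confirm the partition is not mounted and then run the checker read-only first, so nothing on disk is changed until you have seen the report. This is only a sketch, assuming the partition is ext2/ext3; the function name is made up:

```shell
# Read-only filesystem check of an unmounted ext2/ext3 partition.
check_fs() {
    dev=$1
    [ -b "$dev" ] || { echo "$dev is not a block device" >&2; return 2; }
    if grep -q "^$dev " /proc/mounts; then
        echo "$dev is mounted; unmount it first (sudo umount $dev)" >&2
        return 1
    fi
    # -n: open the filesystem read-only and answer "no" to every question
    sudo e2fsck -n "$dev"
}

# Example:
#   check_fs /dev/hdb1
```

If the read-only pass reports problems and the drive itself is readable, running e2fsck again without -n lets it attempt repairs. If the drive is returning I/O errors, repairs are unlikely to help until the hardware problem is sorted out.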

pickarooney · 07-01-2007, 08:16 AM

Quote:

Originally Posted by tredegar

If you can boot from a live CD, you should be able to run fsck on /dev/hdb1 (which needs to be unmounted while fsck checks it).
man fsck for all the details, options, and warnings.

If /dev/hdb is not necessary for you to boot your system, does it boot any better if you physically remove it?

I'll try removing it next time I reboot (although I'll still have that strange kstartupconfig problem - some error in my user config, I think).

Meanwhile, fsck gave me the following result. Not very encouraging:

Code:

sudo fsck /dev/hdb1
fsck 1.38 (30-Jun-2005)
e2fsck 1.38 (30-Jun-2005)
fsck.ext2: Attempt to read block from filesystem resulted in short read while trying to open /dev/hdb1
Could this be a zero-length partition?

I was able to mount this earlier on, but now when I try

Code:

sudo mount -t ext3 /dev/hdb1 /media/disk/

I get this:

Code:

mount: wrong fs type, bad option, bad superblock on /dev/hdb1,
       missing codepage or other error
       In some cases useful info is found in syslog - try
       dmesg | tail  or so

dmesg | tail -20 gives me

Code:

[4300998.123000] end_request: I/O error, dev hdb, sector 248
[4300998.123000] end_request: I/O error, dev hdb, sector 249
[4300998.123000] end_request: I/O error, dev hdb, sector 250
[4300998.123000] end_request: I/O error, dev hdb, sector 251
[4300998.123000] end_request: I/O error, dev hdb, sector 252
[4300998.123000] end_request: I/O error, dev hdb, sector 253
[4300998.123000] end_request: I/O error, dev hdb, sector 254
[4300998.123000] end_request: I/O error, dev hdb, sector 255
[4300998.123000] end_request: I/O error, dev hdb, sector 256
[4300998.123000] end_request: I/O error, dev hdb, sector 257
[4300998.123000] end_request: I/O error, dev hdb, sector 258
[4300998.123000] end_request: I/O error, dev hdb, sector 259
[4300998.123000] end_request: I/O error, dev hdb, sector 260
[4300998.123000] end_request: I/O error, dev hdb, sector 261
[4300998.123000] end_request: I/O error, dev hdb, sector 262
[4300998.123000] end_request: I/O error, dev hdb, sector 63
[4300998.123000] SQUASHFS error: sb_bread failed reading block 0x0
[4300998.123000] SQUASHFS error: unable to read superblock
[4301003.481000] end_request: I/O error, dev hdb, sector 65
[4301003.481000] EXT3-fs: unable to read superblock

Wow, looks like I'm in even worse shape than I thought

edit: I get exactly the same problem trying to fsck the other IDE HD (/dev/hda*, my main HD is /dev/sda1). Nobody could be unlucky enough to lose two hard drives, surely?
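
With a log like the one above, it can help to boil the I/O errors down to one line per device: how many errors, and over which sector range. A rough sketch using awk (the function name is made up, and it assumes the "I/O error, dev X, sector N" message format shown in the dmesg output):

```shell
# Summarise "I/O error" lines from the kernel log, one row per device:
#   <device> <error count> <lowest sector> <highest sector>
summarize_io_errors() {
    awk '/I\/O error, dev / {
        dev = $0; sub(/.*I\/O error, dev /, "", dev); sub(/,.*/, "", dev)
        sec = $0; sub(/.*sector /, "", sec); sec += 0
        if (!(dev in n)) { lo[dev] = sec; hi[dev] = sec }
        if (sec < lo[dev]) lo[dev] = sec
        if (sec > hi[dev]) hi[dev] = sec
        n[dev]++
    }
    END { for (d in n) print d, n[d], lo[d], hi[d] }' | sort
}

# Typical use on the affected machine:
#   dmesg | summarize_io_errors
```

Errors spread over a small, fixed set of sectors tend to suggest damaged areas on the platter. Errors on whatever sector happens to be requested, including the very start of the partition where the superblock lives, and on more than one drive, look more like the drive or the path to it not responding at all.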

tredegar · 07-01-2007, 11:14 AM

I agree - it doesn't look good.

Quote:

Nobody could be unlucky enough to lose two hard drives, surely?

Unlikely. But maybe your controller is bad?
Can you try those HDDs in another box?

pickarooney · 07-01-2007, 12:31 PM

Just restarted the PC after three hours cooling off (me as much as the computer) and it booted with no problems or error messages, with all drives and partitions mounted and readable.

Perhaps the drives themselves were all heating one another up? Back to square one, in any case - apart from the fact that ACPI/APIC has now been disabled.

Hmm, it rained in the meantime, perhaps cooling things down by a few degrees.

[breaking news] I just heard a whirr-click from one of the drives and had a momentary freeze, and now everything is slowing down...

jiml8 · 07-01-2007, 02:29 PM

Sure looks like a hardware fault at this point.

Probably you don't have 2 HDs going bad, but remember that there is still a single point of failure: your controller or your cable. Try unplugging/replugging the cable. Make sure the controller chip isn't getting too hot. Try replacing the cable.

You also potentially could have a jumper problem that would cause this if both drives are on the same cable and, for instance, both are jumpered "master".
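
One way to get a hint about drive versus cable/controller is to ask the drives themselves via SMART. The sketch below assumes the smartmontools package (smartctl) is installed and that the drives support SMART. The helper name is hypothetical, and attribute names vary somewhat between vendors:

```shell
# Print the SMART health verdict plus a few attributes relevant to this thread.
drive_health() {
    dev=$1
    [ -b "$dev" ] || { echo "$dev is not a block device" >&2; return 2; }
    command -v smartctl >/dev/null 2>&1 || {
        echo "smartctl not found (install smartmontools)" >&2; return 3; }
    sudo smartctl -H "$dev"    # overall health self-assessment
    # Temperature, reallocated/pending sectors, and interface CRC errors
    sudo smartctl -A "$dev" | grep -i -E 'temperature|reallocated|pending|crc'
}

# Example:
#   drive_health /dev/hda
#   drive_health /dev/hdb
```

Roughly speaking, rising reallocated or pending sector counts point at the drive, a growing UDMA CRC error count points at the cable or connection, and a high temperature reading would support the overheating theory. If both drives look healthy by SMART but the kernel still logs I/O errors, the shared cable or controller becomes the more likely suspect.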