LinuxQuestions.org
Old 04-01-2012, 09:17 AM   #1
exscape
Member
 
Registered: Aug 2007
Location: Sweden
Distribution: OS X, Gentoo, FreeBSD
Posts: 82

Rep: Reputation: 15
Question Strange crashes (computer freezes 95%, i.e. not completely)


I'm not sure where to post this - the software/kernel/server forums are all possibilities, but I ended up with this forum.
Anyhow:

The past week I've been getting strange crashes from my Linux server. After crashing, it still responds to ping and connect() (nmap port scans show open ports), but cannot serve web pages, doesn't accept SSH connections (apart from the initial connect call that works), and apparently all processes stop running.
Also, it acts as a NAT router, and the computers behind it CAN still access the internet through it despite it having "crashed".

On the monitor/keyboard side, there's nothing interesting printed to the screen, just the login prompt. The keyboard appeared unresponsive until Alt+SysRq+E was pressed (I think - I did this over the phone, not locally... It was some SysRq key that unfroze the keyboard). After that, keyboard input worked, but since all processes had been killed, I don't know if connecting to it would've worked, and I didn't log in as I wasn't there myself.

My first guess, after noticing that the logs just ended at the crash time (but networking appeared semi-OK) was that the processes kept running, but no new ones started. That turned out to be false: I left mpstat + iostat + vmstat logging to files, and they too simply stopped at the same time as everything else did.

At this point I don't have a clue what it might be.
I've used the same kernel config/build since January, and the problems started March 23rd. It has since crashed every 1-3 days. I tried to revert the packages I updated that day - didn't help. (I didn't expect it to work, as there were no system packages in the list.)

Hardware:
Gigabyte GA-P31-DS3L (Intel P31 chipset)
Pentium Dual Core E2200 (Core 2-based, with less L2 cache)
2x2 GB DDR2-800
4 disks (2x Samsung F4, 1x WD Caviar Green WD10EADS, 1x Hitachi Deskstar 7K1000.B)

Software:
Gentoo/amd64, stable
Linux 3.0.6-gentoo (custom config)

CPU/RAM stress testing (linpack, cpuburn) indicates that it's stable in that regard. Besides, the crashes appear to happen with almost zero load.

Any ideas on what to try next?
 
Old 04-01-2012, 12:54 PM   #2
business_kid
Guru
 
Registered: Jan 2006
Location: Ireland
Distribution: Slackware & Android
Posts: 6,624

Rep: Reputation: 585
Make sure there's sshd started. As networking is semi-ok, your best chances may be in using ssh to log in and interrogate it (ps -e, top, or lsof may all be interesting).

Go over it with rkhunter or similar.

I did have something vaguely similar on an old k6 board. The ide had 2 disks on it, and it started throwing errors to stdout

hda: not ready
hdb: not ready

I gathered from using an oscilloscope on the ide line that lows were not going fully low, and the box just went out to lunch. (If this is adjustable in the bios, try it.) Any process requiring storage would crap out; networking was ok. The kernel was there, but the rest of the system was gone.

Even Ctrl_Alt_Del would fail
hda: not ready
hdb: not ready

because it was running 'shutdown -r now.'

I downgraded it to windows 98, & one hd, and gave it to someone I knew with a need for a pc but without the cash to upgrade his 386, and it lived into old age.

Food for thought. . .
 
Old 04-01-2012, 01:56 PM   #3
exscape
Member
 
Registered: Aug 2007
Location: Sweden
Distribution: OS X, Gentoo, FreeBSD
Posts: 82

Original Poster
Rep: Reputation: 15
sshd is always running, but it won't accept connections when this happens (2nd paragraph of OP). Thus I can't run any commands. I'm not 100% sure (as, again, this was over the phone) but I think the local keyboard was also unresponsive to begin with.
I'm trying to stay logged in via SSH in case it crashes, but I don't expect to be able to do anything since it appears to simply stop running.
 
Old 04-01-2012, 02:37 PM   #4
business_kid
Guru
 
Registered: Jan 2006
Location: Ireland
Distribution: Slackware & Android
Posts: 6,624

Rep: Reputation: 585
If it's a hardware problem, it _will_ get worse. If it's software, it may do anything. It's quite possible everything is up; routing works, but anything requiring / is down. Top, ps, ls, etc are awol.

It really sounds like a variation on my problem in post #2
 
Old 04-06-2012, 09:56 AM   #5
exscape
Member
 
Registered: Aug 2007
Location: Sweden
Distribution: OS X, Gentoo, FreeBSD
Posts: 82

Original Poster
Rep: Reputation: 15
I've stayed logged in via SSH 100% of the time my laptop is on (when I'm awake), and it's crashed twice today - lo and behold, I *can* run commands. Some, and they might get stuck in uninterruptible sleep, but at least it kinda works.

Anyhow. The results are... >300 cron processes, many of which are in uninterruptible sleep - the rest in regular sleep ("S" state). There are also perhaps a half dozen couriertls processes, presumably from my mail client attempting to check the inbox, and getting stuck. Load averages go over 300 before it appears to hang more or less completely to SSH input.
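
(Editorial aside, not part of the original post.) A load average above 300 on a nearly idle box is less odd than it looks: on Linux the load average counts tasks in uninterruptible sleep as well as runnable ones, so hundreds of stuck cron processes push it up without using any CPU. Below is a minimal Python sketch for reading those numbers; the function name `parse_loadavg` and the sample line in the usage note are made up for illustration.

```python
def parse_loadavg(text):
    """Parse the contents of /proc/loadavg into a dict.

    The file holds five whitespace-separated fields: the 1, 5 and 15 minute
    load averages, "runnable/total" scheduling entities, and the most
    recently created PID.
    """
    one, five, fifteen, tasks, last_pid = text.split()
    runnable, total = tasks.split("/")
    return {
        "load1": float(one),
        "load5": float(five),
        "load15": float(fifteen),
        "runnable": int(runnable),
        "total": int(total),
        "last_pid": int(last_pid),
    }


def read_loadavg(path="/proc/loadavg"):
    """Read and parse the live values (Linux only)."""
    with open(path) as f:
        return parse_loadavg(f.read())
```

Fed a hypothetical line such as `312.40 250.11 120.05 3/412 9876`, the parser would report a one-minute load of 312.4 with only 3 runnable tasks, which is the pattern you would expect when most of the load comes from blocked processes rather than busy ones.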

ANY process that tries to write to disk (logging, ls > /tmp/test, etc.) appears to hang - so business_kid may well be on to something.
sudo hangs at least after a while (as the load increases).

However, dmesg shows nothing at all out of the ordinary, and as I said, log files stop working... So how the heck do I continue to narrow this down?
smartctl -A looks good for all disks (taken just after the crash "started" last time, before it froze completely).
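
(Editorial aside, not part of the original post.) To see exactly which processes are stuck in uninterruptible sleep, the state field of `/proc/<pid>/stat` can be read directly. The following is a rough Python sketch under the assumption that reads still work during the hang, as described above; the helper names `parse_stat_line` and `blocked_processes` are invented for this example.

```python
import os


def parse_stat_line(line):
    """Split one /proc/<pid>/stat line into (pid, comm, state)."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so locate it via the first '(' and the last ')'.
    start = line.index("(")
    end = line.rindex(")")
    pid = int(line[:start].strip())
    comm = line[start + 1:end]
    state = line[end + 1:].split()[0]
    return pid, comm, state


def blocked_processes(proc_root="/proc"):
    """Return [(pid, comm)] for processes currently in state 'D'."""
    blocked = []
    if not os.path.isdir(proc_root):
        return blocked  # not a Linux procfs; nothing to report
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat_line(f.read())
        except (OSError, ValueError):
            continue  # the process exited while we were looking
        if state == "D":
            blocked.append((pid, comm))
    return blocked
```

Calling `blocked_processes()` during a hang should list the cron and couriertls processes mentioned above if they really are in the "D" state. It only tells you which processes are blocked, not what they are waiting on.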
 
Old 04-07-2012, 10:59 AM   #6
business_kid
Guru
 
Registered: Jan 2006
Location: Ireland
Distribution: Slackware & Android
Posts: 6,624

Rep: Reputation: 585
If I'm right - and the fact that you're out of other ideas tends to support this - it's a hardware issue. Low is not going low, or high is not going high. No point in checking logs - Catch-22! If you could write logs, you wouldn't have an error :-).

Check your BIOS settings for a disk drive current setting, and change it.

Unhook some disks and try it. Failing that, it's the ide/sata card, or Southbridge.
 
Old 04-07-2012, 11:26 AM   #7
exscape
Member
 
Registered: Aug 2007
Location: Sweden
Distribution: OS X, Gentoo, FreeBSD
Posts: 82

Original Poster
Rep: Reputation: 15
Quote:
Originally Posted by business_kid
If I'm right - and the fact that you're out of other ideas tends to support this - it's a hardware issue. Low is not going low, or high is not going high. No point in checking logs - Catch-22! If you could write logs, you wouldn't have an error :-).

Check your BIOS settings for a disk drive current setting, and change it.

Unhook some disks and try it. Failing that, it's the ide/sata card, or Southbridge.
Right, logs clearly won't work, but dmesg should (it's a RAM buffer), and that is empty. I've never previously had a HDD/storage controller issue with absolutely no error messages in dmesg while it's happening. :/
 
Old 04-07-2012, 12:33 PM   #8
business_kid
Guru
 
Registered: Jan 2006
Location: Ireland
Distribution: Slackware & Android
Posts: 6,624

Rep: Reputation: 585
Errors on stdout? That's where I got them. Some random terminal, if disaster struck while that was focused, otherwise (Because I use runlevel 3 & startx) on Ctrl_Alt_F1. It's also possible it's a heat related memory issue, or data transfer issue (i.e. northbridge - southbridge)

Right now, on this box, the last dmesg entries are from wlan0, but there's a bucketful of 'OLE' errors from wine on Ctrl_Alt_F1

Look for a link to thermal changes.
 
Old 04-07-2012, 12:38 PM   #9
exscape
Member
 
Registered: Aug 2007
Location: Sweden
Distribution: OS X, Gentoo, FreeBSD
Posts: 82

Original Poster
Rep: Reputation: 15
Probably not thermal, at least not northbridge/CPU as both are under control. I've also stress-tested the CPU with no problems - and to add to that, it's currently slightly underclocked, whereas I've had it 36% OC'ed for 2 years or so.

There are no errors over SSH, and not on the first "local console" either. :/
 
Old 04-07-2012, 02:38 PM   #10
business_kid
Guru
 
Registered: Jan 2006
Location: Ireland
Distribution: Slackware & Android
Posts: 6,624

Rep: Reputation: 585
Move slowly now.

Think it through carefully. You're very close to it; you're going to solve this, because I don't have the details.

No kernel crash, no errors anywhere, but no disk activity; you say
Quote:
I *can* run commands. Some,
Shell commands requiring no disk activity might run (can you think of one?), as might anything in ram. Otherwise, we need to know what you can run. I get the feeling it's queuing stuff and saying nothing. Maybe that could be explained by a very low log level.

Then try anything you can remove in some other box.
 
Old 04-07-2012, 02:41 PM   #11
exscape
Member
 
Registered: Aug 2007
Location: Sweden
Distribution: OS X, Gentoo, FreeBSD
Posts: 82

Original Poster
Rep: Reputation: 15
Quote:
Originally Posted by business_kid
Move slowly now.

Think it through carefully. You're very close to it; you're going to solve this, because I don't have the details.

No kernel crash, no errors anywhere, but no disk activity; you say

Shell commands requiring no disk activity might run (can you think of one?), as might anything in ram. Otherwise, we need to know what you can run. I get the feeling it's queuing stuff and saying nothing. Maybe that could be explained by a very low log level.

Then try anything you can remove in some other box.
Commands that only *read* from the disk work (including relatively disk-intensive stuff like find /) until the load creeps up way too high. Writing to / (sda2) stops working when it hangs (I assume that's exactly what makes it hang in the first place). Writing to a mdadm RAID1 array composed of a sda partition and a sdc partition does seem to work - and iostat even reports written data to sda during that.

Since that makes it sound as if only the root *partition* is broken, the next thing I'm trying is to simply fsck it (and/or, the next time this happens, I'll try writing to /boot, which is a separate partition on sda), though my hopes aren't too high that it will help.
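
(Editorial aside, not part of the original post.) The "which filesystem still accepts writes?" test can be scripted so the shell doing the probing does not get stuck itself. This is only a sketch in Python: `can_write` and the 5 second default timeout are assumptions made for illustration. The write runs in a child process because a writer blocked in uninterruptible sleep cannot be interrupted, so the parent just stops waiting for it.

```python
import subprocess
import sys

# Child program: create a small file, force it to disk, then remove it.
_CHILD = r"""
import os, sys
path = os.path.join(sys.argv[1], ".write_probe.%d" % os.getpid())
fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
try:
    os.write(fd, b"probe\n")
    os.fsync(fd)
finally:
    os.close(fd)
    os.unlink(path)
"""


def can_write(directory, timeout=5.0):
    """Return True if a small write+fsync in `directory` finishes in time."""
    proc = subprocess.Popen(
        [sys.executable, "-c", _CHILD, directory],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        # Exit status 0 means the file was written, synced and removed.
        return proc.wait(timeout=timeout) == 0
    except subprocess.TimeoutExpired:
        # A child stuck in uninterruptible sleep will not react to signals,
        # so send SIGKILL but do not wait for it; waiting could hang us too.
        proc.kill()
        return False
```

Run as a user with write access, comparing `can_write("/")` with `can_write("/boot")` during a hang would show whether only the root filesystem is affected. Note that each probe stuck on a hung filesystem leaves one more blocked process behind.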

Last edited by exscape; 04-07-2012 at 02:44 PM.
 
Old 04-08-2012, 03:57 AM   #12
business_kid
Guru
 
Registered: Jan 2006
Location: Ireland
Distribution: Slackware & Android
Posts: 6,624

Rep: Reputation: 585
That kinda narrows it to the hard disk, or conceivably the controller.
Hard disks go in 2 ways: the board or the (more common) surface errors. The other possibility is the kernel. I would reach for a standard issue 2.6 kernel from your distro and try that. If that crashes too, you can eliminate the kernel.

Chips usually fail catastrophically. But some failures involve a few of the many x 100k internal devices, and never affect you until that part of the circuit becomes relevant. Then that little bit can lock up an otherwise good IC in some way.

FWIW, I have a laptop here (this laptop) which runs 3.0.4 and hangs occasionally - usually video related e.g. running X, hit Ctrl_Alt_F3 for a terminal I'm logged into, and it's gone. Mouse moves but does nothing, keyboard is awol. It's particularly tricky when I have it running an external monitor, suspend it, and resume elsewhere without the monitor. I couldn't give an <expletive deleted> what is hung or not hung, I just kill it and restart.
 
Old 04-18-2012, 08:32 AM   #13
exscape
Member
 
Registered: Aug 2007
Location: Sweden
Distribution: OS X, Gentoo, FreeBSD
Posts: 82

Original Poster
Rep: Reputation: 15
Quote:
Originally Posted by exscape
Writing to / (sda2) stops working when it hangs (I assume that's exactly what makes it hang in the first place). Writing to a mdadm RAID1 array composed of a sda partition and a sdc partition does seem to work - and iostat even reports written data to sda during that.

Since that makes it sound as if only the root *partition* is broken, the next thing I'm trying is to simply fsck it (and/or, the next time this happens, I'll try writing to /boot, which is a separate partition on sda), though my hopes aren't too high that it will help.
It sure looks like this part was correct!
I did try writing to /boot, which worked flawlessly despite / having hanged.
First, I booted off a LiveCD to run reiserfsck on /, which reported no errors. Since it considered the partition OK, and no changes had been made (same kernel for many months before the crashes, same reiserfsprogs), I decided to try a filesystem switch. So, I moved all data off / (LiveCD again), formatted it as ext4, and moved the data back.

I haven't had a crash since, despite stress tests. Uptime > 1.5 weeks, whereas it used to crash within 2 hours of stress testing.

 
Old 04-18-2012, 12:41 PM   #14
business_kid
Guru
 
Registered: Jan 2006
Location: Ireland
Distribution: Slackware & Android
Posts: 6,624

Rep: Reputation: 585
So, reiserfs was hanging. Good detective work. The boys over at the reiser project need a bug report from you, and you can mark this solved.
Sorry for continual misdirection, but I was stuck in a hardware loop.
 
  

