LinuxQuestions.org


exscape 04-01-2012 08:17 AM

Strange crashes (computer freezes 95%, i.e. not completely)
 
I'm not sure where to post this - the software/kernel/server forums are all possibilities, but I ended up with this forum.
Anyhow:

The past week I've been getting strange crashes from my Linux server. After crashing, it still responds to ping and connect() (nmap port scans show open ports), but cannot serve web pages, doesn't accept SSH connections (apart from the initial connect call that works), and apparently all processes stop running.
Also, it acts as a NAT router, and the computers behind it CAN still access the internet through it despite it having "crashed".

On the monitor/keyboard side, there's nothing interesting printed to the screen, just the login prompt. The keyboard appeared unresponsive until Alt+SysRq+E was pressed (I think - I did this over the phone, not locally... It was some SysRq key that unfroze the keyboard). After that, keyboard input worked, but since all processes were killed, I don't know if connecting to it would've worked, and I didn't log in as I wasn't there myself.

My first guess, after noticing that the logs just ended at the crash time (but networking appeared semi-OK) was that the processes kept running, but no new ones started. That turned out to be false: I left mpstat + iostat + vmstat logging to files, and they too simply stopped at the same time as everything else did.

At this point I don't have a clue what it might be.
I've used the same kernel config/build since January, and the problems started March 23rd. It has since crashed every 1-3 days. I tried to revert the packages I updated that day - didn't help. (I didn't expect it to work, as there were no system packages in the list.)

Hardware:
Gigabyte GA-P31-DS3L (Intel P31 chipset)
Pentium Dual Core E2200 (Core 2-based, with less L2 cache)
2x2 GB DDR2-800
4 disks (2x Samsung F4, 1x WD Caviar Green WD10EADS, 1x Hitachi Deskstar 7K1000.B)

Software:
Gentoo/amd64, stable
Linux 3.0.6-gentoo (custom config)

CPU/RAM stress testing (linpack, cpuburn) indicates that it's stable in that regard. Besides, the crashes appear to happen with almost zero load.

Any ideas on what to try next?

business_kid 04-01-2012 11:54 AM

Make sure sshd is started. As networking is semi-ok, your best chance may be to use ssh to log in and interrogate it (ps -e, top, or lsof may all be interesting).

Go over it with rkhunter or similar.

I did have something vaguely similar on an old K6 board. The IDE channel had 2 disks on it, and it started throwing errors to stdout:

hda: not ready
hdb: not ready

I gathered from using an oscilloscope on the IDE line that lows were not going fully low, and the box just went out to lunch. (If this is adjustable in the BIOS, try it.) Any process requiring storage would crap out; networking was ok. The kernel was there, but the entire system was gone.

Even Ctrl_Alt_Del would fail with
hda: not ready
hdb: not ready

because it was running 'shutdown -r now'.

I downgraded it to Windows 98 and one hard disk, and gave it to someone I knew who needed a PC but didn't have the cash to upgrade his 386, and it lived into old age.

Food for thought. . .

exscape 04-01-2012 12:56 PM

sshd is always running, but it won't accept connections when this happens (2nd paragraph of OP :)). Thus I can't run any commands. I'm not 100% sure (as, again, this was over the phone) but I think the local keyboard was also unresponsive to begin with.
I'm trying to stay logged in via SSH in case it crashes, but I don't expect to be able to do anything since it appears to simply stop running.

business_kid 04-01-2012 01:37 PM

If it's a hardware problem, it _will_ get worse. If it's software, it may do anything. It's quite possible everything is up; routing works, but anything requiring / is down. Top, ps, ls, etc are awol.

It really sounds like a variation on my problem in post #2

exscape 04-06-2012 08:56 AM

I've stayed logged in via SSH 100% of the time my laptop is on (when I'm awake), and it's crashed twice today - lo and behold, I *can* run commands. Some, and they might get stuck in uninterruptible sleep, but at least it kinda works.

Anyhow. The results are... >300 cron processes, many of which are in uninterruptible sleep - the rest in regular sleep ("S" state). There are also perhaps a half dozen couriertls processes, presumably from my mail client attempting to check the inbox, and getting stuck. Load averages go over 300 before it appears to hang more or less completely to SSH input.
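
For anyone wanting to tally process states like this without depending on ps, here is a minimal Python sketch. It assumes a Linux /proc layout and that the interpreter itself can still be started (reads reportedly kept working in this case). The function names parse_state and count_states are made up for illustration; "D" in the output is uninterruptible sleep, "S" is regular sleep.

```python
import os
from collections import Counter


def parse_state(stat_line):
    """Return the one-letter state field from a /proc/<pid>/stat line.

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split on the LAST ')' rather than on spaces.
    """
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


def count_states(proc_root="/proc"):
    """Count processes per state (e.g. 'D', 'S', 'R') under proc_root."""
    counts = Counter()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                counts[parse_state(f.read())] += 1
        except (OSError, IndexError):
            continue  # process exited or the line was unreadable
    return counts


if __name__ == "__main__":
    for state, n in sorted(count_states().items()):
        print(state, n)
```

Since /proc is served from kernel memory, reading it should not itself need a write to the stuck filesystem, though that is an assumption about this particular hang rather than something tested here.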

ANY process that tries to write to disk (logging, ls > /tmp/test, etc.) appears to hang - so business_kid may well be on to something.
sudo hangs at least after a while (as the load increases).

However, dmesg shows nothing at all out of the ordinary, and as I said, log files stop working... So how the heck do I continue to narrow this down?
smartctl -A looks good for all disks (taken just after the crash "started" last time, before it froze completely).

business_kid 04-07-2012 09:59 AM

If I'm right (and the fact that you're out of other ideas tends to support this), it's a hardware issue. Low is not going low, or high is not going high. No point in checking logs - Catch-22! If you could write logs, you wouldn't have an error :-).

Check your BIOS settings for a disk drive current setting, and change it.

Unhook some disks and try it. Failing that, it's the ide/sata card, or Southbridge.

exscape 04-07-2012 10:26 AM

Quote:

Originally Posted by business_kid (Post 4647133)
If I'm right (and the fact that you're out of other ideas tends to support this), it's a hardware issue. Low is not going low, or high is not going high. No point in checking logs - Catch-22! If you could write logs, you wouldn't have an error :-).

Check your BIOS settings for a disk drive current setting, and change it.

Unhook some disks and try it. Failing that, it's the ide/sata card, or Southbridge.

Right, logs clearly won't work, but dmesg should (it's a RAM buffer), and that is empty. I've never previously had a HDD/storage controller issue with absolutely no error messages in dmesg while it's happening. :/

business_kid 04-07-2012 11:33 AM

Errors on stdout? That's where I got them. Some random terminal, if disaster struck while that was focused, otherwise (because I use runlevel 3 & startx) on Ctrl_Alt_F1. It's also possible it's a heat-related memory issue, or a data transfer issue (i.e. northbridge - southbridge).

Right now, on this box, the last dmesg entries are from wlan0, but there's a bucketful of 'OLE' errors from wine on Ctrl_Alt_F1.

Look for a link to thermal changes.

exscape 04-07-2012 11:38 AM

Probably not thermal, at least not northbridge/CPU as both are under control. I've also stress-tested the CPU with no problems - and to add to that, it's currently slightly underclocked, whereas I've had it 36% OC'ed for 2 years or so.

There are no errors over SSH, and not on the first "local console" either. :/

business_kid 04-07-2012 01:38 PM

Move slowly now.

Think it through carefully. You're very close to it; you're going to solve this, because I don't have the details.

No kernel crash, no errors anywhere, but no disk activity; you say
Quote:

I *can* run commands. Some,
Shell commands requiring no disk activity might run (can you think of one?), as might anything in RAM. Otherwise, we need to know what you can run. I get the feeling it's queuing stuff and saying nothing. Maybe that could be explained by a very low log level.

Then try anything you can remove in some other box.

exscape 04-07-2012 01:41 PM

Quote:

Originally Posted by business_kid (Post 4647265)
Move slowly now.

Think it through carefully. You're very close to it; you're going to solve this, because I don't have the details.

No kernel crash, no errors anywhere, but no disk activity; you say

Shell commands requiring no disk activity might run (can you think of one?), as might anything in RAM. Otherwise, we need to know what you can run. I get the feeling it's queuing stuff and saying nothing. Maybe that could be explained by a very low log level.

Then try anything you can remove in some other box.

Commands that only *read* from the disk work (including relatively disk-intensive stuff like find /) until the load creeps up way too high. Writing to / (sda2) stops working when it hangs (I assume that's exactly what makes it hang in the first place). Writing to an mdadm RAID1 array composed of a sda partition and a sdc partition does seem to work - and iostat even reports written data to sda during that.
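
A rough way to check which mount points still accept writes, without wedging the shell you are typing in, is to do the write in a child process and give up on it after a timeout. The sketch below is only an illustration under some assumptions: probe_write and the probe file name are invented here, the 10-second timeout is arbitrary, and a child stuck in uninterruptible sleep cannot actually be killed, so the script simply stops waiting for it.

```python
import os
import subprocess
import sys
import time

# Code run in the child: create a small file, force it to disk, remove it.
_CHILD = (
    "import os, sys\n"
    "p = os.path.join(sys.argv[1], '.write_probe_%d' % os.getpid())\n"
    "fd = os.open(p, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)\n"
    "os.write(fd, b'probe\\n')\n"
    "os.fsync(fd)\n"
    "os.close(fd)\n"
    "os.unlink(p)\n"
)


def probe_write(directory, timeout=10.0):
    """Return 'ok', 'error', or 'hung' for a small synced write in directory."""
    child = subprocess.Popen(
        [sys.executable, "-c", _CHILD, directory],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        rc = child.poll()  # non-blocking check, never waits on the child
        if rc is not None:
            return "ok" if rc == 0 else "error"
        time.sleep(0.1)
    child.kill()  # best effort; a child in uninterruptible sleep will not die
    return "hung"


if __name__ == "__main__":
    # Directories to test are given on the command line, e.g. / and /boot.
    for d in sys.argv[1:]:
        print(d, probe_write(d))
```

Run it with the directories to test as arguments (for example / and /boot, as root). A "hung" result for one directory and "ok" for another on the same disk would point at the filesystem or partition rather than the drive as a whole.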

Since that makes it sound as if only the root *partition* is broken, the next thing I'm trying is to simply fsck it (and/or, the next time this happens, I'll try writing to /boot, which is a separate partition on sda), though my hopes aren't too high that it will help.

business_kid 04-08-2012 02:57 AM

That kinda narrows it to the hard disk, or conceivably the controller.
Hard disks go in 2 ways: the board, or (more commonly) surface errors. The other possibility is the kernel. I would reach for a standard-issue 2.6 kernel from your distro and try that. If that crashes too, you can eliminate the kernel.

Chips usually fail catastrophically. But some failures involve only a few of the hundreds of thousands of internal devices, and never affect you until that part of the circuit becomes relevant. Then that little bit can lock up an otherwise good IC in some way.

FWIW, I have a laptop here (this laptop) which runs 3.0.4 and hangs occasionally - usually video related e.g. running X, hit Ctrl_Alt_F3 for a terminal I'm logged into, and it's gone. Mouse moves but does nothing, keyboard is awol. It's particularly tricky when I have it running an external monitor, suspend it, and resume elsewhere without the monitor. I couldn't give an <expletive deleted> what is hung or not hung, I just kill it and restart.

exscape 04-18-2012 07:32 AM

Quote:

Originally Posted by exscape (Post 4647271)
Writing to / (sda2) stops working when it hangs (I assume that's exactly what makes it hang in the first place). Writing to an mdadm RAID1 array composed of a sda partition and a sdc partition does seem to work - and iostat even reports written data to sda during that.

Since that makes it sound as if only the root *partition* is broken, the next thing I'm trying is to simply fsck it (and/or, the next time this happens, I'll try writing to /boot, which is a separate partition on sda), though my hopes aren't too high that it will help.

It sure looks like this part was correct!
I did try writing to /boot, which worked flawlessly despite / having hung.
First, I booted off a LiveCD to run reiserfsck on /, which reported no errors. Since it considered the partition OK, and no changes had been made (same kernel for many months before the crashes, same reiserfsprogs), I decided to try a filesystem switch. So, I moved all data off / (LiveCD again), formatted it as ext4, and moved the data back.

I haven't had a crash since, despite stress tests. Uptime > 1.5 weeks, whereas it used to crash within 2 hours of stress testing.

:)

business_kid 04-18-2012 11:41 AM

So, reiserfs was hanging. Good detective work. The boys over at the reiser project need a bug report from you, and you can mark this solved.
Sorry for the continual misdirection, but I was stuck in a hardware loop.


All times are GMT -5.