LinuxQuestions.org
Old 01-20-2007, 11:12 PM   #1
Soynuts
LQ Newbie
 
Registered: Jan 2006
Posts: 14

Rep: Reputation: 0
Total System lock up


I am completely baffled as to why my system locks up every 5 days. I am currently running Gentoo on my dedicated colo box, so my main way of access is SSH. Physical access would require a 3-hour drive on my part, and I don't want to do that unless the box is in a really critical state.

So, about every 5 days (plus or minus a few hours) the entire system stops responding. Log files stop being written to and most applications stop responding (though Apache still responds as long as it's a plain HTML page). SSH attempts do not go through and just time out. If I happen to have left an SSH window open when it happens, commands take 5+ minutes to respond and then fail to execute. I believe the entire system somehow goes into read-only mode (since it stops writing to the log files), but I have no idea what would cause that. Logs show nothing useful. The only way to fix it is to call in and have the NOC hard reboot my system. At this point I am at a total loss as to what to do, aside from taking my server down for a week while I run some real tests on it.
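One way to check the read-only suspicion from an SSH session that is still open is to look at the mount options the kernel reports. The following is only a minimal sketch, assuming Python is available on the box and that /proc/mounts is readable; the function names are made up for illustration.

```python
def read_only_mounts(mounts_text):
    """Return (mount_point, device) pairs whose mount options include 'ro'.

    mounts_text is the content of /proc/mounts: one mount per line, with
    whitespace-separated fields device, mount point, fs type, options, ...
    """
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mount_point, _fstype, options = fields[:4]
        # Compare whole options so "errors=remount-ro" is not a false hit.
        if "ro" in options.split(","):
            found.append((mount_point, device))
    return found


def current_read_only_mounts(path="/proc/mounts"):
    """Read the live mount table and report read-only mounts."""
    with open(path) as fh:
        return read_only_mounts(fh.read())
```

If current_read_only_mounts() lists the root filesystem during a lock-up, that would support the idea that the kernel remounted it read-only after an error. It would not say why.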

I can only think of two issues:

1) A process has gone out of control (I hope it is this; then it's just a matter of finding it)
2) Hard drive failing (hope it isn't this... the entire box is only 6 months old and would be a pain to fix)

Are there any tests I can perform remotely that could help me track down the cause of this?

Here are the apps I am currently running on my Gentoo box:
MySQL5
PHP5
Apache2.2
Postfix
ClamSMTP
ClamAV
dSpam
Postgrey
Some Counter-Strike Source servers
Bind9 Server
Shorewall
Syslog-ng
Fail2Ban
 
Old 01-21-2007, 12:13 AM   #2
Matir
Moderator
 
Registered: Nov 2004
Location: San Jose, CA
Distribution: Ubuntu
Posts: 8,507

Rep: Reputation: 118
You could test/check the hard drive by using the smartmontools package to perform a SMART test on the drive. Have you looked at the logs after rebooting the system?
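As a rough sketch of what that could look like when scripted, the snippet below builds smartctl command lines (smartctl is the command-line tool from smartmontools) and checks the health line in its output. The device name /dev/sda and the helper names are assumptions for illustration, and the health-line wording is the one smartctl prints for ATA drives; other drive types may word it differently.

```python
import subprocess


def smartctl_argv(device, action):
    """Build the smartctl command line for one of a few common actions."""
    actions = {
        "health": ["-H"],                # overall health self-assessment
        "attributes": ["-A"],            # vendor-specific attribute table
        "short_test": ["-t", "short"],   # start a short self-test
        "test_log": ["-l", "selftest"],  # results of earlier self-tests
    }
    return ["smartctl"] + actions[action] + [device]


def run_smartctl(device, action):
    """Run smartctl (usually needs root) and return its standard output.

    smartctl uses its exit status as a bit mask, so a non-zero status is
    not treated as a failure to run here.
    """
    result = subprocess.run(smartctl_argv(device, action),
                            capture_output=True, text=True)
    return result.stdout


def health_passed(output):
    """Return True/False from the overall-health line, or None if absent."""
    for line in output.splitlines():
        if "overall-health self-assessment test result" in line:
            return line.rsplit(":", 1)[1].strip() == "PASSED"
    return None
```

For example, health_passed(run_smartctl("/dev/sda", "health")) gives a quick yes/no, and running the "short_test" action followed a few minutes later by "test_log" shows whether the self-test completed without errors. A passing result does not rule out a cable or controller problem.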
 
Old 01-21-2007, 09:55 AM   #3
Soynuts
LQ Newbie
 
Registered: Jan 2006
Posts: 14

Original Poster
Rep: Reputation: 0
Yeah, the logs after a reboot show nothing out of the ordinary. No warnings or errors show up in /var/log/messages or in any of the other log files I have in /var/log. Everything shows up as starting up correctly.

I'll give smartmontools a try. =)
 
Old 01-21-2007, 02:24 PM   #4
Soynuts
LQ Newbie
 
Registered: Jan 2006
Posts: 14

Original Poster
Rep: Reputation: 0
Hmmm, well, smartmontools showed that everything seemed to be just fine. These problems started to show up about the time I installed Postfix, but I can't say for sure that Postfix is the cause.
 
Old 01-21-2007, 03:31 PM   #5
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 12,201

Rep: Reputation: 1015
Seems you need some history data to see if you have a CPU and/or memory consumption issue. Have a look at the sysstat package, or simply run something like top or vmstat in batch mode, saving to disk.
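If installing sysstat is not convenient, a small home-made recorder gives similar history. This is only a sketch under assumptions: it reads /proc/loadavg and /proc/meminfo, the function names and the log format are invented for illustration, and it forces each line to disk so that the last samples before a lock-up have a better chance of surviving.

```python
import os
import time


def parse_meminfo(text):
    """Map /proc/meminfo keys to integer values (kB for most entries)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[key.strip()] = int(parts[0])
    return values


def snapshot_line(loadavg_text, meminfo_text, timestamp):
    """Format one history line from the raw /proc file contents."""
    load1, load5, load15 = loadavg_text.split()[:3]
    mem = parse_meminfo(meminfo_text)
    return "%s load=%s,%s,%s memfree_kb=%d swapfree_kb=%d" % (
        timestamp, load1, load5, load15,
        mem.get("MemFree", -1), mem.get("SwapFree", -1))


def record_forever(log_path, interval_seconds=60):
    """Append one snapshot per interval, syncing each line to disk."""
    while True:
        with open("/proc/loadavg") as fh:
            loadavg = fh.read()
        with open("/proc/meminfo") as fh:
            meminfo = fh.read()
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        with open(log_path, "a") as out:
            out.write(snapshot_line(loadavg, meminfo, stamp) + "\n")
            out.flush()
            os.fsync(out.fileno())  # push the line out of the page cache
        time.sleep(interval_seconds)
```

Calling record_forever() with a log path of your choice and leaving it running in the background would show whether load or free memory trends badly in the hours before a hang. If the disk itself stops accepting writes, the file will simply end early, which is a clue in its own right.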
 
Old 01-22-2007, 07:49 AM   #6
Soynuts
LQ Newbie
 
Registered: Jan 2006
Posts: 14

Original Poster
Rep: Reputation: 0
Ugh, now I'm seeing this in my /var/log/messages log:

Quote:
Jan 22 06:23:41 UNBUBox ata2: status=0xd0 { Busy }
Jan 22 06:23:41 UNBUBox sd 1:0:0:0: SCSI error: return code = 0x8000002
Jan 22 06:23:41 UNBUBox sda: Current: sense key: Aborted Command
Jan 22 06:23:41 UNBUBox Additional sense: Scsi parity error
Jan 22 06:23:41 UNBUBox end_request: I/O error, dev sda, sector 264544441
Jan 22 06:23:41 UNBUBox lost page write due to I/O error on sda4
Jan 22 06:23:41 UNBUBox ATA: abnormal status 0xD0 on port 0xFFFFC2000000411C
Jan 22 06:23:41 UNBUBox ATA: abnormal status 0xD0 on port 0xFFFFC2000000411C
Jan 22 06:23:41 UNBUBox ATA: abnormal status 0xD0 on port 0xFFFFC2000000411C
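To see whether errors like these keep hitting the same device or sectors, the log can be summarized with a short script. This is a sketch only: the two patterns are taken from the message wording quoted above (other kernel versions may phrase them differently) and the function name is made up.

```python
import re

# Patterns based on the kernel messages quoted in this post.
IO_ERROR = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")
LOST_WRITE = re.compile(r"lost page write due to I/O error on (\w+)")


def summarize_io_errors(log_lines):
    """Collect failing sectors per device and lost-write counts per partition."""
    sectors = {}
    lost_writes = {}
    for line in log_lines:
        match = IO_ERROR.search(line)
        if match:
            sectors.setdefault(match.group(1), []).append(int(match.group(2)))
            continue
        match = LOST_WRITE.search(line)
        if match:
            name = match.group(1)
            lost_writes[name] = lost_writes.get(name, 0) + 1
    return sectors, lost_writes
```

Feeding it the lines of the messages log returns two dictionaries. Repeated errors on the same few sectors would point more towards the disk surface, while errors scattered across the whole device would fit a cable or controller problem better, though neither pattern is conclusive.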
 
Old 01-22-2007, 08:45 AM   #7
Matir
Moderator
 
Registered: Nov 2004
Location: San Jose, CA
Distribution: Ubuntu
Posts: 8,507

Rep: Reputation: 118
That's probably not a great thing to be seeing. Is this the first instance? I would start by making sure all of your cables are firmly seated. This would be more likely with PATA, but it doesn't hurt to try it.
 
Old 01-22-2007, 08:57 AM   #8
Soynuts
LQ Newbie
 
Registered: Jan 2006
Posts: 14

Original Poster
Rep: Reputation: 0
Yeah, this is the first time I'm seeing it. It's been 6 months since I installed everything, and I can't imagine a cable suddenly coming loose in my 1U colo box very easily. I'm leaning more towards the hard drive failing, but at least that should be covered under warranty, as I haven't had it long. It's just going to be a pain to go to the server.
 
Old 01-22-2007, 09:03 AM   #9
Matir
Moderator
 
Registered: Nov 2004
Location: San Jose, CA
Distribution: Ubuntu
Posts: 8,507

Rep: Reputation: 118
Yeah, just be careful. I once had a problem that looked like a hard drive failure... RMAed the drive... had another "failure" of the replacement drive, RMAed it again... turns out the motherboard was dying on me and its controller was making it look like the drive was dying.
 
Old 01-22-2007, 09:56 AM   #10
Soynuts
LQ Newbie
 
Registered: Jan 2006
Posts: 14

Original Poster
Rep: Reputation: 0
I'll definitely keep that in mind when I go and check out the box for myself.
 
  

