LinuxQuestions.org
Linux - Server This forum is for the discussion of Linux Software used in a server related context.

Old 11-09-2010, 03:15 PM   #1
LyCC
LQ Newbie
 
Registered: Nov 2010
Posts: 6

Rep: Reputation: 0
Question Slackware 13 64 bit (+hybrid) server crash


Hello everyone,

My friends and I have a Slackware 13 server (64-bit hybrid) on a quad-core AMD. Before installing Slackware we ran a ~24-hour test with StressLinux (load 18-19) and it went fine. So we installed Slackware, etc., and then came across a bizarre crash that we can't find any reason for. (The server has been up for two months and has died 3 times so far, but this last time we actually looked into the logs, and we can't explain what we found.) I was browsing some pages when I noticed that everything had just stopped. I was still able to log in via SSH, and indeed all programs had disappeared from the top list, so I decided to reboot. After the reboot command, the whole system just died and I had to physically power it off and turn it back on. The last thing in the messages log is that the system is going to reboot.
Here are the messages (the "xxxxx" text is what I censored):

Nov 9 20:22:16 xxxxxxxxxxx -- MARK --
Nov 9 20:42:16 xxxxxxxxxxx -- MARK --
Nov 9 21:02:16 xxxxxxxxxxx -- MARK --
Nov 9 21:19:27 xxxxxxxxxxx kernel: >Pid: 4206, comm: httpd Tainted: G MB D 2.6.33.4 #3
Nov 9 21:22:38 xxxxxxxxxxx kernel: Mem-Info:
Nov 9 21:22:38 xxxxxxxxxxx kernel: 490992 pages RAM
Nov 9 21:22:38 xxxxxxxxxxx kernel: 12199 pages reserved
Nov 9 21:22:38 xxxxxxxxxxx kernel: 102254 pages shared
Nov 9 21:22:38 xxxxxxxxxxx kernel: 363584 pages non-shared
Nov 9 21:22:38 xxxxxxxxxxx kernel: Mem-Info:
Nov 9 21:22:38 xxxxxxxxxxx kernel: 490992 pages RAM
Nov 9 21:22:38 xxxxxxxxxxx kernel: 12199 pages reserved
Nov 9 21:22:38 xxxxxxxxxxx kernel: 99967 pages shared
Nov 9 21:22:38 xxxxxxxxxxx kernel: 361921 pages non-shared
Nov 9 21:22:38 xxxxxxxxxxx kernel: Mem-Info:
Nov 9 21:22:38 xxxxxxxxxxx kernel: 490992 pages RAM
Nov 9 21:22:38 xxxxxxxxxxx kernel: 12199 pages reserved
Nov 9 21:22:38 xxxxxxxxxxx kernel: 98050 pages shared
Nov 9 21:22:38 xxxxxxxxxxx kernel: 362992 pages non-shared
Nov 9 21:22:38 xxxxxxxxxxx kernel: Mem-Info:
Nov 9 21:22:38 xxxxxxxxxxx kernel: 490992 pages RAM
Nov 9 21:22:38 xxxxxxxxxxx kernel: 12199 pages reserved
Nov 9 21:22:38 xxxxxxxxxxx kernel: 95783 pages shared
Nov 9 21:22:38 xxxxxxxxxxx kernel: 350205 pages non-shared
Nov 9 21:22:38 xxxxxxxxxxx kernel: Mem-Info:
Nov 9 21:22:38 xxxxxxxxxxx kernel: 490992 pages RAM
Nov 9 21:22:38 xxxxxxxxxxx kernel: 12199 pages reserved
Nov 9 21:22:38 xxxxxxxxxxx kernel: 94486 pages shared
Nov 9 21:22:38 xxxxxxxxxxx kernel: 327130 pages non-shared
Nov 9 21:22:39 xxxxxxxxxxx kernel: Mem-Info:
Nov 9 21:22:39 xxxxxxxxxxx kernel: 490992 pages RAM
Nov 9 21:22:39 xxxxxxxxxxx kernel: 12199 pages reserved
Nov 9 21:22:39 xxxxxxxxxxx kernel: 94010 pages shared
Nov 9 21:22:39 xxxxxxxxxxx kernel: 326916 pages non-shared
Nov 9 21:22:39 xxxxxxxxxxx kernel: Mem-Info:
Nov 9 21:22:39 xxxxxxxxxxx kernel: 490992 pages RAM
Nov 9 21:22:39 xxxxxxxxxxx kernel: 12199 pages reserved
Nov 9 21:22:39 xxxxxxxxxxx kernel: 91592 pages shared
Nov 9 21:22:39 xxxxxxxxxxx kernel: 299799 pages non-shared
Nov 9 21:22:39 xxxxxxxxxxx kernel: Mem-Info:
Nov 9 21:22:39 xxxxxxxxxxx kernel: 490992 pages RAM
Nov 9 21:22:39 xxxxxxxxxxx kernel: 12199 pages reserved
Nov 9 21:22:39 xxxxxxxxxxx kernel: 90391 pages shared
Nov 9 21:22:39 xxxxxxxxxxx kernel: 261243 pages non-shared
Nov 9 21:22:39 xxxxxxxxxxx kernel: Mem-Info:
Nov 9 21:22:39 xxxxxxxxxxx kernel: 490992 pages RAM
Nov 9 21:22:39 xxxxxxxxxxx kernel: 12199 pages reserved
Nov 9 21:22:39 xxxxxxxxxxx kernel: 89119 pages shared
Nov 9 21:22:39 xxxxxxxxxxx kernel: 234625 pages non-shared
Nov 9 21:23:52 xxxxxxxxxxx kernel: lo: Disabled Privacy Extensions
Nov 9 21:23:57 xxxxxxxxxxx kernel: lo: Disabled Privacy Extensions
Nov 9 21:25:05 xxxxxxxxxxx sshd[4575]: Accepted password for xxxxx from xxxxxxxxxxxx port 60298 ssh2
Nov 9 21:28:21 xxxxxxxxxxx shutdown[4597]: shutting down for system reboot

And from here on is the new boot, after I physically restarted the whole server... If anyone has seen this problem, please let me know of possible causes and solutions.
I would appreciate any help. Thank you.
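
For anyone reading the Mem-Info blocks above: the page counts can be turned into sizes with a little arithmetic. This is only a minimal sketch, assuming the usual 4 KiB page size on x86-64 (the log does not state the page size); the helper names `pages_to_mib` and `non_shared_series` are made up for illustration.

```python
import re

PAGE_SIZE_BYTES = 4096  # assumption: standard 4 KiB pages on x86-64


def pages_to_mib(pages, page_size=PAGE_SIZE_BYTES):
    """Convert a kernel page count to MiB."""
    return pages * page_size / (1024 * 1024)


def non_shared_series(log_text):
    """Return every 'pages non-shared' value in the order it was logged."""
    return [int(n) for n in re.findall(r"(\d+) pages non-shared", log_text)]


# A few lines copied from the log above.
sample = """\
Nov 9 21:22:38 xxxxxxxxxxx kernel: 490992 pages RAM
Nov 9 21:22:38 xxxxxxxxxxx kernel: 363584 pages non-shared
Nov 9 21:22:39 xxxxxxxxxxx kernel: 234625 pages non-shared
"""

total_mib = pages_to_mib(490992)
series = non_shared_series(sample)
print(f"RAM: {total_mib:.0f} MiB")
print(f"non-shared pages went from {series[0]} to {series[-1]}")
```

Under that assumption, "490992 pages RAM" is a little under 2 GiB, and the non-shared count drops steadily across the repeated dumps.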

Last edited by LyCC; 11-09-2010 at 03:17 PM.
 
Old 11-10-2010, 08:15 AM   #2
udaman
Member
 
Registered: Oct 2010
Location: New England, USA
Distribution: OpenSUSE/Slackware64/RHEL/Mythbuntu
Posts: 189

Rep: Reputation: 39
It's very difficult to tell, from the little info you've posted, what the problem might be. You could have bad memory chips, which is a simple fix: replace the suspected DIMMs. Or it could be some other failing hardware that's causing the OS to crash. Or it could have been a buffer overflow from a worm attack trying to gain access to your machine. I'd check all the hardware and run "chkrootkit" to see if there's a rootkit on your system. If there's a core dump file, you can read that to get some more clues.

http://www.chkrootkit.org/
 
Old 11-10-2010, 08:35 AM   #3
H_TeXMeX_H
LQ Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Rep: Reputation: 1301
If there's nothing else in the logs, I would try a newer kernel.
 
Old 11-11-2010, 08:17 AM   #4
LyCC
LQ Newbie
 
Registered: Nov 2010
Posts: 6

Original Poster
Rep: Reputation: 0
Thank you,

Well, I'm not an expert; here's what the rootkit tester said:

root@xxxxxxxxxx :/download/chkrootkit-0.49# ./chkrootkit -q
can't exec ./strings-static,
Message from syslogd@xxxxxxxxxx at Thu Nov 11 17:00:05 2010 ...
xxxxxxxxxx kernel: Code: 48 89 f3 49 89 d6 e8 41 33 3c 00 85 c0 75 1e 49 8b 95 f0 00 00 00 48 8b 8a 80 00 00 00 48 85 c9 74 22 4c 89 f2 48 89 de 4c 89 e7 <ff> d1 48 8b 5d e0 4c 8b 65 e8 4c 8b 6d f0 4c 8b 75 f8 c9 c3 0f

Message from syslogd@xxxxxxxxxx at Thu Nov 11 17:00:05 2010 ...
xxxxxxxxxx kernel: general protection fault: 0000 [#9] SMP

Message from syslogd@xxxxxxxxxx at Thu Nov 11 17:00:05 2010 ...
xxxxxxxxxx kernel: last sysfs file: /sys/devices/pci0000:00/0000:00:15.0/0000:05:00.1/host8/target8:0:0/8:0:0:0/model

Message from syslogd@xxxxxxxxxx at Thu Nov 11 17:00:05 2010 ...
xxxxxxxxxx kernel: Stack:

Message from syslogd@xxxxxxxxxx at Thu Nov 11 17:00:05 2010 ...
xxxxxxxxxx kernel: Call Trace:

Message from syslogd@xxxxxxxxxx at Thu Nov 11 17:00:05 2010 ...
xxxxxxxxxx kernel: general protection fault: 0000 [#10] SMP

Message from syslogd@xxxxxxxxxx at Thu Nov 11 17:00:05 2010 ...
xxxxxxxxxx kernel: Stack:

Message from syslogd@xxxxxxxxxx at Thu Nov 11 17:00:05 2010 ...
xxxxxxxxxx kernel: Call Trace:

Message from syslogd@xxxxxxxxxx at Thu Nov 11 17:00:05 2010 ...
xxxxxxxxxx kernel: Code: 48 89 f3 49 89 d6 e8 41 33 3c 00 85 c0 75 1e 49 8b 95 f0 00 00 00 48 8b 8a 80 00 00 00 48 85 c9 74 22 4c 89 f2 48 89 de 4c 89 e7 <ff> d1 48 8b 5d e0 4c 8b 65 e8 4c 8b 6d f0 4c 8b 75 f8 c9 c3 0f

Message from syslogd@xxxxxxxxxx at Thu Nov 11 17:00:05 2010 ...
xxxxxxxxxx kernel: last sysfs file: /sys/devices/pci0000:00/0000:00:15.0/0000:05:00.1/host8/target8:0:0/8:0:0:0/model

Message from syslogd@xxxxxxxxxx at Thu Nov 11 17:00:06 2010 ...
xxxxxxxxxx kernel: general protection fault: 0000 [#11] SMP

Message from syslogd@xxxxxxxxxx at Thu Nov 11 17:00:06 2010 ...
xxxxxxxxxx kernel: Call Trace:

Message from syslogd@xxxxxxxxxx at Thu Nov 11 17:00:06 2010 ...
xxxxxxxxxx kernel: Stack:

Message from syslogd@xxxxxxxxxx at Thu Nov 11 17:00:06 2010 ...
xxxxxxxxxx kernel: Code: 48 89 f3 49 89 d6 e8 41 33 3c 00 85 c0 75 1e 49 8b 95 f0 00 00 00 48 8b 8a 80 00 00 00 48 85 c9 74 22 4c 89 f2 48 89 de 4c 89 e7 <ff> d1 48 8b 5d e0 4c 8b 65 e8 4c 8b 6d f0 4c 8b 75 f8 c9 c3 0f

Message from syslogd@xxxxxxxxxx at Thu Nov 11 17:00:06 2010 ...
xxxxxxxxxx kernel: last sysfs file: /sys/devices/pci0000:00/0000:00:15.0/0000:05:00.1/host8/target8:0:0/8:0:0:0/model

not tested: can't exec
not tested: can't exec ./ifpromisc
not tested: can't exec ./chkwtmp
not tested: can't exec ./chklastlog
not tested: can't exec ./chkutmp

So, if nothing else, I'll try a new kernel.
As for the memory modules: all components are new, 2 months old. I ran another memtest and everything was OK, so it seems most likely to be some hardware issue, right?

PS: when it dies, there's no ssh, ftp, http, nothing, but ping still works ...
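
A side note on the output above: the number in "[#9]", "[#10]", "[#11]" is the kernel's running count of oopses since boot, so this box had already faulted several times before chkrootkit was started. A rough sketch for pulling those counters out of a saved log; the function name `oops_counters` and the sample text are just for illustration, and the pattern only covers the "general protection fault" lines shown in this thread.

```python
import re

# Matches lines such as: "kernel: general protection fault: 0000 [#9] SMP"
OOPS_RE = re.compile(r"general protection fault: \w+ \[#(\d+)\]")


def oops_counters(log_text):
    """Return the oops counter from each general protection fault line."""
    return [int(m.group(1)) for m in OOPS_RE.finditer(log_text)]


# Lines copied from the syslogd broadcast above.
sample = """\
xxxxxxxxxx kernel: general protection fault: 0000 [#9] SMP
xxxxxxxxxx kernel: Call Trace:
xxxxxxxxxx kernel: general protection fault: 0000 [#10] SMP
xxxxxxxxxx kernel: general protection fault: 0000 [#11] SMP
"""

counters = oops_counters(sample)
print(f"{len(counters)} faults in this excerpt, highest counter: {max(counters)}")
```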

Last edited by LyCC; 11-11-2010 at 08:27 AM.
 
Old 11-11-2010, 09:44 AM   #5
udaman
Member
 
Registered: Oct 2010
Location: New England, USA
Distribution: OpenSUSE/Slackware64/RHEL/Mythbuntu
Posts: 189

Rep: Reputation: 39
As Tex_Mex suggested, maybe a newer kernel will solve your problem. I run Slack 13.1, and its latest stable kernel is 2.6.33.4. There is a newer kernel in Slack current called 2.6.35.?, but I don't run the current version.

How long has it been like this? Has it ever run correctly? What changes did you make since it last ran well?

Check that your system is updated and see if you still have problems. Then start looking at pieces of hardware.
 
Old 11-11-2010, 04:22 PM   #6
LyCC
LQ Newbie
 
Registered: Nov 2010
Posts: 6

Original Poster
Rep: Reputation: 0
Thank you for the tip about the kernel; the answer is yes and no.

I compiled the new kernel for the AMD K8 platform ... and when I booted it and started everything, I saw what the problem is:
the screen was flooded with errors. So it pointed out the problem: it turned out that one of the cores in the CPU was bad. In fact,
if I understand correctly, the bus over which data is transferred from L2 cache to L1 had a problem, so it was corrected
via ECC for as long as that was possible, and when it couldn't be any more ... that's that.

So I turned off the problematic core (nice to know that the extra money put into a motherboard pays off),
and at least now the errors are gone. Hopefully it will now run until we can get a new CPU (3 cores are better than 4 with crashes).
If it runs for more than 2 weeks now, then I dare say that this was the problem.

Here is the new error log (the old kernel compiled for x86 64 didn't report this problem with the CPU, but the new one did):
PS: the server is running much better now, more smoothly.

Nov 11 23:33:45 xxxxxx kernel: MC0_STATUS: Corrected error, other errors lost: yes, CPU context corrupt: no, CECC Error
Nov 11 23:33:45 xxxxxx kernel: Data Cache Error during L1 linefill from L2.
Nov 11 23:33:45 xxxxxx kernel: Transaction: data read, Type: data, Cache Level: L2
Nov 11 23:33:45 xxxxxx kernel: Disabling lock debugging due to kernel taint
Nov 11 23:33:45 xxxxxx kernel: MC1_STATUS: Corrected error, other errors lost: no, CPU context corrupt: no
Nov 11 23:33:45 xxxxxx kernel: Instruction Cache Error: Parity error during data load.
Nov 11 23:33:45 xxxxxx kernel: Transaction: inst fetch, Type: instruction, Cache Level: L1
Nov 11 23:33:45 xxxxxx kernel: MC2_STATUS: Corrected error, other errors lost: yes, CPU context corrupt: no, CECC Error
Nov 11 23:33:45 xxxxxx kernel: Bus Unit Error: evict error during data copyback.
Nov 11 23:33:45 xxxxxx kernel: Transaction: evict, Type: generic, Cache Level: L2
Nov 11 23:35:05 xxxxxx kernel: CE: hpet increased min_delta_ns to 7500 nsec
Nov 11 23:36:15 xxxxxx kernel: MC0_STATUS: Corrected error, other errors lost: yes, CPU context corrupt: no, CECC Error
Nov 11 23:36:15 xxxxxx kernel: Data Cache Error during L1 linefill from L2.
Nov 11 23:36:15 xxxxxx kernel: Transaction: data read, Type: data, Cache Level: L2
Nov 11 23:36:15 xxxxxx kernel: MC1_STATUS: Corrected error, other errors lost: yes, CPU context corrupt: no
Nov 11 23:36:15 xxxxxx kernel: Instruction Cache Error: Parity error during data load.
Nov 11 23:36:15 xxxxxx kernel: Transaction: inst fetch, Type: instruction, Cache Level: L1
Nov 11 23:36:15 xxxxxx kernel: MC2_STATUS: Corrected error, other errors lost: yes, CPU context corrupt: no, CECC Error
Nov 11 23:36:15 xxxxxx kernel: Bus Unit Error: evict error during data copyback.
Nov 11 23:36:15 xxxxxx kernel: Transaction: evict, Type: generic, Cache Level: L2
Nov 11 23:37:30 xxxxxx kernel: MC0_STATUS: Corrected error, other errors lost: yes, CPU context corrupt: no, CECC Error
Nov 11 23:37:30 xxxxxx kernel: Data Cache Error during L1 linefill from L2.
Nov 11 23:37:30 xxxxxx kernel: Transaction: data read, Type: data, Cache Level: L2
Nov 11 23:37:30 xxxxxx kernel: MC1_STATUS: Corrected error, other errors lost: no, CPU context corrupt: no
Nov 11 23:37:30 xxxxxx kernel: Instruction Cache Error: Parity error during data load.

and so on for many more megabytes ...

Hopefully this was the problem; if not, I will check back.
Thank you all for your help.
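
With megabytes of these messages, a tally per machine-check bank is easier to read than the raw log. A minimal sketch, assuming the line format stays exactly as in the excerpt above ("MCn_STATUS: <severity> error, ..."); the function name `tally_banks` is hypothetical, and the severity word is taken from whatever the log contains rather than from a fixed list.

```python
import re
from collections import Counter

# Matches e.g. "kernel: MC0_STATUS: Corrected error, other errors lost: yes, ..."
BANK_RE = re.compile(r"kernel: (MC\d+)_STATUS: (\w+) error")


def tally_banks(log_text):
    """Count machine-check status lines per (bank, severity) pair."""
    counts = Counter()
    for match in BANK_RE.finditer(log_text):
        counts[(match.group(1), match.group(2))] += 1
    return counts


# Lines copied from the error log above.
sample = """\
Nov 11 23:33:45 xxxxxx kernel: MC0_STATUS: Corrected error, other errors lost: yes, CPU context corrupt: no, CECC Error
Nov 11 23:33:45 xxxxxx kernel: Data Cache Error during L1 linefill from L2.
Nov 11 23:33:45 xxxxxx kernel: MC1_STATUS: Corrected error, other errors lost: no, CPU context corrupt: no
Nov 11 23:33:45 xxxxxx kernel: MC2_STATUS: Corrected error, other errors lost: yes, CPU context corrupt: no, CECC Error
Nov 11 23:36:15 xxxxxx kernel: MC0_STATUS: Corrected error, other errors lost: yes, CPU context corrupt: no, CECC Error
"""

counts = tally_banks(sample)
for (bank, severity), n in sorted(counts.items()):
    print(f"{bank}: {n} x {severity}")
```

To run it over a real log, read the file into a string and pass it to `tally_banks`; the path to the log is up to your syslog setup.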
 
Old 11-12-2010, 07:45 AM   #7
H_TeXMeX_H
LQ Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Rep: Reputation: 1301
Interesting, this is something I have not seen before.
 
  


All times are GMT -5.
