LinuxQuestions.org
05-17-2009, 07:33 AM   #1
baldy3105
Member
Registered: Jan 2003
Location: Cambridgeshire, UK
Distribution: Mint (Desktop), Debian (Server)
Posts: 891
Ubuntu Server Crashing


Hi All,

I have an Ubuntu 8.10 server which is crashing once every few days. When it goes I get a screen full of what looks like debug information, and as far as I can tell it is the same each time. The problem is that the machine is completely frozen; even ctrl-alt-del does nothing.

When I reboot I can find no trace of anything that looks like what I'm seeing on screen in any of the logs. Does this crash information get dumped anywhere that I can look at?

Going to take the server down tomorrow and memtest it for the day, but given that there is crash log info and it appears to be similar each time, doesn't that suggest the problem is not hardware related? Not CPU or memory anyway.

Thanks

Pete
 
05-17-2009, 04:06 PM   #2
cyent
Member
Registered: Aug 2001
Location: ChristChurch New Zealand
Distribution: Ubuntu
Posts: 398
The memtest idea is good. Very good.

Can you post the crash info somehow? That would be a big help.

Places to look are in...
sudo bash
cd /var/log
ls -lrt

Then look at logs touched around the time of the crash.
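
As a rough sketch of that search, the function below lists files under a log directory that were modified within the last N minutes. The name recent_logs and its defaults are made up for illustration. On Ubuntu of this era kernel messages normally end up in /var/log/kern.log and /var/log/syslog, so those are the ones to watch for.

```shell
# List files under a directory that changed within the last N minutes.
# Both arguments are optional: directory (default /var/log), minutes (default 60).
recent_logs() {
    dir=${1:-/var/log}
    minutes=${2:-60}
    # -mmin -N matches files modified less than N minutes ago.
    find "$dir" -type f -mmin "-$minutes" 2>/dev/null
}

# Example, run as root straight after rebooting from a crash:
#   recent_logs /var/log 120
```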

Otherwise, GASP, SCHLOCK, HORROR! You'll have to resort to those old pencil and paper things. :-) If you haven't got one mouldering in a drawer somewhere you can drag a laptop over to the dead server and manually transcribe stuff directly. :-)


Google for the magic hdparm trick for checking the S.M.A.R.T. stuff on the drives. Check to see if the drives aren't busy dying.
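
For what it's worth, the tool I know for reading S.M.A.R.T. data is smartctl from the smartmontools package rather than hdparm. A minimal sketch, assuming smartmontools is installed; the helper name smart_health and the device /dev/sda are only examples, so substitute your own drive.

```shell
# Print the overall SMART health verdict and the self-test log for one drive.
smart_health() {
    dev=$1
    if ! command -v smartctl >/dev/null 2>&1; then
        echo "smartctl not found; install the smartmontools package" >&2
        return 127
    fi
    smartctl -H "$dev"            # overall health verdict
    smartctl -l selftest "$dev"   # results of earlier self-tests
}

# Run as root, for example:
#   smart_health /dev/sda
#   smartctl -t long /dev/sda    # start an extended self-test in the background
```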
 
05-18-2009, 04:43 AM   #3
baldy3105
Original Poster
Hi,

How to post the crash info was basically my question! I could write down what's on the screen, but I get the impression that it's the tail end of something, and as the machine has frozen I can't scroll back. I was hoping the console output would be duplicated in a log file somewhere, but it has eluded me so far.

My next plan is to output console info to the serial port and capture it to another machine. I just have to work out how, but I've found a few web pages to read about that.

It's been memtesting for a couple of hours now and is clean so far. The problem is how long to leave it: if it's crashing every couple of days, then I guess I need to leave it for at least three days :-(

I've had smartd running on the machine for a few weeks, and while the /dev/sdc partition keeps getting knocked out of my RAID array when it crashes, the extended SMART tests have never shown an issue with the drives, so I'm labelling that as a symptom at the moment, not a cause.

I need to find a program to stress test the disk I/O, I'm hoping there is something suitable on Ultimate Boot CD, or Gparted Magic or the like.

Thanks for the input anyway, I wanted to make sure I wasn't going about this the hard way when there was a short cut!

Cheers

Pete
 
05-18-2009, 10:47 AM   #4
baldy3105
Original Poster

This is what I got from capturing the console output via serial port.

For anyone who's interested: this output did not make it into any log that I could find, so this was the only way I could capture it.

I used the kernel option "console=" in GRUB to duplicate the console output to my serial port. The last console that you specify is the interactive one that you can log into, so if you want to use your system as normal while sending console output to the first serial port, you would put "console=ttyS0,38400n8 console=tty0" on the end of your kernel options.

If you only specify the serial port you will not be able to log in via the keyboard and screen (allegedly, I didn't try it). I also read that doing so can cause Red Hat's hardware detection to throw a wobbler, FYI.
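
The ordering rule above can be sketched as a small helper that appends the two console= arguments to an existing GRUB kernel line. The helper name add_serial_console, the kernel path and the root= value are placeholders for illustration and are not taken from the actual menu.lst; only the console= arguments come from the post.

```shell
# Append serial + VGA console arguments to a GRUB "kernel" line.
# Order matters: the serial console goes first and tty0 last, so the
# local screen and keyboard stay the interactive console.
add_serial_console() {
    line=$1
    port=${2:-ttyS0}
    speed=${3:-38400n8}
    case "$line" in
        *console=*) printf '%s\n' "$line" ;;   # already configured, leave as-is
        *) printf '%s console=%s,%s console=tty0\n' "$line" "$port" "$speed" ;;
    esac
}

# Hypothetical kernel line; prints it with
# "console=ttyS0,38400n8 console=tty0" appended.
add_serial_console "kernel /boot/vmlinuz-2.6.27-7-server root=/dev/md0 ro"
```

On the capturing machine, something like "screen -L /dev/ttyS0 38400" (assuming a null-modem cable on its first serial port) should log whatever arrives to a screenlog file.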

Quote:
[ 4478.200037] ata4.00: exception Emask 0x12 SAct 0x0 SErr 0x4850400 action 0xe frozen
[ 4478.223047] ata4: SError: { Proto PHYRdyChg CommWake LinkSeq DevExch }
[ 4478.242628] ata4.00: cmd c8/00:20:3f:00:00/00:00:00:00:00/e0 tag 0 dma 16384 in
[ 4478.242630] res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x16 (ATA bus error)
[ 4478.288580] ata4.00: status: { DRDY }
[ 4509.330037] ata4.00: exception Emask 0x12 SAct 0x0 SErr 0x4850400 action 0xe frozen
[ 4509.353029] ata4: SError: { Proto PHYRdyChg CommWake LinkSeq DevExch }
[ 4509.372625] ata4.00: cmd c8/00:20:3f:00:00/00:00:00:00:00/e0 tag 0 dma 16384 in
[ 4509.372627] res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x16 (ATA bus error)
[ 4509.418578] ata4.00: status: { DRDY }
[ 4585.540035] ata4.00: exception Emask 0x12 SAct 0x0 SErr 0x4850400 action 0xe frozen
[ 4585.563025] ata4: SError: { Proto PHYRdyChg CommWake LinkSeq DevExch }
[ 4585.582605] ata4.00: cmd ca/00:00:3f:00:00/00:00:00:00:00/e0 tag 0 dma 131072 out
[ 4585.582607] res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x16 (ATA bus error)
[ 4585.629072] ata4.00: status: { DRDY }
[ 6932.610034] ata4.00: exception Emask 0x12 SAct 0x0 SErr 0x48d0401 action 0xe frozen
[ 6932.633027] ata4: SError: { RecovData Proto PHYRdyChg CommWake 10B8B LinkSeq DevExch }
[ 6932.656788] ata4.00: cmd 35/00:00:bf:7f:ad/00:04:0c:00:00/e0 tag 0 dma 524288 out
[ 6932.656789] res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x16 (ATA bus error)
[ 6932.703252] ata4.00: status: { DRDY }
[ 7079.890001]
[ 7079.890001] HARDWARE ERROR
[ 7079.890001] CPU 0: Machine Check Exception: 4 Bank 4: b200000000070f0f
[ 7079.890001] TSC bf21d0b4fc2
[ 7079.890001] This is not a software problem!
[ 7079.890001] Run through mcelog --ascii to decode and contact your hardware vendor
[ 7079.890001] Kernel panic - not syncing: Machine check
[ 7079.890001] ------------[ cut here ]------------
[ 7079.890001] WARNING: at /build/buildd/linux-2.6.27/kernel/smp.c:332 smp_call_function_mask+0x22c/0x240()
[ 7079.890001] Modules linked in: ipv6 iptable_filter ip_tables x_tables reiserfs sbp2 ieee1394 parport_pc lp parport loop evdev psmouse pcspkr serio_raw k8temp snd_intel8x0 snd_ac97_codec ac97_bus button snd_pcm snd_timer snd soundcore snd_page_alloc i2c_ali15x3 shpchp i2c_ali1563 i2c_ali1535 pci_hotplug i2c_core ext3 jbd mbcache usb_storage sr_mod cdrom sd_mod crc_t10dif sg sata_uli ata_generic libusual pata_ali pata_acpi ehci_hcd ohci_hcd libata usbcore scsi_mod dock 8139cp 8139too mii raid10 raid456 async_xor async_memcpy async_tx xor raid1 raid0 multipath linear md_mod thermal processor fan fbcon tileblit font bitblit softcursor fuse
[ 7079.890001] Pid: 2085, comm: md0_raid1 Tainted: G M 2.6.27-7-server #1
[ 7079.890001]
[ 7079.890001] Call Trace:
[ 7079.890001] <#MC> [<ffffffff8024e9b4>] warn_on_slowpath+0x64/0x90
[ 7079.890001] [<ffffffff8024f149>] ? wake_up_klogd+0x9/0x40
[ 7079.890001] [<ffffffff8024f384>] ? release_console_sem+0x204/0x210
[ 7079.890001] [<ffffffff8027852c>] smp_call_function_mask+0x22c/0x240
[ 7079.890001] [<ffffffff80225520>] ? stop_this_cpu+0x0/0x40
[ 7079.890001] [<ffffffff80500b19>] ? mutex_unlock+0x9/0x20
[ 7079.890001] [<ffffffff80284a44>] ? crash_kexec+0x74/0x100
[ 7079.890001] [<ffffffff80278560>] smp_call_function+0x20/0x30
[ 7079.890001] [<ffffffff802254e8>] native_smp_send_stop+0x28/0x60
[ 7079.890001] [<ffffffff804ff93d>] panic+0xb4/0x171
[ 7079.890001] [<ffffffff8024ea36>] ? do_oops_enter_exit+0x16/0x100
[ 7079.890001] [<ffffffff8024eb53>] ? oops_enter+0x13/0x20
[ 7079.890001] [<ffffffff80502ce4>] ? oops_begin+0x94/0xb0
[ 7079.890001] [<ffffffff8021fa89>] ? print_mce+0x89/0x110
[ 7079.890001] [<ffffffff8021fb8b>] mce_panic+0x7b/0x80
[ 7079.890001] [<ffffffff802201fa>] do_machine_check+0x4fa/0x540
[ 7079.890001] [<ffffffffa01410f0>] ? ata_scsi_flush_xlat+0x0/0x40 [libata]
[ 7079.890001] [<ffffffff80214132>] machine_check+0xa2/0xd0
[ 7079.890001] [<ffffffffa01410f0>] ? ata_scsi_flush_xlat+0x0/0x40 [libata]
[ 7079.890001] [<ffffffff803ae650>] ? iowrite8+0x20/0x50
[ 7079.890001] <<EOE>> [<ffffffffa014b514>] ata_sff_exec_command+0x24/0x40 [libata]
[ 7079.890001] [<ffffffffa014d973>] ata_sff_qc_issue+0x233/0x2b0 [libata]
[ 7079.890001] [<ffffffffa013bf60>] ata_qc_issue+0x1e0/0x260 [libata]
[ 7079.890001] [<ffffffffa00df8a0>] ? scsi_done+0x0/0x30 [scsi_mod]
[ 7079.890001] [<ffffffffa0142754>] ata_scsi_translate+0xb4/0x1a0 [libata]
[ 7079.890001] [<ffffffffa00df8a0>] ? scsi_done+0x0/0x30 [scsi_mod]
[ 7079.890001] [<ffffffffa014525e>] ata_scsi_queuecmd+0xbe/0x2e0 [libata]
[ 7079.890001] [<ffffffffa00dfb5b>] scsi_dispatch_cmd+0x11b/0x2e0 [scsi_mod]
[ 7079.890001] [<ffffffffa00e811f>] scsi_request_fn+0x30f/0x460 [scsi_mod]
[ 7079.890001] [<ffffffff8038f0e1>] elv_insert+0x151/0x2f0
[ 7079.890001] [<ffffffff8038f2f5>] __elv_add_request+0x75/0xc0
[ 7079.890001] [<ffffffff8039270e>] __make_request+0xce/0x430
[ 7079.890001] [<ffffffff803916e9>] generic_make_request+0x369/0x490
[ 7079.890001] [<ffffffff80318df3>] ? bvec_alloc_bs+0x83/0x110
[ 7079.890001] [<ffffffff80391891>] submit_bio+0x81/0x120
[ 7079.890001] [<ffffffff80318f77>] ? bio_clone+0x47/0x80
[ 7079.890001] [<ffffffffa005160e>] md_super_write+0xde/0xe0 [md_mod]
[ 7079.890001] [<ffffffffa0052936>] md_update_sb+0x286/0x3c0 [md_mod]
[ 7079.890001] [<ffffffffa0058b2a>] md_check_recovery+0x33a/0x6d0 [md_mod]
[ 7079.890001] [<ffffffffa007bbae>] raid1d+0x2e/0x430 [raid1]
[ 7079.890001] [<ffffffff80500815>] ? schedule_timeout+0x95/0xd0
[ 7079.890001] [<ffffffffa0050edc>] md_thread+0x5c/0x140 [md_mod]
[ 7079.890001] [<ffffffff80266fb0>] ? autoremove_wake_function+0x0/0x40
[ 7079.890001] [<ffffffffa0050e80>] ? md_thread+0x0/0x140 [md_mod]
[ 7079.890001] [<ffffffff80266b7e>] kthread+0x4e/0x90
[ 7079.890001] [<ffffffff80213c99>] child_rip+0xa/0x11
[ 7079.890001] [<ffffffff80266b30>] ? kthread+0x0/0x90
[ 7079.890001] [<ffffffff80213c8f>] ? child_rip+0x0/0x11
[ 7079.890001]
[ 7079.890001] ---[ end trace eae4b87e111b79c7 ]---
[ 7086.427528] Clocksource tsc unstable (delta = 4687006179 ns)
It's pretty clear about it NOT being a software fault but a hardware one (well, it WOULD say that, wouldn't it! ;-)), but I'm not totally clear on what the error actually was. I'm assuming that a CPU machine check exception means an internal CPU error.

Am I reading this right?
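
For reference, the top bits of the 64-bit status value in the "Machine Check Exception" line are flags defined by the x86 machine check architecture, so they can be decoded by hand. The sketch below (the function name mce_flags is made up) checks bits 63 to 57 of the value. It only decodes those generic flags; the bank number and the low error-code bits are CPU-specific, and mcelog --ascii remains the proper way to interpret them.

```shell
# Decode the generic flag bits (63..57) of an x86 MCi_STATUS value.
# Takes the 16 hex digits exactly as printed in the panic message.
mce_flags() {
    status=$1
    case "$status" in
        *[!0-9a-fA-F]*|"") echo "not a hex value: $status" >&2; return 1 ;;
    esac
    [ "${#status}" -eq 16 ] || { echo "expected 16 hex digits" >&2; return 1; }
    # Work on the upper 32 bits only, so bit 63 becomes bit 31 and the
    # shell's signed 64-bit arithmetic cannot overflow.
    hi=$(( 0x$(printf '%s' "$status" | cut -c1-8) ))
    out=""
    for pair in 31:VAL 30:OVER 29:UC 28:EN 27:MISCV 26:ADDRV 25:PCC; do
        bit=${pair%%:*}
        name=${pair#*:}
        if [ $(( (hi >> bit) & 1 )) -eq 1 ]; then
            out="${out:+$out }$name"
        fi
    done
    printf '%s\n' "$out"
}

mce_flags b200000000070f0f   # prints: VAL UC EN PCC
```

That reads as a valid (VAL), uncorrected (UC) error with processor context corrupt (PCC), which is why the kernel panics rather than carrying on. A machine check is raised by the CPU but does not necessarily mean the fault is inside the CPU; as far as I know, on AMD K8 systems (k8temp is in the module list) bank 4 reports northbridge errors, which would include bus problems.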
 
05-18-2009, 04:00 PM   #5
cyent
If memtest is coming up clean, I'd say it's the disk hardware.

In the old days it could be the disk controller, but since they have integrated them onto the motherboard I haven't seen a disk controller go flaky.

Check the connectors between the disk and motherboard. If those are good then, sorry, I think it is time for a new disk.
 
05-21-2009, 04:00 AM   #6
baldy3105
Original Poster
OK, so it looks like I've resolved the issue. You were definitely right to be looking at the HD/controller side of things; however, Ubuntu was misleading us all the while.

I started reversing any changes I had made since I first installed the base system and one of those things was that I hooked the PATA HD from my old server into the new one to copy various stuff across to the new SATA RAID array.

The server has now been stable for 3 days. I have managed to recover the RAID array without a crash for the first time in several weeks, and my log is blissfully free of error messages.
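
One way to confirm that the log really is clean is to count the "ATA bus error" lines in the kernel log before and after the change. A minimal sketch, run here against a few sample lines from the capture earlier in this thread; the function name count_ata_bus_errors is made up, and pointing it at /var/log/kern.log is an assumption about where this system keeps kernel messages.

```shell
# Count the lines in a log file that report an ATA bus error.
count_ata_bus_errors() {
    # grep -c prints the number of matching lines (0 if there are none).
    grep -c 'ATA bus error' "$1"
}

# Demo on three lines copied from the serial capture.
sample=$(mktemp)
cat > "$sample" <<'EOF'
[ 4478.242630] res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x16 (ATA bus error)
[ 4478.288580] ata4.00: status: { DRDY }
[ 4509.372627] res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x16 (ATA bus error)
EOF
count_ata_bus_errors "$sample"   # prints 2
rm -f "$sample"

# On the real system, something like:
#   count_ata_bus_errors /var/log/kern.log
```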

A bit of googling reveals that others have had problems mixing SATA and PATA disks on Ubuntu. They didn't get as far as crashing, but they've definitely seen the ATA bus errors.

So at present it looks like an issue with the disk controllers/drivers handling PATA and SATA simultaneously was causing bus errors, eventually corrupting one of my RAID partitions and crashing my system.

I have not had this setup running any other distro or OS, so I can't prove whether it was HW or SW. However, this issue appears to be isolated to Ubuntu, so it looks to me like a SW issue, and Ubuntu lied when it said this is not a software problem.

Thought I'd finish the thread in case it helps anyone else.

Thanks for your input.

Last edited by baldy3105; 05-21-2009 at 04:01 AM.
 
  

