LinuxQuestions.org
Old 03-18-2008, 09:55 AM   #1
tsg
Random Reboots - Slackware 10.2


I apologize if this has been covered before. I did search, but didn't find anything I haven't already tried. I am at my wits' end.

My mail server, running Slackware 10.2, is rebooting for reasons I can't discover. /var/log/syslog and /var/log/messages show nothing helpful. /var/log/debug only shows the following:

Code:
Mar 18 04:53:45 mail kernel: CPU:     After generic, caps: 3febf9ff 00000000 00000000 00000000
Mar 18 04:53:45 mail kernel: CPU:             Common caps: 3febf9ff 00000000 00000000 00000000
Mar 18 04:53:45 mail kernel: eth0:  Identified 8139 chip type 'RTL-8100B/8139D'
Mar 18 04:53:48 mail kernel: 00:0a.0: tulip_stop_rxtx() failed
I am logging sensors and uptime to /var/log/messages and neither shows any anomalies. The load average leading up to the reboot is minimal.

Code:
Mar 18 10:30:01 mail sensors: w83697hf-isa-0290
Mar 18 10:30:01 mail sensors: Adapter: ISA adapter
Mar 18 10:30:01 mail sensors: VCore:     +1.65 V  (min =  +1.62 V, max =  +1.78 V)
Mar 18 10:30:01 mail sensors: +3.3V:     +3.22 V  (min =  +3.14 V, max =  +3.46 V)
Mar 18 10:30:01 mail sensors: +5V:       +4.97 V  (min =  +4.74 V, max =  +5.24 V)
Mar 18 10:30:01 mail sensors: +12V:     +11.63 V  (min = +10.83 V, max = +13.19 V)
Mar 18 10:30:01 mail sensors: -12V:     -11.72 V  (min = -13.16 V, max = -10.90 V)
Mar 18 10:30:01 mail sensors: V5SB:      +5.46 V  (min =  +4.94 V, max =  +6.05 V)
Mar 18 10:30:01 mail sensors: VBat:      +3.14 V  (min =  +2.40 V, max =  +3.60 V)
Mar 18 10:30:01 mail sensors: CPUFan:   3341 RPM  (min = 2986 RPM, div = 4)
Mar 18 10:30:01 mail sensors: CPUTemp:   +40.5 C  (high =   +63 C, hyst =   +58 C)   sensor = diode           (beep)
Mar 18 10:30:01 mail sensors: alarms:
Mar 18 10:30:01 mail sensors: beep_enable:
Mar 18 10:30:01 mail sensors:           Sound alarm enabled
Mar 18 10:30:01 mail sensors:
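Checking each logged voltage against its own min/max by eye gets tedious over a day of entries. A minimal sketch of how that comparison might be scripted, assuming the line format shown above and a Python 3 interpreter; the names READING and out_of_range are made up for illustration and are not part of lm_sensors:

```python
import re

# Matches voltage lines such as:
#   "VCore:     +1.65 V  (min =  +1.62 V, max =  +1.78 V)"
# A leading syslog prefix ("Mar 18 10:30:01 mail sensors: ") is tolerated
# because the pattern is applied with search(), not match().
NUM = r"[+-]?\d+(?:\.\d+)?"
READING = re.compile(
    r"(?P<name>[^\s:]+):\s+(?P<value>" + NUM + r") V\s+"
    r"\(min =\s+(?P<min>" + NUM + r") V, max =\s+(?P<max>" + NUM + r") V\)"
)

def out_of_range(lines):
    """Return (name, value, min, max) for every voltage outside its limits."""
    problems = []
    for line in lines:
        m = READING.search(line)
        if m is None:
            continue  # fan, temperature, and header lines are skipped
        value, lo, hi = (float(m.group(k)) for k in ("value", "min", "max"))
        if not lo <= value <= hi:
            problems.append((m.group("name"), value, lo, hi))
    return problems
```

Run over the lines logged above, this would return an empty list, which matches the "no anomalies" reading. It only looks at voltages; fan and temperature lines use a different layout and are ignored here.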
I have run memtest86[+] and cpuburn (both fine), and have smartctl running on the hard drive with no errors.

The machine serves as my mail server, DNS server, and gateway to the internet. It's already running as bare-bones as I can make it, with a fairly restrictive firewall. There doesn't seem to be any particular pattern to the reboots (e.g., time of day, network traffic) that I can determine.

The only clue I have is that I can make it reboot by running zless /var/log/messages.1.gz and then searching for "Mar 18" (typing '/Mar 18' and hitting Enter). Obviously I'm not doing that at 4am.
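To separate the pager from the file itself, the same read-and-search could be repeated without zless. A minimal sketch, assuming a Python 3 interpreter is available on the box (which may not be true on a stock Slackware 10.2 install); count_matches is a hypothetical helper name:

```python
import gzip

def count_matches(path, needle):
    """Decompress a gzip log sequentially and count lines containing needle."""
    hits = 0
    # errors="replace" keeps stray non-UTF-8 bytes in the log from
    # aborting the scan part-way through the file.
    with gzip.open(path, "rt", errors="replace") as log:
        for line in log:
            if needle in line:
                hits += 1
    return hits
```

Calling count_matches("/var/log/messages.1.gz", "Mar 18") reads the whole file front to back. If that also triggers the reboot, the pager is probably not the culprit and the read of that file is the more likely suspect; if it doesn't, that says less, since zless may touch the file differently.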

I have three other machines that have been up for 36 days, but this one reboots several times a day.

I'm leaning towards a hardware problem but I'm having trouble isolating it. The machine isn't terribly old (the reboots have only started in the past few months and don't coincide with any new software) and I'd rather not have to replace the entire thing if I can avoid it, especially if it turns out to be a software problem.

Any suggestions would be appreciated. If I forgot any information that might be helpful, let me know and I will post it.

Thank you in advance.
 
Old 03-18-2008, 11:49 AM   #2
tsg
More info:

The error logs from smartctl show the following:

Code:
Error 269 occurred at disk power-on lifetime: 514 hours (21 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 32 b0 42 fd e1  Error: UNC 50 sectors at LBA = 0x01fd42b0 = 33374896

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 32 b0 42 fd e1 00      00:02:09.350  READ DMA
  c8 00 34 ae 42 fd e1 00      00:02:06.950  READ DMA
  c8 00 36 ac 42 fd e1 00      00:02:04.250  READ DMA
  c8 00 38 aa 42 fd e1 00      00:02:01.750  READ DMA
  c8 00 3a a8 42 fd e1 00      00:01:59.250  READ DMA

Error 268 occurred at disk power-on lifetime: 514 hours (21 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 34 b0 42 fd e1  Error: UNC 52 sectors at LBA = 0x01fd42b0 = 33374896

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 34 ae 42 fd e1 00      00:02:06.950  READ DMA
  c8 00 36 ac 42 fd e1 00      00:02:04.250  READ DMA
  c8 00 38 aa 42 fd e1 00      00:02:01.750  READ DMA
  c8 00 3a a8 42 fd e1 00      00:01:59.250  READ DMA
  c8 00 3c a6 42 fd e1 00      00:01:56.500  READ DMA

Error 267 occurred at disk power-on lifetime: 514 hours (21 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 36 b0 42 fd e1  Error: UNC 54 sectors at LBA = 0x01fd42b0 = 33374896

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 36 ac 42 fd e1 00      00:02:04.250  READ DMA
  c8 00 38 aa 42 fd e1 00      00:02:01.750  READ DMA
  c8 00 3a a8 42 fd e1 00      00:01:59.250  READ DMA
  c8 00 3c a6 42 fd e1 00      00:01:56.500  READ DMA
  c8 00 3e a4 42 fd e1 00      00:01:53.900  READ DMA

Error 266 occurred at disk power-on lifetime: 514 hours (21 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 38 b0 42 fd e1  Error: UNC 56 sectors at LBA = 0x01fd42b0 = 33374896

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 38 aa 42 fd e1 00      00:02:01.750  READ DMA
  c8 00 3a a8 42 fd e1 00      00:01:59.250  READ DMA
  c8 00 3c a6 42 fd e1 00      00:01:56.500  READ DMA
  c8 00 3e a4 42 fd e1 00      00:01:53.900  READ DMA
  c8 00 40 a2 42 fd e1 00      00:01:51.250  READ DMA

Error 265 occurred at disk power-on lifetime: 514 hours (21 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 3a b0 42 fd e1  Error: UNC 58 sectors at LBA = 0x01fd42b0 = 33374896

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 3a a8 42 fd e1 00      00:01:59.250  READ DMA
  c8 00 3c a6 42 fd e1 00      00:01:56.500  READ DMA
  c8 00 3e a4 42 fd e1 00      00:01:53.900  READ DMA
  c8 00 40 a2 42 fd e1 00      00:01:51.250  READ DMA
  c8 00 42 a0 42 fd e1 00      00:01:48.650  READ DMA
which makes me think the drive may be failing, or at least has a bad sector at the mentioned LBA. But the smartctl -A command shows:

Code:
smartctl version 5.33 [i686-pc-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   200   165   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   118   095   021    Pre-fail  Always       -       1475
  4 Start_Stop_Count        0x0032   100   100   040    Old_age   Always       -       697
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   200   200   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   049   049   000    Old_age   Always       -       37725
 10 Spin_Retry_Count        0x0013   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0013   100   100   051    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       498
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0012   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline      -       0
and the health status reads "PASSED". As far as I can tell from the smartctl man page, none of these values indicate a problem.
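For reference, smartctl treats an attribute as failed when its normalized VALUE (or, for past failures, WORST) is less than or equal to THRESH. A minimal sketch of that check applied to the table text, assuming the ten-column layout printed above; failing_attributes is a hypothetical name, not something smartmontools provides:

```python
def failing_attributes(table_text):
    """Return names of attributes whose VALUE or WORST is <= THRESH."""
    failed = []
    for line in table_text.splitlines():
        fields = line.split()
        # Data rows have at least 10 columns and start with a numeric ID;
        # the banner and header lines are skipped by this test.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name = fields[1]
        value, worst, thresh = (int(f) for f in fields[3:6])
        # A threshold of 000 can never be reached, so it is ignored.
        if thresh > 0 and min(value, worst) <= thresh:
            failed.append(name)
    return failed
```

On the table above this returns an empty list, consistent with the "PASSED" status. The raw counts for Reallocated_Sector_Ct, Current_Pending_Sector, and Offline_Uncorrectable are also all 0 in that output, so the attributes give no sign of the unreadable sectors that the error log reports.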

I am running periodic self-tests which don't seem to show any problems.

Code:
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       578         -
# 2  Extended offline    Completed without error       00%       508         -
# 3  Short offline       Completed without error       00%       507         -
# 4  Short offline       Completed without error       00%       483         -
# 5  Short offline       Completed without error       00%       460         -
# 6  Short offline       Completed without error       00%       436         -
# 7  Short offline       Completed without error       00%       413         -
# 8  Short offline       Completed without error       00%       390         -
# 9  Extended offline    Completed without error       00%       346         -
#10  Short offline       Completed without error       00%       344         -
#11  Short offline       Completed without error       00%       321         -
#12  Short offline       Completed without error       00%       297         -
#13  Short offline       Completed without error       00%       273         -
#14  Short offline       Completed without error       00%       250         -
#15  Short offline       Completed without error       00%       227         -
#16  Short offline       Completed without error       00%       226         -
#17  Short offline       Completed without error       00%       203         -
#18  Extended offline    Completed without error       00%       181         -
#19  Short offline       Completed without error       00%       180         -
#20  Short offline       Completed without error       00%       156         -
#21  Short offline       Completed without error       00%       133         -
Am I reading this correctly?
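As a cross-check on the error log, the LBA that smartctl prints can be recomputed from the register dump. A minimal sketch, assuming 28-bit LBA addressing (READ DMA, command c8, is a 28-bit command); lba28 is a hypothetical helper name:

```python
def lba28(sn, cl, ch, dh):
    """Combine ATA task-file registers into a 28-bit LBA.

    SN holds bits 0-7, CL bits 8-15, CH bits 16-23, and the low
    nibble of DH holds bits 24-27 (the high nibble carries mode and
    device-select flags, so it is masked off).
    """
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn

# Register values SN CL CH DH from the "After command completion" rows.
regs = [int(x, 16) for x in "b0 42 fd e1".split()]
lba = lba28(*regs)
print(hex(lba), lba)  # 0x1fd42b0 33374896
```

All five logged errors carry the same SN, CL, CH, and DH values, so they all point at LBA 33374896; only the sector count (SC) differs between them.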
 
  

