LinuxQuestions.org


tsg 03-18-2008 09:55 AM

Random Reboots - Slackware 10.2
 
I apologize if this has been asked before. I did search, but found nothing I haven't already tried. I am at my wits' end.

My mail server, running Slackware 10.2, keeps rebooting for reasons I can't discover. /var/log/syslog and /var/log/messages show nothing helpful. /var/log/debug shows only the following:

Code:

Mar 18 04:53:45 mail kernel: CPU:    After generic, caps: 3febf9ff 00000000 00000000 00000000
Mar 18 04:53:45 mail kernel: CPU:            Common caps: 3febf9ff 00000000 00000000 00000000
Mar 18 04:53:45 mail kernel: eth0:  Identified 8139 chip type 'RTL-8100B/8139D'
Mar 18 04:53:48 mail kernel: 00:0a.0: tulip_stop_rxtx() failed

I am logging sensors and uptime to /var/log/messages and neither shows any anomalies. The load average leading up to the reboot is minimal.

Code:

Mar 18 10:30:01 mail sensors: w83697hf-isa-0290
Mar 18 10:30:01 mail sensors: Adapter: ISA adapter
Mar 18 10:30:01 mail sensors: VCore:    +1.65 V  (min =  +1.62 V, max =  +1.78 V)
Mar 18 10:30:01 mail sensors: +3.3V:    +3.22 V  (min =  +3.14 V, max =  +3.46 V)
Mar 18 10:30:01 mail sensors: +5V:      +4.97 V  (min =  +4.74 V, max =  +5.24 V)
Mar 18 10:30:01 mail sensors: +12V:    +11.63 V  (min = +10.83 V, max = +13.19 V)
Mar 18 10:30:01 mail sensors: -12V:    -11.72 V  (min = -13.16 V, max = -10.90 V)
Mar 18 10:30:01 mail sensors: V5SB:      +5.46 V  (min =  +4.94 V, max =  +6.05 V)
Mar 18 10:30:01 mail sensors: VBat:      +3.14 V  (min =  +2.40 V, max =  +3.60 V)
Mar 18 10:30:01 mail sensors: CPUFan:  3341 RPM  (min = 2986 RPM, div = 4)
Mar 18 10:30:01 mail sensors: CPUTemp:  +40.5 C  (high =  +63 C, hyst =  +58 C)  sensor = diode          (beep)
Mar 18 10:30:01 mail sensors: alarms:
Mar 18 10:30:01 mail sensors: beep_enable:
Mar 18 10:30:01 mail sensors:          Sound alarm enabled
Mar 18 10:30:01 mail sensors:
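Rather than eyeballing every logged reading, the voltage lines can be checked against their own min/max limits automatically. This is only a minimal sketch, assuming the lines keep the exact layout shown above; the function name out_of_range and the regular expression are illustrative, not part of lm-sensors.

```python
import re

# Matches voltage lines in the layout logged above, e.g.
# "... sensors: +12V:    +11.63 V  (min = +10.83 V, max = +13.19 V)"
VOLT_RE = re.compile(
    r"sensors:\s+(?P<name>\S+):\s+(?P<value>[+-]?\d+\.\d+) V\s+"
    r"\(min =\s+(?P<min>[+-]?\d+\.\d+) V, max =\s+(?P<max>[+-]?\d+\.\d+) V\)"
)

def out_of_range(lines):
    """Return (name, value, min, max) for each voltage outside its limits."""
    bad = []
    for line in lines:
        m = VOLT_RE.search(line)
        if not m:
            continue  # fan, temperature, header or unrelated line
        value, lo, hi = (float(m.group(k)) for k in ("value", "min", "max"))
        if not lo <= value <= hi:
            bad.append((m.group("name"), value, lo, hi))
    return bad

# A few of the lines quoted above, used as sample input.
sample = [
    "Mar 18 10:30:01 mail sensors: VCore:    +1.65 V  (min =  +1.62 V, max =  +1.78 V)",
    "Mar 18 10:30:01 mail sensors: +12V:    +11.63 V  (min = +10.83 V, max = +13.19 V)",
    "Mar 18 10:30:01 mail sensors: -12V:    -11.72 V  (min = -13.16 V, max = -10.90 V)",
    "Mar 18 10:30:01 mail sensors: CPUFan:  3341 RPM  (min = 2986 RPM, div = 4)",
]

for name, value, lo, hi in out_of_range(sample):
    print(f"{name}: {value} V is outside [{lo}, {hi}]")
```

Feeding it the lines of /var/log/messages instead of the sample list would scan the whole history; with the readings quoted here it prints nothing, which matches what I see by hand.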

I have run memtest86[+] and cpuburn (both came back fine), and I have smartctl running on the hard drive with no errors.

The machine serves as my mail server, DNS server, and gateway to the internet. It's already running as bare-bones as I can make it, with a fairly restrictive firewall. There doesn't seem to be any particular pattern to the reboots (e.g., time of day, network traffic, etc.) that I can determine.

The only clue I have is that I can make it reboot by running zless /var/log/messages.1.gz and then searching for "Mar 18" (typing '/Mar 18' and hitting Enter). Obviously I'm not doing that at 4 am.

I have three other machines that have been up for 36 days, but this one reboots several times a day.

I'm leaning towards a hardware problem but I'm having trouble isolating it. The machine isn't terribly old (the reboots have only started in the past few months and don't coincide with any new software) and I'd rather not have to replace the entire thing if I can avoid it, especially if it turns out to be a software problem.

Any suggestions would be appreciated. If I forgot any information that might be helpful, let me know and I will post it.

Thank you in advance.

tsg 03-18-2008 11:49 AM

More info:

The error logs from smartctl show the following:

Code:

Error 269 occurred at disk power-on lifetime: 514 hours (21 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 32 b0 42 fd e1  Error: UNC 50 sectors at LBA = 0x01fd42b0 = 33374896

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 32 b0 42 fd e1 00      00:02:09.350  READ DMA
  c8 00 34 ae 42 fd e1 00      00:02:06.950  READ DMA
  c8 00 36 ac 42 fd e1 00      00:02:04.250  READ DMA
  c8 00 38 aa 42 fd e1 00      00:02:01.750  READ DMA
  c8 00 3a a8 42 fd e1 00      00:01:59.250  READ DMA

Error 268 occurred at disk power-on lifetime: 514 hours (21 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 34 b0 42 fd e1  Error: UNC 52 sectors at LBA = 0x01fd42b0 = 33374896

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 34 ae 42 fd e1 00      00:02:06.950  READ DMA
  c8 00 36 ac 42 fd e1 00      00:02:04.250  READ DMA
  c8 00 38 aa 42 fd e1 00      00:02:01.750  READ DMA
  c8 00 3a a8 42 fd e1 00      00:01:59.250  READ DMA
  c8 00 3c a6 42 fd e1 00      00:01:56.500  READ DMA

Error 267 occurred at disk power-on lifetime: 514 hours (21 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 36 b0 42 fd e1  Error: UNC 54 sectors at LBA = 0x01fd42b0 = 33374896

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 36 ac 42 fd e1 00      00:02:04.250  READ DMA
  c8 00 38 aa 42 fd e1 00      00:02:01.750  READ DMA
  c8 00 3a a8 42 fd e1 00      00:01:59.250  READ DMA
  c8 00 3c a6 42 fd e1 00      00:01:56.500  READ DMA
  c8 00 3e a4 42 fd e1 00      00:01:53.900  READ DMA

Error 266 occurred at disk power-on lifetime: 514 hours (21 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 38 b0 42 fd e1  Error: UNC 56 sectors at LBA = 0x01fd42b0 = 33374896

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 38 aa 42 fd e1 00      00:02:01.750  READ DMA
  c8 00 3a a8 42 fd e1 00      00:01:59.250  READ DMA
  c8 00 3c a6 42 fd e1 00      00:01:56.500  READ DMA
  c8 00 3e a4 42 fd e1 00      00:01:53.900  READ DMA
  c8 00 40 a2 42 fd e1 00      00:01:51.250  READ DMA

Error 265 occurred at disk power-on lifetime: 514 hours (21 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 3a b0 42 fd e1  Error: UNC 58 sectors at LBA = 0x01fd42b0 = 33374896

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 3a a8 42 fd e1 00      00:01:59.250  READ DMA
  c8 00 3c a6 42 fd e1 00      00:01:56.500  READ DMA
  c8 00 3e a4 42 fd e1 00      00:01:53.900  READ DMA
  c8 00 40 a2 42 fd e1 00      00:01:51.250  READ DMA
  c8 00 42 a0 42 fd e1 00      00:01:48.650  READ DMA

This makes me think the drive may be failing, or at least has a bad sector at the reported LBA. However, the smartctl -A command shows:

Code:

smartctl version 5.33 [i686-pc-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   200   165   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   118   095   021    Pre-fail  Always       -       1475
  4 Start_Stop_Count        0x0032   100   100   040    Old_age   Always       -       697
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   200   200   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   049   049   000    Old_age   Always       -       37725
 10 Spin_Retry_Count        0x0013   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0013   100   100   051    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       498
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0012   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0009   200   200   051    Pre-fail  Offline      -       0

and the health status reads "PASSED". As far as I can tell from the smartctl man page, none of these values indicates a problem.
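To make that reading less of a judgment call, the table can be checked mechanically. The sketch below rests on two assumptions: that an attribute counts as failed when its normalized VALUE is at or below THRESH (which is how I understand the man page), and that a non-zero raw count on attributes 5, 197 or 198 deserves a second look even while the attribute still passes. The helper names parse_attributes and suspicious are made up for this example.

```python
# Attribute IDs whose raw value counts sectors: reallocated, pending,
# offline-uncorrectable. Treating non-zero counts as suspicious is my
# own assumption, not something smartctl itself reports.
SECTOR_COUNTERS = {5, 197, 198}

def parse_attributes(text):
    """Parse rows of 'smartctl -A' output into dictionaries."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 10 and parts[0].isdigit():
            rows.append({
                "id": int(parts[0]),
                "name": parts[1],
                "value": int(parts[3]),
                "thresh": int(parts[5]),
                "when_failed": parts[8],
                "raw": parts[9],
            })
    return rows

def suspicious(rows):
    """Return (name, reason) for attributes that look problematic."""
    flagged = []
    for r in rows:
        if r["value"] <= r["thresh"] or r["when_failed"] != "-":
            flagged.append((r["name"], "at or below threshold"))
        elif r["id"] in SECTOR_COUNTERS and r["raw"] != "0":
            flagged.append((r["name"], "non-zero raw count"))
    return flagged

# A subset of the rows quoted above.
sample = """\
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate    0x000b  200  165  051    Pre-fail  Always      -      0
  5 Reallocated_Sector_Ct  0x0033  200  200  140    Pre-fail  Always      -      0
  9 Power_On_Hours          0x0032  049  049  000    Old_age  Always      -      37725
197 Current_Pending_Sector  0x0012  200  200  000    Old_age  Always      -      0
198 Offline_Uncorrectable  0x0012  200  200  000    Old_age  Always      -      0
"""

print(suspicious(parse_attributes(sample)))
```

With the rows quoted here nothing is flagged, so the script agrees with my reading of the attributes; it says nothing about the entries in the error log.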

I am running periodic self-tests which don't seem to show any problems.

Code:

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       578         -
# 2  Extended offline    Completed without error       00%       508         -
# 3  Short offline       Completed without error       00%       507         -
# 4  Short offline       Completed without error       00%       483         -
# 5  Short offline       Completed without error       00%       460         -
# 6  Short offline       Completed without error       00%       436         -
# 7  Short offline       Completed without error       00%       413         -
# 8  Short offline       Completed without error       00%       390         -
# 9  Extended offline    Completed without error       00%       346         -
#10  Short offline       Completed without error       00%       344         -
#11  Short offline       Completed without error       00%       321         -
#12  Short offline       Completed without error       00%       297         -
#13  Short offline       Completed without error       00%       273         -
#14  Short offline       Completed without error       00%       250         -
#15  Short offline       Completed without error       00%       227         -
#16  Short offline       Completed without error       00%       226         -
#17  Short offline       Completed without error       00%       203         -
#18  Extended offline    Completed without error       00%       181         -
#19  Short offline       Completed without error       00%       180         -
#20  Short offline       Completed without error       00%       156         -
#21  Short offline       Completed without error       00%       133         -
Am I reading this correctly?
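As a cross-check on the error log, the LBA that smartctl prints can be recomputed from the register dump. This is a minimal sketch, assuming 28-bit LBA addressing, where SN, CL and CH hold the low 24 bits and the low nibble of DH holds bits 24 to 27; the function name lba28_from_registers is just for illustration.

```python
def lba28_from_registers(sn, cl, ch, dh):
    """Combine ATA task-file registers into a 28-bit LBA.

    Assumes LBA28 layout: SN = bits 0-7, CL = bits 8-15, CH = bits 16-23,
    low nibble of DH = bits 24-27 (the high nibble of DH carries flags).
    """
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn

# Register values after the failed command, taken from the error entries:
#   ER ST SC SN CL CH DH
#   40 51 32 b0 42 fd e1
lba = lba28_from_registers(0xB0, 0x42, 0xFD, 0xE1)
print(hex(lba), lba)  # prints: 0x1fd42b0 33374896
```

All five logged errors carry the same SN/CL/CH/DH values, so they all decode to the same LBA, 33374896, which is the number smartctl prints on each "Error: UNC" line.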

