LinuxQuestions.org
Old 01-03-2008, 04:20 AM   #1
bioe007
Member
 
Registered: Apr 2006
Location: lynnwood, wa - usa
Distribution: archlinux
Posts: 654

Rep: Reputation: 30
Is my HD dying? smartctl test failed?


box: p3 500MHz, slackware11, linux2.6.21.3

I noticed this while SSH'd into my server:

dmesg | grep hda
Code:
ide0: BM-DMA at 0xffa0-0xffa7, BIOS settings: hda:DMA, hdb:DMA
hda: FUJITSU MPE3173AE, ATA DISK drive
hda: max request size: 128KiB
hda: 33867188 sectors (17340 MB) w/512KiB Cache, CHS=33598/16/63, UDMA(33)
hda: cache flushes not supported
 hda: hda1 hda2 hda3 hda4 < hda5 hda6 >
ReiserFS: hda2: found reiserfs format "3.6" with standard journal
ReiserFS: hda2: using ordered data mode
ReiserFS: hda2: journal params: device hda2, size 8192, journal first block 18, max trans len 1024, max batch 900, max commit age 30, max trans age 30
ReiserFS: hda2: checking transaction log (hda2)
ReiserFS: hda2: Using r5 hash to sort names
Adding 730948k swap on /dev/hda3.  Priority:-1 extents:1 across:730948k
ReiserFS: hda5: found reiserfs format "3.6" with standard journal
ReiserFS: hda5: using ordered data mode
ReiserFS: hda5: journal params: device hda5, size 8192, journal first block 18, max trans len 1024, max batch 900, max commit age 30, max trans age 30
ReiserFS: hda5: checking transaction log (hda5)
ReiserFS: hda5: Using r5 hash to sort names
ReiserFS: hda6: found reiserfs format "3.6" with standard journal
ReiserFS: hda6: using ordered data mode
ReiserFS: hda6: journal params: device hda6, size 8192, journal first block 18, max trans len 1024, max batch 900, max commit age 30, max trans age 30
ReiserFS: hda6: checking transaction log (hda6)
ReiserFS: hda6: Using r5 hash to sort names
hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hda: dma_intr: error=0x04 { DriveStatusError }
hda: set_drive_speed_status: status=0x51 { DriveReady SeekComplete Error }
hda: set_drive_speed_status: error=0xd0 { BadSector UncorrectableError SectorIdNotFound }, LBAsect=13684944, sector=28734331
hda: CHECK for good STATUS

So it seemed prudent to check the drive with smartctl, which I know almost nothing about. This is as far as I have gotten, and it doesn't look good to me:
Code:
 smartctl --attributes --log=selftest --quietmode=errorsonly /dev/hda
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Seconds        0x0012   016   016   020    Old_age   Always   FAILING_NOW 12614h+09m+19s

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       40%     12613         348977
and I guess the 'longer' version?
Code:
smartctl --all /dev/hda
smartctl version 5.36 [i486-slackware-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family:     Fujitsu MPD and MPE series
Device Model:     FUJITSU MPE3173AE
Serial Number:    05005503
Firmware Version: EE-C0-23
User Capacity:    17,340,000,256 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   4
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Thu Jan  3 17:27:18 2008 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 116) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline 
data collection:                 ( 300) seconds.
Offline data collection
capabilities:                    (0x1b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0002) Does not save SMART data before
                                        entering power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        No General Purpose Logging support.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  16) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   099   099   032    Pre-fail  Always       -       110486
  2 Throughput_Performance  0x0005   100   100   020    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0007   093   090   025    Pre-fail  Always       -       2
  4 Start_Stop_Count        0x0012   096   096   016    Old_age   Always       -       2502
  5 Reallocated_Sector_Ct   0x0033   100   100   024    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   020    Pre-fail  Always       -       1831
  8 Seek_Time_Performance   0x0005   100   100   019    Pre-fail  Offline      -       0
  9 Power_On_Seconds        0x0012   016   016   020    Old_age   Always   FAILING_NOW 12614h+07m+39s
 10 Spin_Retry_Count        0x0013   100   100   020    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   084   084   020    Old_age   Always       -       2407
196 Reallocated_Event_Count 0x0033   100   100   024    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0010   100   100   020    Old_age   Offline      -       0
198 Offline_Uncorrectable   0x0010   100   100   020    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   197    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x000b   100   097   020    Pre-fail  Always       -       2233

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       40%     12613         348977

Device does not support Selective Self Tests/Logging

and hdparm
Code:
hdparm -i /dev/hda

/dev/hda:

 Model=FUJITSU MPE3173AE, FwRev=EE-C0-23, SerialNo=05005503
 Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs }
 RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=4
 BuffType=unknown, BuffSize=512kB, MaxMultSect=16, MultSect=16
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=33867188
 IORDY=yes, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes:  pio0 pio1 pio2 pio3 pio4 
 DMA modes:  mdma0 mdma1 mdma2 
 UDMA modes: udma0 udma1 *udma2 
 AdvancedPM=yes: disabled (255) WriteCache=enabled
 Drive conforms to: Unspecified:  ATA/ATAPI-1 ATA/ATAPI-2 ATA/ATAPI-3 ATA/ATAPI-4

 * signifies the current active mode
so in short, should I be out HD shopping now?
 
Old 01-03-2008, 06:01 AM   #2
aus9
LQ 5k Club
 
Registered: Oct 2003
Location: Australia
Distribution: IceWM on Debian
Posts: 5,488

Rep: Reputation: Disabled
http://www.fujitsu.com/us/services/c...utilities.html

Grab the DOS zip file and then create a bootable disk, such as one from http://bootdisk.com/. If you do not have a floppy drive, make a bootable CD on the same principle.
 
Old 01-03-2008, 11:13 AM   #3
bioe007
Member
 
Registered: Apr 2006
Location: lynnwood, wa - usa
Distribution: archlinux
Posts: 654

Original Poster
Rep: Reputation: 30
Thanks for that, but this box is headless: no monitor or keyboard, so I certainly can't boot a DOS disk on it.
 
Old 01-03-2008, 01:32 PM   #4
H_TeXMeX_H
LQ Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Rep: Reputation: 1292
To me it looks like it is failing, at least according to SMART, which actually is quite reliable at predicting HDD failure.

I say backup the data ASAP and be prepared for the worst.
 
Old 01-03-2008, 11:48 PM   #5
bioe007
Member
 
Registered: Apr 2006
Location: lynnwood, wa - usa
Distribution: archlinux
Posts: 654

Original Poster
Rep: Reputation: 30
That's good enough for me, thanks.

An additional question: I've copied /dev/hda* (where /, /boot, and /usr were on separate partitions) to /dev/hdb1.

Then I unmounted hdb1, ran cfdisk to toggle hdb1 bootable, remounted hdb1 and chrooted into it, edited lilo.conf, and wrote lilo to hdb's MBR.

I plan (for the time being) on letting the box run until it dies or I get a new HD.

Now I'm worrying about fstab. Should I change the fstab on /dev/hdb1 to say that root is on /dev/hda1, or can I expect lilo to boot from /dev/hdb correctly if /dev/hda fails entirely?

Last edited by bioe007; 01-04-2008 at 12:45 AM.
 
Old 01-04-2008, 05:43 AM   #6
aus9
LQ 5k Club
 
Registered: Oct 2003
Location: Australia
Distribution: IceWM on Debian
Posts: 5,488

Rep: Reputation: Disabled
It depends on the BIOS. If the BIOS will still boot from an IDE drive jumpered as slave when no other bootable device is found, then leave the jumper and fstab as they are.

But in the long run, you are better off removing the dead drive, setting the working drive to master, and using a live CD to edit fstab to the hda settings.

having a live cd is always good, I am not sure if zenwalk has a "recovery" mode or not.

A small live CD to download is RIP, but you can find lots of live CDs at www.distrowatch.com
 
Old 01-04-2008, 09:01 AM   #7
H_TeXMeX_H
LQ Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Rep: Reputation: 1292
Actually, can you try running a 'long' test on it; it'll tell you more accurately whether it will fail soon or not.

The SMART test actually said:
Code:
SMART overall-health self-assessment test result: PASSED
This means the drive does not predict imminent failure from its current attribute data. Yet you have errors from a self-test, so you should probably run a long test and see if more errors appear. Really, you should have been running tests regularly, say once every few months; that gives a more accurate picture of the drive's status.
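
For reference, a minimal sketch of starting the long test and reading the log afterwards. The wrapper function names and the SMARTCTL override are made up for illustration; `--test=long` and `--log=selftest` are standard smartctl options, and both need root on a real drive.

```shell
#!/bin/sh
# Sketch only: the wrapper names and the SMARTCTL override are illustrative assumptions.
# Set SMARTCTL=echo to print the arguments instead of touching a drive.
SMARTCTL=${SMARTCTL:-smartctl}

# Start the extended ("long") self-test; the drive runs it in the background.
start_long_test() {
    "$SMARTCTL" --test=long "$1"
}

# Print the self-test log once the recommended polling time has passed.
show_selftest_log() {
    "$SMARTCTL" --log=selftest "$1"
}
```

On this drive that would be `start_long_test /dev/hda`, a wait of roughly the 16 minutes that smartctl lists as the extended polling time, then `show_selftest_log /dev/hda`.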

So, don't actually go and buy a new drive, yet.

As for lilo.conf, I would add two sections in there, one booting off of each disk. And for fstab, try to keep the drives not dependent on one another.
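
One rough way to check the fstab side of that is to list which entries in a given fstab still point at a particular drive. This is only a sketch: `fstab_entries_on` is a made-up helper name, and the /mnt/hdb1 mount point in the usage note is an assumption.

```shell
#!/bin/sh
# Sketch only: fstab_entries_on is an illustrative helper, not a standard tool.
# Reads fstab-formatted text on stdin and prints "device mountpoint" for every
# non-comment entry whose device name starts with the given prefix.
fstab_entries_on() {
    awk -v dev="$1" '$1 !~ /^#/ && index($1, dev) == 1 { print $1, $2 }'
}
```

For example, with the copy mounted at a hypothetical /mnt/hdb1, `fstab_entries_on /dev/hda < /mnt/hdb1/etc/fstab` lists every mount that would break if hda disappeared.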

EDIT: Also look at the man page for smartctl for more on the meaning of all this stuff:
Quote:
The Attribute table printed out by smartctl also shows the
"TYPE" of the Attribute. Attributes are one of two possible
types: Pre-failure or Old age. Pre-failure Attributes are ones
which, if less than or equal to their threshold values, indicate
pending disk failure. Old age, or usage Attributes, are ones
which indicate end-of-product life from old-age or normal aging
and wearout, if the Attribute value is less than or equal to the
threshold. Please note: the fact that an Attribute is of type
'Pre-fail' does not mean that your disk is about to fail! It
only has this meaning if the Attribute's current Normalized
value is less than or equal to the threshold value.

If the Attribute's current Normalized value is less than or
equal to the threshold value, then the "WHEN_FAILED" column will
display "FAILING_NOW". If not, but the worst recorded value is
less than or equal to the threshold value, then this column will
display "In_the_past". If the "WHEN_FAILED" column has no entry
(indicated by a dash: '-') then this Attribute is OK now (not
failing) and has also never failed in the past.

The table column labeled "UPDATED" shows if the SMART Attribute
values are updated during both normal operation and off-line
testing, or only during offline testing. The former are labeled
"Always" and the latter are labeled "Offline".
So in your case
Code:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Seconds        0x0012   016   016   020    Old_age   Always   FAILING_NOW 12614h+09m+19s
This means the drive is getting really old and is reaching the end of its life. Not necessarily that it will fail, but that its days of proper functioning are numbered.
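
The WHEN_FAILED rule from the man page excerpt can be written out as a small sketch. `when_failed` is an illustrative name, and this only mirrors the rule as quoted; it is not smartctl's actual code.

```shell
#!/bin/sh
# Sketch of the quoted man page rule; when_failed is an illustrative name.
# Arguments: VALUE WORST THRESH, as printed in the attribute table.
when_failed() {
    awk -v v="$1" -v w="$2" -v t="$3" 'BEGIN {
        v += 0; w += 0; t += 0          # force decimal numbers ("016" -> 16)
        if (v <= t)      print "FAILING_NOW"
        else if (w <= t) print "In_the_past"
        else             print "-"
    }'
}
```

With the Power_On_Seconds row, `when_failed 016 016 020` gives FAILING_NOW because 16 is at or below the threshold of 20, while the Multi_Zone_Error_Rate row (`when_failed 100 097 020`) gives a dash.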

Last edited by H_TeXMeX_H; 01-04-2008 at 10:04 AM.
 
Old 01-04-2008, 11:19 AM   #8
bioe007
Member
 
Registered: Apr 2006
Location: lynnwood, wa - usa
Distribution: archlinux
Posts: 654

Original Poster
Rep: Reputation: 30
Quote:
Originally Posted by aus9
having a live cd is always good, I am not sure if zenwalk has a "recovery" mode or not.
FWIW, this box has Slackware on it, and again the headless thing makes it difficult to run a live CD or recovery mode. My personal 'recovery' mode is just to boot a Slack install disk and then chroot into the actual OS, which is difficult without a keyboard and monitor.

Quote:
Originally Posted by H_TeXMeX_H
Actually, can you try running a 'long' test on it; it'll tell you more accurately whether it will fail soon or not.
I did run a long test, same failure.

Code:
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       40%     12614         348976
# 2  Short offline       Completed: read failure       40%     12614         348976
# 3  Short offline       Completed: read failure       40%     12613         348977
I'm still confused by the `Remaining 40%`


Quote:
Originally Posted by H_TeXMeX_H
As for lilo.conf, I would add two sections in there, one booting off of each disk. And for fstab, try to keep the drives not dependent on one another.
Done, but I'm not sure why. It seems to me that if the HD fails, the BIOS won't read its MBR? Besides which, I won't be able to select an alternate lilo entry anyway (headless).

Quote:
Originally Posted by H_TeXMeX_H
This means the drive is getting really old and is reaching the end of its life. Not necessarily that it will fail, but that its days of proper functioning are numbered.
Yes, I should've been more vigilant in running tests - but to be clear, this is what really has me concerned:

Code:
root@perrys:~# dmesg | grep dma
hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hda: dma_intr: error=0x04 { DriveStatusError }
hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hda: dma_intr: error=0x04 { DriveStatusError }
I have not seen that before. Also, my old `hdparm` settings will no longer apply correctly to /dev/hda.

I checked `hdparm -d /dev/hda` and it reported that DMA was not in use, so I enabled it. That worked according to hdparm, but nothing was reported in dmesg, which seems counterintuitive.

Anyway, thanks for your help and suggestions. This is becoming more of a curiosity and learning experience now. I'm sure I'll pick up a new HD soon (no reason really to hold on to a ten-year-old 17GB HD anyway).
 
Old 01-04-2008, 03:32 PM   #9
H_TeXMeX_H
LQ Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Rep: Reputation: 1292
Quote:
Done, but I'm not sure why. It seems to me that if the HD fails, the BIOS won't read its MBR? Besides which, I won't be able to select an alternate lilo entry anyway (headless).
Hmmmm, yes, that is a problem, isn't it? And I suppose installing to the MBR of the second HDD won't work? I think it depends on the BIOS, as said before: some allow it, some don't.

Quote:
(no reason really to hold on to a ten-year-old 17GB HD anyway)
Yes, I agree. 17 GB is nothing nowadays and it's a ten-year-old drive. I think the error given by SMART is believable: the drive has reached the end of its life.

Quote:
I'm still confused by the `Remaining 40%`
That means the test failed with 40% to go. It only went through 60% of the test before failing.
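
To make that arithmetic explicit, here is a small sketch that reads self-test log lines and reports how far each test got. `selftest_progress` is a made-up helper name, and the parsing assumes the column layout shown in this thread.

```shell
#!/bin/sh
# Sketch only: selftest_progress is an illustrative helper.
# Reads smartctl self-test log lines on stdin. For each entry (lines starting
# with "#") it finds the "Remaining" percentage, prints how much of the test
# completed (100 minus remaining) and the LBA of the first error (last column).
selftest_progress() {
    awk '/^#/ {
        for (i = 1; i <= NF; i++)
            if ($i ~ /^[0-9]+%$/) {
                printf "completed %d%%, first error at LBA %s\n", 100 - $i, $NF
                break
            }
    }'
}
```

Feeding it the "Extended offline" line from the log above prints `completed 60%, first error at LBA 348976`.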
 
Old 01-04-2008, 05:40 PM   #10
dasy2k1
Member
 
Registered: Oct 2005
Location: 127.0.0.1
Distribution: Ubuntu 12.04 X86_64
Posts: 960

Rep: Reputation: 35
try running spinrite on it
 
Old 01-04-2008, 11:41 PM   #11
bioe007
Member
 
Registered: Apr 2006
Location: lynnwood, wa - usa
Distribution: archlinux
Posts: 654

Original Poster
Rep: Reputation: 30
Quote:
Originally Posted by H_TeXMeX_H
And I suppose installing to the MBR of the second HDD won't work?
I did install to /dev/hdb, but I can't get into the BIOS in a headless setup. (Crossing fingers that it will just work 8) )

Quote:
That means the test failed with 40% to go.
doh.. blush...

Quote:
Originally Posted by dasy2k1
try running spinrite on it
unless I'm missing something:

spinrite=89usd

100GB HD=60usd

?
 
  

