LinuxQuestions.org


lgtrean 06-21-2011 05:41 PM

How do I check my hard disk for errors? Possible hard disk failure
 
Hello,

I was using Terminal and browsing a directory in my home folder. My "home" directory is located on "/dev/sdb1".
When I typed "ls" in one of my directories, the output was garbage: it didn't show the files in the directory. I think it said something like "input/output error". Unfortunately, I didn't write down the exact error; I rebooted instead.

The hard disk with the problem is:
Code:

$ sudo hdparm -I /dev/sdb
[sudo] password for brian:

/dev/sdb:

ATA device, with non-removable media
        Model Number:      WDC WD5000KS-00MNB0                   
        Serial Number:      WD-WCANU1019633
        Firmware Revision:  07.02E07
Standards:
        Supported: 7 6 5 4
        Likely used: 8
Configuration:
        Logical                max        current
        cylinders        16383        16383
        heads                16        16
        sectors/track        63        63
        --
        CHS current addressable sectors:  16514064
        LBA    user addressable sectors:  268435455
        LBA48  user addressable sectors:  976773168
        Logical/Physical Sector size:          512 bytes
        device size with M = 1024*1024:      476940 MBytes
        device size with M = 1000*1000:      500107 MBytes (500 GB)
        cache/buffer size  = 16384 KBytes
Capabilities:
        LBA, IORDY(can be disabled)
        Queue depth: 32
        Standby timer values: spec'd by Standard, with device specific minimum
        R/W multiple sector transfer: Max = 16        Current = 8
        Recommended acoustic management value: 128, current value: 128
        DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
            Cycle time: min=120ns recommended=120ns
        PIO: pio0 pio1 pio2 pio3 pio4
            Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
        Enabled        Supported:
          *        SMART feature set
                    Security Mode feature set
          *        Power Management feature set
          *        Write cache
          *        Look-ahead
          *        Host Protected Area feature set
          *        WRITE_BUFFER command
          *        READ_BUFFER command
          *        NOP cmd
          *        DOWNLOAD_MICROCODE
                    Power-Up In Standby feature set
          *        SET_FEATURES required to spinup after power up
                    SET_MAX security extension
          *        Automatic Acoustic Management feature set
          *        48-bit Address feature set
          *        Device Configuration Overlay feature set
          *        Mandatory FLUSH_CACHE
          *        FLUSH_CACHE_EXT
          *        SMART error logging
          *        SMART self-test
          *        General Purpose Logging feature set
          *        WRITE_{DMA|MULTIPLE}_FUA_EXT
          *        64-bit World wide name
          *        Gen1 signaling speed (1.5Gb/s)
          *        Gen2 signaling speed (3.0Gb/s)
          *        Native Command Queueing (NCQ)
          *        Host-initiated interface power management
          *        Phy event counters
          *        DMA Setup Auto-Activate optimization
          *        Software settings preservation
          *        SMART Command Transport (SCT) feature set
          *        SCT Long Sector Access (AC1)
          *        SCT LBA Segment Access (AC2)
          *        SCT Error Recovery Control (AC3)
          *        SCT Features Control (AC4)
          *        SCT Data Tables (AC5)
                    unknown 206[12] (vendor specific)
Security:
        Master password revision code = 65534
                supported
        not        enabled
        not        locked
                frozen
        not        expired: security count
        not        supported: enhanced erase
        138min for SECURITY ERASE UNIT.
Logical Unit WWN Device Identifier: 50014ee20002257a
        NAA                : 5
        IEEE OUI        : 0014ee
        Unique ID        : 20002257a
Checksum: correct

uname output:
Code:

$ uname -r
2.6.32-5-amd64

lsb_release output:
Code:

$ lsb_release -a
No LSB modules are available.
Distributor ID:        Debian
Description:        Debian GNU/Linux 6.0.1 (squeeze)
Release:        6.0.1
Codename:        squeeze

During the reboot my computer was unable to mount my "home" directory located on "/dev/sdb1".
But, I was able to see my other devices. During the reboot I saw a message that said something like "fsck unable to resolve: 'UUID=0f24fae1-135c-4750-9928-4632e2f04f45'". That's the UUID of my "home" directory located on "/dev/sdb1".

fstab output:
Code:

$ cat /etc/fstab
# /etc/fstab: static file system information.
#
# Use 'blkid' to print the universally unique identifier for a
# device; this may be used with UUID= as a more robust way to name devices
# that works even if disks are added and removed. See fstab(5).
#
# <file system> <mount point>  <type>  <options>      <dump>  <pass>
proc            /proc          proc    defaults        0      0
# / was on /dev/sda2 during installation
UUID=f13fe524-b8ae-4f35-8831-9ba9e9db2dfa /              ext4    errors=remount-ro 0      1
# /home was on /dev/sdb1 during installation
UUID=0f24fae1-135c-4750-9928-4632e2f04f45 /home          ext4    defaults        0      2
# /wd500 was on /dev/sdc1 during installation
UUID=1fba7d0c-e82c-4837-b1bd-6192d7dd3f88 /wd500          ext4    rw,user,exec        0      2
# /wdgiga was on /dev/sdc3 during installation
UUID=3baa432d-3480-402d-8df3-1b90dbc5f655 /wdgiga        ext4    rw,user,exec        0      2
# /wdtera was on /dev/sdc2 during installation
UUID=9a762875-5e0d-4edf-8bdd-8aaaea6403d5 /wdtera        ext4    rw,user,exec        0      2
# /xtraSpace was on /dev/sda3 during installation
UUID=654d5c57-129b-4e06-92f1-673a8b4bcf56 /xtraSpace      ext4    defaults        0      2
# swap was on /dev/sda1 during installation
UUID=f7fc69af-d475-44f1-87dd-63eb8ca0b7ed none            swap    sw              0      0
/dev/scd1      /media/cdrom0  udf,iso9660 user,noauto    0      0
/dev/scd0      /media/cdrom1  udf,iso9660 user,noauto    0      0
#/dev/sdc1      /media/usb0    auto    rw,user,noauto  0      0
#/dev/sdc2      /media/usb1    auto    rw,user,noauto  0      0
#/dev/sdc3      /media/usb2    auto    rw,user,noauto  0      0
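
For what it's worth, an "unable to resolve UUID=..." message from fsck generally means that no block device with that UUID was visible at the time. A minimal sketch for comparing the UUIDs in fstab against the device links that udev normally creates under /dev/disk/by-uuid (the function name check_fstab_uuids is made up for illustration, and the by-uuid path is an assumption about the system):

```shell
# List every UUID= entry in an fstab file and report whether a matching
# device link currently exists. Both paths can be overridden.
check_fstab_uuids() {
    fstab=${1:-/etc/fstab}
    byuuid=${2:-/dev/disk/by-uuid}
    # Keep only lines whose first field starts with UUID=, print "uuid mountpoint".
    awk '$1 ~ /^UUID=/ { sub(/^UUID=/, "", $1); print $1, $2 }' "$fstab" |
    while read -r uuid mnt; do
        if [ -e "$byuuid/$uuid" ]; then
            echo "present $mnt $uuid"
        else
            echo "MISSING $mnt $uuid"
        fi
    done
}
```

Running `check_fstab_uuids` with no arguments uses the default paths; a "MISSING /home ..." line would match the boot-time symptom described here. `blkid` (mentioned in the fstab header) prints the same UUIDs from the device side.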


I was able to boot but had no home directory (all my stuff was backed up). I decided to reboot using the SystemRescueCd (www.sysresccd.org) and ran "FSArchiver: Filesystem Archiver for Linux". You can see from the output that it didn't see the partition holding my "home" directory (normally "/dev/sdb1").

The output is below:
Code:

=====================>>> fsarchiver probe simple <<<=====================
[======DISK======] [=============NAME==============] [====SIZE====] [MAJ] [MIN]
[sda            ] [WDC WD800JD-75MS              ] [    74.51 GB] [  8] [  0]
[sdb            ] [My Book 1130                  ] [    1.82 TB] [  8] [ 16]

[=====DEVICE=====] [==FILESYS==] [======LABEL======] [====SIZE====] [MAJ] [MIN]
[loop0          ] [squashfs  ] [<unknown>        ] [  265.55 MB] [  7] [  0]
[sda1            ] [swap      ] [<unknown>        ] [    1.53 GB] [  8] [  1]
[sda2            ] [ext4      ] [<unknown>        ] [    14.63 GB] [  8] [  2]
[sda3            ] [ext4      ] [<unknown>        ] [    58.35 GB] [  8] [  3]
[sdb1            ] [ext4      ] [wd500            ] [  499.37 GB] [  8] [ 17]
[sdb2            ] [ext4      ] [<unknown>        ] [    1.33 TB] [  8] [ 18]
[sdb3            ] [ext4      ] [<unknown>        ] [    1.00 GB] [  8] [ 19]

I also ran GParted, but it didn't list the partition either. I figured the disk holding my "home" directory (normally "/dev/sdb1") was dead, so I bought a replacement hard drive.
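
One detail in the fsarchiver listing: under the rescue CD, "sdb" is the My Book enclosure, so the drive letters had shifted and the WD5000KS apparently wasn't detected at all. Because letters can move like this, it may be easier to look for a disk by its serial number. A rough sketch, assuming the usual /dev/disk/by-id links whose names contain the model and serial (disk_present is a hypothetical helper name):

```shell
# Report whether any /dev/disk/by-id link contains the given serial number.
disk_present() {
    serial=$1
    byid=${2:-/dev/disk/by-id}
    for link in "$byid"/*"$serial"*; do
        # If the glob matched nothing, $link is the literal pattern and -e fails.
        if [ -e "$link" ]; then
            echo "found: ${link##*/}"
            return 0
        fi
    done
    echo "not found: $serial"
    return 1
}
```

For example, `disk_present WD-WCANU1019633` uses the serial number from the hdparm output earlier in this post.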

Then, I rebooted without using the SystemRescueCd and I saw this message scroll by, "/home: recovering journal".
I also saw that message when I looked in /var/log/fsck/checkfs:
Code:

$ cat /var/log/fsck/checkfs
Log of fsck -C -R -A -a
Tue Jun 21 15:51:12 2011

fsck from util-linux-ng 2.17.2
/dev/sda3: clean, 205/3825664 files, 15093148/15295744 blocks
wd500: clean, 201315/32727040 files, 118986212/130905644 blocks
/home: recovering journal
/dev/sdc3: clean, 12/65808 files, 12660/263064 blocks
/dev/sdc2: clean, 241520/89300992 files, 275299582/357201258 blocks
/home: Clearing orphaned inode 38781677 (uid=1000, gid=1000, mode=0100644, size=32768)
/home: Clearing orphaned inode 38780933 (uid=1000, gid=1000, mode=0100600, size=77192)
/home: clean, 202121/61063168 files, 119022216/122096000 blocks

Tue Jun 21 15:52:02 2011
----------------

And when I logged in, my "home" directory on "/dev/sdb1" was alive again. Here's the current output of my disk space usage:
Code:


$ df -H
Filesystem            Size  Used  Avail Use% Mounted on
/dev/sda2              16G    12G  3.3G  79% /
tmpfs                  1.9G      0  1.9G  0% /lib/init/rw
udev                  1.9G  246k  1.9G  1% /dev
tmpfs                  1.9G  4.1k  1.9G  1% /dev/shm
/dev/sdb1              493G  472G    21G  96% /home
/dev/sdc1              528G  471G    57G  90% /wd500
/dev/sdc3              1.1G    35M  1.1G  4% /wdgiga
/dev/sdc2              1.5T  1.2T  336G  77% /wdtera
/dev/sda3              62G    61G  830M  99% /xtraSpace


Below is a heavily truncated excerpt from /var/log/messages that may refer to the device that had problems ("/home" on "/dev/sdb1"). I don't know if it will be useful:
Code:

Jun 21 12:03:12 kub nagios3: Auto-save of retention data completed successfully.
Jun 21 12:31:24 kub kernel: [59330.816096] ata4: hard resetting link
Jun 21 12:31:29 kub kernel: [59336.180017] ata4: link is slow to respond, please be patient (ready=0)
Jun 21 12:31:34 kub kernel: [59340.828034] ata4: hard resetting link
Jun 21 12:31:39 kub kernel: [59346.188034] ata4: link is slow to respond, please be patient (ready=0)
Jun 21 12:31:44 kub kernel: [59350.836041] ata4: hard resetting link
Jun 21 12:31:49 kub kernel: [59356.196016] ata4: link is slow to respond, please be patient (ready=0)
Jun 21 12:32:19 kub kernel: [59385.876034] ata4: limiting SATA link speed to 1.5 Gbps
Jun 21 12:32:19 kub kernel: [59385.876039] ata4: hard resetting link
Jun 21 12:32:24 kub kernel: [59390.900032] ata4.00: disabled
Jun 21 12:32:24 kub kernel: [59390.900040] ata4.00: device reported invalid CHS sector 0
Jun 21 12:32:24 kub kernel: [59390.900044] ata4.00: device reported invalid CHS sector 0
Jun 21 12:32:24 kub kernel: [59390.900062] ata4: EH complete
Jun 21 12:32:24 kub kernel: [59390.900090] sd 3:0:0:0: [sdb] Unhandled error code
Jun 21 12:32:24 kub kernel: [59390.900093] sd 3:0:0:0: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jun 21 12:32:24 kub kernel: [59390.900099] sd 3:0:0:0: [sdb] CDB: Read(10): 28 00 26 c8 48 1f 00 01 00 00
Jun 21 12:32:24 kub kernel: [59390.900139] sd 3:0:0:0: [sdb] Unhandled error code
Jun 21 12:32:24 kub kernel: [59390.900142] sd 3:0:0:0: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jun 21 12:32:24 kub kernel: [59390.900146] sd 3:0:0:0: [sdb] CDB: Read(10): 28 00 26 c8 47 1f 00 01 00 00
Jun 21 12:32:24 kub kernel: [59390.998653] sd 3:0:0:0: [sdb] Unhandled error code
Jun 21 12:32:24 kub kernel: [59390.998659] sd 3:0:0:0: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jun 21 12:32:24 kub kernel: [59390.998665] sd 3:0:0:0: [sdb] CDB: Read(10): 28 00 0a a2 13 2f 00 00 08 00
Jun 21 12:32:30 kub kernel: [59396.804145] sd 3:0:0:0: [sdb] Unhandled error code
Jun 21 12:32:30 kub kernel: [59396.804151] sd 3:0:0:0: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jun 21 12:32:30 kub kernel: [59396.804157] sd 3:0:0:0: [sdb] CDB: Write(10): 2a 00 11 f2 46 bf 00 00 08 00
Jun 21 12:32:30 kub kernel: [59396.804181] lost page write due to I/O error on sdb1
Jun 21 12:32:30 kub kernel: [59396.804201] JBD2: Detected IO errors while flushing file data on sdb1-8
Jun 21 12:32:30 kub kernel: [59396.804213] sd 3:0:0:0: [sdb] Unhandled error code
Jun 21 12:32:30 kub kernel: [59396.804337] sd 3:0:0:0: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jun 21 12:32:30 kub kernel: [59396.804342] sd 3:0:0:0: [sdb] CDB: Write(10): 2a 00 00 00 31 77 00 00 08 00
Jun 21 12:32:30 kub kernel: [59396.804363] lost page write due to I/O error on sdb1
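
As a reading aid for the CDB lines above: in the 10-byte READ/WRITE commands, bytes 2 to 5 hold the starting LBA and bytes 7 to 8 the block count, both big-endian. A small sketch that decodes one of those byte strings (decode_cdb10 is a made-up helper name). Since the errors appear only after "ata4.00: disabled", these addresses probably show what was in flight when the device dropped out, not necessarily where the media is damaged.

```shell
# Decode the LBA and transfer length from a 10-byte SCSI CDB given as
# a single quoted string of space-separated hex bytes.
decode_cdb10() {
    # Intentionally unquoted: split the string into $1..$10.
    set -- $1
    # $3..$6 are CDB bytes 2-5 (LBA), $8 and $9 are bytes 7-8 (block count).
    echo "lba=$(( 0x$3$4$5$6 )) blocks=$(( 0x$8$9 ))"
}

decode_cdb10 "28 00 26 c8 48 1f 00 01 00 00"   # prints: lba=650659871 blocks=256
```

That LBA is below the 976773168 user-addressable sectors reported by hdparm, so the request itself was in range.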

My question is: what should I do now? How do I check my hard disk for errors using Linux?

Thank you for your advice.

TobiSGD 06-21-2011 05:44 PM

Download the manufacturer's diagnostic tool, burn it to a CD, and test the drive.

John VV 06-21-2011 06:25 PM

a western digital drive
the only problem is there tool is MS Windows ONLY
install win7 then use there tool to test .

"/." and "ars tech" and i think "phoronix" had news on that a while back .

TobiSGD 06-21-2011 06:30 PM

Quote:

Originally Posted by John VV (Post 4392215)
the only problem is there tool is MS Windows ONLY

Wrong, you can get the bootable ISO with a DOS-based test utility here.
No need for Windows at all.

Soadyheid 06-22-2011 08:15 AM

@ JohnVV
Quote:

the only problem is there tool is MS Windows ONLY
install win7 then use there tool to test .
Sorry John, I don't mean to be cheeky or rude... Text speak causes problems in understanding posts, so does using the wrong word. :(
Code:

there = over there, their = belonging to them, they're = 'they are'
Play Bonny! :hattip:

H_TeXMeX_H 06-22-2011 09:11 AM

You can either run the manufacturer's utility:
http://www.ultimatebootcd.com/

or you can run 'smartctl -t long /dev/sdb', wait for it to finish, then check the result with 'smartctl -l selftest /dev/sdb'.

Personally, if I saw errors like that, had never run SMART tests, and the drive was very new or very old, I would back up the data ASAP.
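
A rough sketch of how the SMART check could be scripted. The smartctl options in the comments (-t long, -l selftest, -A) are standard smartmontools flags; the helper smart_suspects and its choice of attributes are only an illustration, and attribute names can vary between drive models.

```shell
# Read `smartctl -A` output on stdin and print a few attributes that are
# commonly worth a look when their raw value (10th column) is non-zero.
smart_suspects() {
    awk '
        $2 == "Reallocated_Sector_Ct"  ||
        $2 == "Current_Pending_Sector" ||
        $2 == "Offline_Uncorrectable"  ||
        $2 == "UDMA_CRC_Error_Count" {
            if ($10 + 0 > 0) print $2 "=" $10
        }
    '
}

# Typical use, as root (device name taken from this thread):
#   smartctl -t long /dev/sdb        # start the extended self-test
#   smartctl -l selftest /dev/sdb    # read the self-test log once it has finished
#   smartctl -A /dev/sdb | smart_suspects
```

No output from smart_suspects only means those particular counters are zero; it is not a guarantee that the drive is healthy.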

lgtrean 06-22-2011 04:06 PM

Thank you everyone.

I downloaded the Data Lifeguard Diagnostic for DOS (CD) from wdc.com and ran all the tests. The disk passed the tests.

I wonder what caused the problem to begin with.
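
One thing the /var/log/messages excerpt does show is several "hard resetting link" messages on ata4 just before sdb was disabled. When a disk passes the vendor diagnostics, things outside the disk itself (SATA cable, power, controller port) are often worth ruling out; that is a general assumption and nothing in this thread confirms it. A small sketch for counting such resets per ATA port to see whether they recur (count_link_resets is a hypothetical helper name):

```shell
# Read kernel log lines on stdin and print "<port> <count>" for every
# ATA port that logged "hard resetting link".
count_link_resets() {
    sed -n 's/.*\(ata[0-9][0-9]*\): hard resetting link.*/\1/p' |
    sort | uniq -c | awk '{ print $2, $1 }'
}

# Example: count_link_resets < /var/log/messages
```

A count that keeps growing over the following days would suggest the problem is still there even though the diagnostics passed.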


All times are GMT -5.