LinuxQuestions.org - How do I check my hard disk for errors? Possible hard disk failure

Hello,

I was browsing a directory under my home folder in a terminal. My "home" directory is located on "/dev/sdb1".
When I ran "ls" in one of my directories, the output was garbage and did not list the files. I think it said something like "input/output error". Unfortunately, I didn't write down the exact error before I rebooted.

The hard disk with the problem is:

Code:

$ sudo hdparm -I /dev/sdb
[sudo] password for brian:

/dev/sdb:

ATA device, with non-removable media
        Model Number:       WDC WD5000KS-00MNB0
        Serial Number:      WD-WCANU1019633
        Firmware Revision:  07.02E07
Standards:
        Supported: 7 6 5 4
        Likely used: 8
Configuration:
        Logical         max     current
        cylinders       16383   16383
        heads           16      16
        sectors/track   63      63
        --
        CHS current addressable sectors:   16514064
        LBA    user addressable sectors:  268435455
        LBA48  user addressable sectors:  976773168
        Logical/Physical Sector size:           512 bytes
        device size with M = 1024*1024:      476940 MBytes
        device size with M = 1000*1000:      500107 MBytes (500 GB)
        cache/buffer size  = 16384 KBytes
Capabilities:
        LBA, IORDY(can be disabled)
        Queue depth: 32
        Standby timer values: spec'd by Standard, with device specific minimum
        R/W multiple sector transfer: Max = 16  Current = 8
        Recommended acoustic management value: 128, current value: 128
        DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
             Cycle time: min=120ns recommended=120ns
        PIO: pio0 pio1 pio2 pio3 pio4
             Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
        Enabled Supported:
           *    SMART feature set
                Security Mode feature set
           *    Power Management feature set
           *    Write cache
           *    Look-ahead
           *    Host Protected Area feature set
           *    WRITE_BUFFER command
           *    READ_BUFFER command
           *    NOP cmd
           *    DOWNLOAD_MICROCODE
                Power-Up In Standby feature set
           *    SET_FEATURES required to spinup after power up
                SET_MAX security extension
           *    Automatic Acoustic Management feature set
           *    48-bit Address feature set
           *    Device Configuration Overlay feature set
           *    Mandatory FLUSH_CACHE
           *    FLUSH_CACHE_EXT
           *    SMART error logging
           *    SMART self-test
           *    General Purpose Logging feature set
           *    WRITE_{DMA|MULTIPLE}_FUA_EXT
           *    64-bit World wide name
           *    Gen1 signaling speed (1.5Gb/s)
           *    Gen2 signaling speed (3.0Gb/s)
           *    Native Command Queueing (NCQ)
           *    Host-initiated interface power management
           *    Phy event counters
           *    DMA Setup Auto-Activate optimization
           *    Software settings preservation
           *    SMART Command Transport (SCT) feature set
           *    SCT Long Sector Access (AC1)
           *    SCT LBA Segment Access (AC2)
           *    SCT Error Recovery Control (AC3)
           *    SCT Features Control (AC4)
           *    SCT Data Tables (AC5)
                unknown 206[12] (vendor specific)
Security:
        Master password revision code = 65534
                supported
        not     enabled
        not     locked
                frozen
        not     expired: security count
        not     supported: enhanced erase
        138min for SECURITY ERASE UNIT.
Logical Unit WWN Device Identifier: 50014ee20002257a
        NAA             : 5
        IEEE OUI        : 0014ee
        Unique ID       : 20002257a
Checksum: correct

uname output:

Code:

$ uname -r
2.6.32-5-amd64

lsb_release output:

Code:

$ lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 6.0.1 (squeeze)
Release:        6.0.1
Codename:       squeeze

During the reboot, my computer was unable to mount my "home" directory located on "/dev/sdb1", although it could see my other devices. I also saw a message that said something like "fsck unable to resolve: 'UUID=0f24fae1-135c-4750-9928-4632e2f04f45'". That is the UUID of my "home" directory located on "/dev/sdb1".

fstab output:

Code:

$ cat /etc/fstab
# /etc/fstab: static file system information.
#
# Use 'blkid' to print the universally unique identifier for a
# device; this may be used with UUID= as a more robust way to name devices
# that works even if disks are added and removed. See fstab(5).
#
# <file system> <mount point>   <type>  <options>       <dump>  <pass>
proc            /proc           proc    defaults        0       0
# / was on /dev/sda2 during installation
UUID=f13fe524-b8ae-4f35-8831-9ba9e9db2dfa /               ext4    errors=remount-ro 0       1
# /home was on /dev/sdb1 during installation
UUID=0f24fae1-135c-4750-9928-4632e2f04f45 /home           ext4    defaults        0       2
# /wd500 was on /dev/sdc1 during installation
UUID=1fba7d0c-e82c-4837-b1bd-6192d7dd3f88 /wd500          ext4    rw,user,exec    0       2
# /wdgiga was on /dev/sdc3 during installation
UUID=3baa432d-3480-402d-8df3-1b90dbc5f655 /wdgiga         ext4    rw,user,exec    0       2
# /wdtera was on /dev/sdc2 during installation
UUID=9a762875-5e0d-4edf-8bdd-8aaaea6403d5 /wdtera         ext4    rw,user,exec    0       2
# /xtraSpace was on /dev/sda3 during installation
UUID=654d5c57-129b-4e06-92f1-673a8b4bcf56 /xtraSpace      ext4    defaults        0       2
# swap was on /dev/sda1 during installation
UUID=f7fc69af-d475-44f1-87dd-63eb8ca0b7ed none            swap    sw              0       0
/dev/scd1       /media/cdrom0   udf,iso9660 user,noauto     0       0
/dev/scd0       /media/cdrom1   udf,iso9660 user,noauto     0       0
#/dev/sdc1      /media/usb0     auto    rw,user,noauto  0       0
#/dev/sdc2      /media/usb1     auto    rw,user,noauto  0       0
#/dev/sdc3      /media/usb2     auto    rw,user,noauto  0       0
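As far as I understand it, the "fsck unable to resolve" message means that no block device with that UUID was visible at boot. Here is a minimal sketch of checking this by hand: for each UUID= entry in an fstab file, it looks for a matching entry under /dev/disk/by-uuid, which udev normally populates. The function name "unresolved_uuids" and its two optional arguments are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print every UUID= entry in an fstab file that has
# no matching entry under a by-uuid directory.
# Usage: unresolved_uuids [fstab] [by-uuid-dir]
unresolved_uuids() {
    fstab=${1:-/etc/fstab}
    byuuid=${2:-/dev/disk/by-uuid}
    # Comment lines are skipped because their first field is "#...";
    # only lines whose first field starts with UUID= are kept.
    awk '$1 ~ /^UUID=/ { sub(/^UUID=/, "", $1); print $1, $2 }' "$fstab" |
    while read -r uuid mountpoint; do
        # -e follows symlinks, so a dangling link also counts as missing.
        if [ ! -e "$byuuid/$uuid" ]; then
            echo "unresolved: $uuid ($mountpoint)"
        fi
    done
}
```

Calling "unresolved_uuids" with no arguments checks /etc/fstab against /dev/disk/by-uuid. On a boot where the disk is not detected, I would expect it to report the /home UUID. "blkid -U <uuid>" should answer the same question for a single UUID.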

I was able to boot, but I had no home directory (all my stuff was backed up). I decided to reboot using the SystemRescueCd (www.sysresccd.org) and ran "FSArchiver: Filesystem Archiver for Linux". You can see from the output that it didn't see my "home" directory (which would usually be on "/dev/sdb1"). Only sda and the My Book were detected, and the My Book, which is normally sdc, shows up here as sdb.

The output is below:

Code:

=====================>>> fsarchiver probe simple <<<=====================
[======DISK======] [=============NAME==============] [====SIZE====] [MAJ] [MIN]
[sda             ] [WDC WD800JD-75MS               ] [    74.51 GB] [  8] [  0]
[sdb             ] [My Book 1130                   ] [     1.82 TB] [  8] [ 16]

[=====DEVICE=====] [==FILESYS==] [======LABEL======] [====SIZE====] [MAJ] [MIN]
[loop0           ] [squashfs   ] [<unknown>        ] [   265.55 MB] [  7] [  0]
[sda1            ] [swap       ] [<unknown>        ] [     1.53 GB] [  8] [  1]
[sda2            ] [ext4       ] [<unknown>        ] [    14.63 GB] [  8] [  2]
[sda3            ] [ext4       ] [<unknown>        ] [    58.35 GB] [  8] [  3]
[sdb1            ] [ext4       ] [wd500            ] [   499.37 GB] [  8] [ 17]
[sdb2            ] [ext4       ] [<unknown>        ] [     1.33 TB] [  8] [ 18]
[sdb3            ] [ext4       ] [<unknown>        ] [     1.00 GB] [  8] [ 19]

I also ran gparted, but it didn't list my home partition either. I concluded that the disk holding my "home" directory (which would usually be "/dev/sdb1") was dead, so I bought a replacement hard drive.

Then I rebooted without the SystemRescueCd and saw this message scroll by: "/home: recovering journal".
The same message appears in /var/log/fsck/checkfs:

Code:

$ cat /var/log/fsck/checkfs
Log of fsck -C -R -A -a
Tue Jun 21 15:51:12 2011

fsck from util-linux-ng 2.17.2
/dev/sda3: clean, 205/3825664 files, 15093148/15295744 blocks
wd500: clean, 201315/32727040 files, 118986212/130905644 blocks
/home: recovering journal
/dev/sdc3: clean, 12/65808 files, 12660/263064 blocks
/dev/sdc2: clean, 241520/89300992 files, 275299582/357201258 blocks
/home: Clearing orphaned inode 38781677 (uid=1000, gid=1000, mode=0100644, size=32768)
/home: Clearing orphaned inode 38780933 (uid=1000, gid=1000, mode=0100600, size=77192)
/home: clean, 202121/61063168 files, 119022216/122096000 blocks

Tue Jun 21 15:52:02 2011
----------------

When I logged in, my "home" directory located on "/dev/sdb1" was back. Here is my current disk space usage:

Code:

 

$ df -H
Filesystem             Size   Used  Avail Use% Mounted on
/dev/sda2               16G    12G   3.3G  79% /
tmpfs                  1.9G      0   1.9G   0% /lib/init/rw
udev                   1.9G   246k   1.9G   1% /dev
tmpfs                  1.9G   4.1k   1.9G   1% /dev/shm
/dev/sdb1              493G   472G    21G  96% /home
/dev/sdc1              528G   471G    57G  90% /wd500
/dev/sdc3              1.1G    35M   1.1G   4% /wdgiga
/dev/sdc2              1.5T   1.2T   336G  77% /wdtera
/dev/sda3               62G    61G   830M  99% /xtraSpace

Below is a heavily truncated excerpt from /var/log/messages that seems to refer to the device that had problems ("/home" on "/dev/sdb1"). I don't know if it will be useful:

Code:

Jun 21 12:03:12 kub nagios3: Auto-save of retention data completed successfully.
Jun 21 12:31:24 kub kernel: [59330.816096] ata4: hard resetting link
Jun 21 12:31:29 kub kernel: [59336.180017] ata4: link is slow to respond, please be patient (ready=0)
Jun 21 12:31:34 kub kernel: [59340.828034] ata4: hard resetting link
Jun 21 12:31:39 kub kernel: [59346.188034] ata4: link is slow to respond, please be patient (ready=0)
Jun 21 12:31:44 kub kernel: [59350.836041] ata4: hard resetting link
Jun 21 12:31:49 kub kernel: [59356.196016] ata4: link is slow to respond, please be patient (ready=0)
Jun 21 12:32:19 kub kernel: [59385.876034] ata4: limiting SATA link speed to 1.5 Gbps
Jun 21 12:32:19 kub kernel: [59385.876039] ata4: hard resetting link
Jun 21 12:32:24 kub kernel: [59390.900032] ata4.00: disabled
Jun 21 12:32:24 kub kernel: [59390.900040] ata4.00: device reported invalid CHS sector 0
Jun 21 12:32:24 kub kernel: [59390.900044] ata4.00: device reported invalid CHS sector 0
Jun 21 12:32:24 kub kernel: [59390.900062] ata4: EH complete
Jun 21 12:32:24 kub kernel: [59390.900090] sd 3:0:0:0: [sdb] Unhandled error code
Jun 21 12:32:24 kub kernel: [59390.900093] sd 3:0:0:0: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jun 21 12:32:24 kub kernel: [59390.900099] sd 3:0:0:0: [sdb] CDB: Read(10): 28 00 26 c8 48 1f 00 01 00 00
Jun 21 12:32:24 kub kernel: [59390.900139] sd 3:0:0:0: [sdb] Unhandled error code
Jun 21 12:32:24 kub kernel: [59390.900142] sd 3:0:0:0: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jun 21 12:32:24 kub kernel: [59390.900146] sd 3:0:0:0: [sdb] CDB: Read(10): 28 00 26 c8 47 1f 00 01 00 00
Jun 21 12:32:24 kub kernel: [59390.998653] sd 3:0:0:0: [sdb] Unhandled error code
Jun 21 12:32:24 kub kernel: [59390.998659] sd 3:0:0:0: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jun 21 12:32:24 kub kernel: [59390.998665] sd 3:0:0:0: [sdb] CDB: Read(10): 28 00 0a a2 13 2f 00 00 08 00
Jun 21 12:32:30 kub kernel: [59396.804145] sd 3:0:0:0: [sdb] Unhandled error code
Jun 21 12:32:30 kub kernel: [59396.804151] sd 3:0:0:0: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jun 21 12:32:30 kub kernel: [59396.804157] sd 3:0:0:0: [sdb] CDB: Write(10): 2a 00 11 f2 46 bf 00 00 08 00
Jun 21 12:32:30 kub kernel: [59396.804181] lost page write due to I/O error on sdb1
Jun 21 12:32:30 kub kernel: [59396.804201] JBD2: Detected IO errors while flushing file data on sdb1-8
Jun 21 12:32:30 kub kernel: [59396.804213] sd 3:0:0:0: [sdb] Unhandled error code
Jun 21 12:32:30 kub kernel: [59396.804337] sd 3:0:0:0: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jun 21 12:32:30 kub kernel: [59396.804342] sd 3:0:0:0: [sdb] CDB: Write(10): 2a 00 00 00 31 77 00 00 08 00
Jun 21 12:32:30 kub kernel: [59396.804363] lost page write due to I/O error on sdb1
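In case it helps anyone reading the log, here is a rough sketch of decoding the "CDB:" lines above. It assumes the standard 10-byte SCSI Read(10)/Write(10) layout, where bytes 2 to 5 (counting from 0) hold the big-endian starting LBA and bytes 7 to 8 hold the number of blocks. The function name "decode_cdb10" is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: decode the starting LBA and block count from a
# 10-byte SCSI CDB given as ten hex bytes in one string,
# e.g. "28 00 26 c8 48 1f 00 01 00 00".
decode_cdb10() {
    # Split the single argument into positional parameters $1..$10.
    set -- $1
    if [ $# -ne 10 ]; then
        echo "expected 10 hex bytes" >&2
        return 1
    fi
    # $3..$6 are CDB bytes 2-5 (the LBA), $8 and $9 are bytes 7-8 (the count).
    lba=$((0x$3$4$5$6))
    count=$((0x$8$9))
    echo "lba=$lba blocks=$count"
}

# The first Read(10) and the first Write(10) from the log above.
decode_cdb10 "28 00 26 c8 48 1f 00 01 00 00"
decode_cdb10 "2a 00 11 f2 46 bf 00 00 08 00"
```

If I have the layout right, both starting LBAs are below the 976773168 LBA48 sectors that hdparm reports for this disk, so the requests themselves look valid. Together with the "hard resetting link" and "ata4.00: disabled" lines, that makes me think the whole drive dropped off the bus rather than failing on one particular sector, but that is only my reading.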

My question is: what should I do now? How do I check my hard disk for errors under Linux?

Thank you for your advice.