How do I check my hard disk for errors? Possible hard disk failure

lgtrean · 06-21-2011, 05:41 PM

Hello,

I was using Terminal to browse a directory in my home folder. My "home" directory is located on "/dev/sdb1".
When I typed "ls" in one of my directories, the output was garbage and didn't show the files in the directory. I think it said something like "input/output error". Unfortunately, I didn't write down the exact error; I rebooted instead.

The hard disk with the problem is:

Code:

$ sudo hdparm -I /dev/sdb
[sudo] password for brian: 

/dev/sdb:

ATA device, with non-removable media
	Model Number:       WDC WD5000KS-00MNB0                     
	Serial Number:      WD-WCANU1019633
	Firmware Revision:  07.02E07
Standards:
	Supported: 7 6 5 4 
	Likely used: 8
Configuration:
	Logical		max	current
	cylinders	16383	16383
	heads		16	16
	sectors/track	63	63
	--
	CHS current addressable sectors:   16514064
	LBA    user addressable sectors:  268435455
	LBA48  user addressable sectors:  976773168
	Logical/Physical Sector size:           512 bytes
	device size with M = 1024*1024:      476940 MBytes
	device size with M = 1000*1000:      500107 MBytes (500 GB)
	cache/buffer size  = 16384 KBytes
Capabilities:
	LBA, IORDY(can be disabled)
	Queue depth: 32
	Standby timer values: spec'd by Standard, with device specific minimum
	R/W multiple sector transfer: Max = 16	Current = 8
	Recommended acoustic management value: 128, current value: 128
	DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6 
	     Cycle time: min=120ns recommended=120ns
	PIO: pio0 pio1 pio2 pio3 pio4 
	     Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
	Enabled	Supported:
	   *	SMART feature set
	    	Security Mode feature set
	   *	Power Management feature set
	   *	Write cache
	   *	Look-ahead
	   *	Host Protected Area feature set
	   *	WRITE_BUFFER command
	   *	READ_BUFFER command
	   *	NOP cmd
	   *	DOWNLOAD_MICROCODE
	    	Power-Up In Standby feature set
	   *	SET_FEATURES required to spinup after power up
	    	SET_MAX security extension
	   *	Automatic Acoustic Management feature set
	   *	48-bit Address feature set
	   *	Device Configuration Overlay feature set
	   *	Mandatory FLUSH_CACHE
	   *	FLUSH_CACHE_EXT
	   *	SMART error logging
	   *	SMART self-test
	   *	General Purpose Logging feature set
	   *	WRITE_{DMA|MULTIPLE}_FUA_EXT
	   *	64-bit World wide name
	   *	Gen1 signaling speed (1.5Gb/s)
	   *	Gen2 signaling speed (3.0Gb/s)
	   *	Native Command Queueing (NCQ)
	   *	Host-initiated interface power management
	   *	Phy event counters
	   *	DMA Setup Auto-Activate optimization
	   *	Software settings preservation
	   *	SMART Command Transport (SCT) feature set
	   *	SCT Long Sector Access (AC1)
	   *	SCT LBA Segment Access (AC2)
	   *	SCT Error Recovery Control (AC3)
	   *	SCT Features Control (AC4)
	   *	SCT Data Tables (AC5)
	    	unknown 206[12] (vendor specific)
Security: 
	Master password revision code = 65534
		supported
	not	enabled
	not	locked
		frozen
	not	expired: security count
	not	supported: enhanced erase
	138min for SECURITY ERASE UNIT. 
Logical Unit WWN Device Identifier: 50014ee20002257a
	NAA		: 5
	IEEE OUI	: 0014ee
	Unique ID	: 20002257a
Checksum: correct

uname output:

Code:

$ uname -r
2.6.32-5-amd64

lsb_release output:

Code:

$ lsb_release -a
No LSB modules are available.
Distributor ID:	Debian
Description:	Debian GNU/Linux 6.0.1 (squeeze)
Release:	6.0.1
Codename:	squeeze

During the reboot my computer was unable to mount my "home" directory located on "/dev/sdb1", but I was able to see my other devices. During the reboot I saw a message that said something like "fsck unable to resolve: 'UUID=0f24fae1-135c-4750-9928-4632e2f04f45'". That's the UUID of my "home" directory located on "/dev/sdb1".

fstab output:

Code:

$ cat /etc/fstab
# /etc/fstab: static file system information.
#
# Use 'blkid' to print the universally unique identifier for a
# device; this may be used with UUID= as a more robust way to name devices
# that works even if disks are added and removed. See fstab(5).
#
# <file system> <mount point>   <type>  <options>       <dump>  <pass>
proc            /proc           proc    defaults        0       0
# / was on /dev/sda2 during installation
UUID=f13fe524-b8ae-4f35-8831-9ba9e9db2dfa /               ext4    errors=remount-ro 0       1
# /home was on /dev/sdb1 during installation
UUID=0f24fae1-135c-4750-9928-4632e2f04f45 /home           ext4    defaults        0       2
# /wd500 was on /dev/sdc1 during installation
UUID=1fba7d0c-e82c-4837-b1bd-6192d7dd3f88 /wd500          ext4    rw,user,exec        0       2
# /wdgiga was on /dev/sdc3 during installation
UUID=3baa432d-3480-402d-8df3-1b90dbc5f655 /wdgiga         ext4    rw,user,exec        0       2
# /wdtera was on /dev/sdc2 during installation
UUID=9a762875-5e0d-4edf-8bdd-8aaaea6403d5 /wdtera         ext4    rw,user,exec        0       2
# /xtraSpace was on /dev/sda3 during installation
UUID=654d5c57-129b-4e06-92f1-673a8b4bcf56 /xtraSpace      ext4    defaults        0       2
# swap was on /dev/sda1 during installation
UUID=f7fc69af-d475-44f1-87dd-63eb8ca0b7ed none            swap    sw              0       0
/dev/scd1       /media/cdrom0   udf,iso9660 user,noauto     0       0
/dev/scd0       /media/cdrom1   udf,iso9660 user,noauto     0       0
#/dev/sdc1       /media/usb0     auto    rw,user,noauto  0       0
#/dev/sdc2       /media/usb1     auto    rw,user,noauto  0       0
#/dev/sdc3       /media/usb2     auto    rw,user,noauto  0       0
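
A message like "fsck unable to resolve UUID" generally means no block device with that UUID was visible at boot. A minimal sketch of how the UUID entries in an fstab like the one above could be cross-checked against the devices the kernel currently sees. The function name check_fstab_uuids is made up for this example, and /dev/disk/by-uuid is assumed to be populated by udev, as it normally is on Debian:

```shell
#!/bin/sh
# Hypothetical helper: report, for every UUID= line in an fstab file,
# whether a device with that UUID is currently visible.
# $1 = fstab path (default /etc/fstab)
# $2 = directory of by-UUID links (default /dev/disk/by-uuid, assumed to be kept up to date by udev)
check_fstab_uuids() {
    fstab=${1:-/etc/fstab}
    by_uuid=${2:-/dev/disk/by-uuid}
    # Comment lines start with "#", so only real UUID= entries match here.
    awk '$1 ~ /^UUID=/ { sub(/^UUID=/, "", $1); print $1, $2 }' "$fstab" |
    while read -r uuid mountpoint; do
        if [ -e "$by_uuid/$uuid" ]; then
            echo "present $mountpoint $uuid"
        else
            echo "MISSING $mountpoint $uuid"
        fi
    done
}
```

Running check_fstab_uuids with no arguments checks the live system. A "MISSING /home ..." line would suggest the disk was not detected at all, rather than that the file system on it was damaged.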

I was able to boot but had no home directory (all my stuff was backed up). I decided to reboot using the SystemRescueCd (www.sysresccd.org). I ran "FSArchiver: Filesystem Archiver for Linux". You can see from the output that it didn't see the disk holding my "home" directory (which normally shows up as "/dev/sdb1").

The output is below:

Code:

=====================>>> fsarchiver probe simple <<<=====================
[======DISK======] [=============NAME==============] [====SIZE====] [MAJ] [MIN]
[sda            ] [WDC WD800JD-75MS              ] [    74.51 GB] [  8] [  0]
[sdb            ] [My Book 1130                  ] [    1.82 TB] [  8] [ 16]

[=====DEVICE=====] [==FILESYS==] [======LABEL======] [====SIZE====] [MAJ] [MIN]
[loop0          ] [squashfs  ] [<unknown>        ] [  265.55 MB] [  7] [  0]
[sda1            ] [swap      ] [<unknown>        ] [    1.53 GB] [  8] [  1]
[sda2            ] [ext4      ] [<unknown>        ] [    14.63 GB] [  8] [  2]
[sda3            ] [ext4      ] [<unknown>        ] [    58.35 GB] [  8] [  3]
[sdb1            ] [ext4      ] [wd500            ] [  499.37 GB] [  8] [ 17]
[sdb2            ] [ext4      ] [<unknown>        ] [    1.33 TB] [  8] [ 18]
[sdb3            ] [ext4      ] [<unknown>        ] [    1.00 GB] [  8] [ 19]

I also ran gparted, but it didn't list my home partition either. I figured the disk holding my "home" directory (normally "/dev/sdb1") was dead, so I bought a replacement hard drive.

Then, I rebooted without using the SystemRescueCd and I saw this message scroll by, "/home: recovering journal".
I also saw that message when I looked in /var/log/fsck/checkfs:

Code:

$ cat /var/log/fsck/checkfs 
Log of fsck -C -R -A -a 
Tue Jun 21 15:51:12 2011

fsck from util-linux-ng 2.17.2
/dev/sda3: clean, 205/3825664 files, 15093148/15295744 blocks
wd500: clean, 201315/32727040 files, 118986212/130905644 blocks
/home: recovering journal
/dev/sdc3: clean, 12/65808 files, 12660/263064 blocks
/dev/sdc2: clean, 241520/89300992 files, 275299582/357201258 blocks
/home: Clearing orphaned inode 38781677 (uid=1000, gid=1000, mode=0100644, size=32768)
/home: Clearing orphaned inode 38780933 (uid=1000, gid=1000, mode=0100600, size=77192)
/home: clean, 202121/61063168 files, 119022216/122096000 blocks

Tue Jun 21 15:52:02 2011
----------------

When I logged in, my "home" directory located on "/dev/sdb1" was alive again. Here's the current output of my disk space usage:

Code:

 
$ df -H
Filesystem             Size   Used  Avail Use% Mounted on
/dev/sda2               16G    12G   3.3G  79% /
tmpfs                  1.9G      0   1.9G   0% /lib/init/rw
udev                   1.9G   246k   1.9G   1% /dev
tmpfs                  1.9G   4.1k   1.9G   1% /dev/shm
/dev/sdb1              493G   472G    21G  96% /home
/dev/sdc1              528G   471G    57G  90% /wd500
/dev/sdc3              1.1G    35M   1.1G   4% /wdgiga
/dev/sdc2              1.5T   1.2T   336G  77% /wdtera
/dev/sda3               62G    61G   830M  99% /xtraSpace

Below is a heavily truncated excerpt from /var/log/messages that might refer to the device that had problems ("/home on /dev/sdb1"). I don't know if it will be useful:

Code:

Jun 21 12:03:12 kub nagios3: Auto-save of retention data completed successfully.
Jun 21 12:31:24 kub kernel: [59330.816096] ata4: hard resetting link
Jun 21 12:31:29 kub kernel: [59336.180017] ata4: link is slow to respond, please be patient (ready=0)
Jun 21 12:31:34 kub kernel: [59340.828034] ata4: hard resetting link
Jun 21 12:31:39 kub kernel: [59346.188034] ata4: link is slow to respond, please be patient (ready=0)
Jun 21 12:31:44 kub kernel: [59350.836041] ata4: hard resetting link
Jun 21 12:31:49 kub kernel: [59356.196016] ata4: link is slow to respond, please be patient (ready=0)
Jun 21 12:32:19 kub kernel: [59385.876034] ata4: limiting SATA link speed to 1.5 Gbps
Jun 21 12:32:19 kub kernel: [59385.876039] ata4: hard resetting link
Jun 21 12:32:24 kub kernel: [59390.900032] ata4.00: disabled
Jun 21 12:32:24 kub kernel: [59390.900040] ata4.00: device reported invalid CHS sector 0
Jun 21 12:32:24 kub kernel: [59390.900044] ata4.00: device reported invalid CHS sector 0
Jun 21 12:32:24 kub kernel: [59390.900062] ata4: EH complete
Jun 21 12:32:24 kub kernel: [59390.900090] sd 3:0:0:0: [sdb] Unhandled error code
Jun 21 12:32:24 kub kernel: [59390.900093] sd 3:0:0:0: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jun 21 12:32:24 kub kernel: [59390.900099] sd 3:0:0:0: [sdb] CDB: Read(10): 28 00 26 c8 48 1f 00 01 00 00
Jun 21 12:32:24 kub kernel: [59390.900139] sd 3:0:0:0: [sdb] Unhandled error code
Jun 21 12:32:24 kub kernel: [59390.900142] sd 3:0:0:0: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jun 21 12:32:24 kub kernel: [59390.900146] sd 3:0:0:0: [sdb] CDB: Read(10): 28 00 26 c8 47 1f 00 01 00 00
Jun 21 12:32:24 kub kernel: [59390.998653] sd 3:0:0:0: [sdb] Unhandled error code
Jun 21 12:32:24 kub kernel: [59390.998659] sd 3:0:0:0: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jun 21 12:32:24 kub kernel: [59390.998665] sd 3:0:0:0: [sdb] CDB: Read(10): 28 00 0a a2 13 2f 00 00 08 00
Jun 21 12:32:30 kub kernel: [59396.804145] sd 3:0:0:0: [sdb] Unhandled error code
Jun 21 12:32:30 kub kernel: [59396.804151] sd 3:0:0:0: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jun 21 12:32:30 kub kernel: [59396.804157] sd 3:0:0:0: [sdb] CDB: Write(10): 2a 00 11 f2 46 bf 00 00 08 00
Jun 21 12:32:30 kub kernel: [59396.804181] lost page write due to I/O error on sdb1
Jun 21 12:32:30 kub kernel: [59396.804201] JBD2: Detected IO errors while flushing file data on sdb1-8
Jun 21 12:32:30 kub kernel: [59396.804213] sd 3:0:0:0: [sdb] Unhandled error code
Jun 21 12:32:30 kub kernel: [59396.804337] sd 3:0:0:0: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jun 21 12:32:30 kub kernel: [59396.804342] sd 3:0:0:0: [sdb] CDB: Write(10): 2a 00 00 00 31 77 00 00 08 00
Jun 21 12:32:30 kub kernel: [59396.804363] lost page write due to I/O error on sdb1
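
Lines like the ones above can be pulled out of a long log with grep. This is only a sketch: the function names are hypothetical, and the patterns are taken from the messages in this excerpt, so other kernels or drivers may word things differently:

```shell
#!/bin/sh
# Patterns copied from the log excerpt in this thread (an assumption:
# they are not an exhaustive list of disk-related kernel messages).
DISK_ERROR_PATTERNS='hard resetting link|Unhandled error code|I/O error|IO errors'

# Print every log line that matches one of the patterns.
# $1 = log file (default /var/log/messages)
disk_error_lines() {
    grep -E "$DISK_ERROR_PATTERNS" "${1:-/var/log/messages}"
}

# Print only the number of matching lines.
count_disk_errors() {
    grep -c -E "$DISK_ERROR_PATTERNS" "${1:-/var/log/messages}"
}
```

For example, count_disk_errors /var/log/messages gives a quick idea of how often the problem has occurred, and disk_error_lines shows when.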

My question is: what should I do now? Using Linux, how do I check my hard disk for errors?

Thank you for your advice.

TobiSGD · 06-21-2011, 05:44 PM

Download the manufacturer's diagnostic tool, burn it to a CD and test the drive.

John VV · 06-21-2011, 06:25 PM

a western digital drive
the only problem is there tool is MS Windows ONLY
install win7 then use there tool to test .

"/." and "ars tech" and i think "phoronix" had news on that a while back .

TobiSGD · 06-21-2011, 06:30 PM

Quote:

Originally Posted by John VV

the only problem is there tool is MS Windows ONLY

Wrong, you can get the bootable ISO with a DOS-based test utility here.
No need for Windows at all.

Soadyheid · 06-22-2011, 08:15 AM

@ JohnVV

Quote:

the only problem is there tool is MS Windows ONLY
install win7 then use there tool to test .

Sorry John, I don't mean to be cheeky or rude... Text speak causes problems in understanding posts, so does using the wrong word.

Code:

there = over there, their = belonging to them, they're = 'they are'

Play Bonny!

H_TeXMeX_H · 06-22-2011, 09:11 AM

You can run either the manufacturer's utility:
http://www.ultimatebootcd.com/

or you can run 'smartctl -t long /dev/sdb', wait for it to finish, then check the result.

Personally, if I saw errors like that, had never run SMART tests, and the drive was either very new or old, I would back up the data ASAP.
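
The smartctl route could look roughly like the sketch below. It assumes the smartmontools package is installed and that the commands are run as root. The wrapper function names are made up for this example, and the awk helper assumes the usual ten-column attribute table printed by 'smartctl -A', with the raw value in the last column:

```shell
#!/bin/sh
# Start the extended (long) self-test on a drive, e.g. /dev/sdb.
# The test runs inside the drive; smartctl prints an estimated duration.
start_long_test() {
    smartctl -t long "$1"
}

# After the test has had time to finish, show the self-test log.
show_selftest_log() {
    smartctl -l selftest "$1"
}

# Read 'smartctl -A' output on stdin and print the raw value of one
# attribute, e.g. Reallocated_Sector_Ct or Current_Pending_Sector.
# Assumes the attribute name is column 2 and the raw value is column 10.
smart_raw_value() {
    awk -v name="$1" '$2 == name { print $10 }'
}
```

Typical use would be 'start_long_test /dev/sdb', then later 'show_selftest_log /dev/sdb' and 'smartctl -A /dev/sdb | smart_raw_value Current_Pending_Sector'. Non-zero reallocated or pending sector counts usually point at the disk surface, while a growing UDMA_CRC_Error_Count more often suggests a cable or connector problem.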

lgtrean · 06-22-2011, 04:06 PM

Thank you everyone.

I downloaded the Data Lifeguard Diagnostic for DOS (CD) from wdc.com and ran all the tests. The disk passed the tests.

I wonder what caused the problem to begin with.