LinuxQuestions.org

davcefai 05-10-2014 12:53 PM

Bad Sector won't go away
 
On a Debian Unstable system (AMD A10-6800K CPU, 4 GB RAM, 2 x 160 GB SATA HDDs, 2 SATA DVD writers) I am getting this message daily:

Quote:

The following warning/error was logged by the smartd daemon:

Device: /dev/sda [SAT], 1 Currently unreadable (pending) sectors

Device info:
Maxtor 6L160M0, S/N:L407W4QH, FW:BANC1E00, 163 GB
When I run a short test
Code:

smartctl -t short -d sat /dev/sda
and then look at the result with
Code:

smartctl -l selftest /dev/sda
I always get the same result:
Quote:

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed: read failure 60% 48224 144701458
# 2 Short offline Completed: read failure 60% 48224 144701458
# 3 Short offline Completed: read failure 60% 48164 144701458
# 4 Short offline Completed: read failure 60% 48163 144701458
# 5 Short offline Completed: read failure 60% 48163 144701458
This is the partition table printed by fdisk:

Quote:

Command (m for help): p

Disk /dev/sda: 163.9 GB, 163928604672 bytes
255 heads, 63 sectors/track, 19929 cylinders, total 320173056 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x0005fe80

Device Boot Start End Blocks Id System
/dev/sda1 * 63 39070079 19535008+ 83 Linux
/dev/sda2 39070080 320159384 140544652+ 5 Extended
/dev/sda5 39070143 164071844 62500851 83 Linux
/dev/sda6 164071908 203141924 19535008+ 83 Linux
/dev/sda7 203141988 310568579 53713296 83 Linux
/dev/sda8 310568643 315259559 2345458+ 83 Linux
/dev/sda9 315259623 320159384 2449881 82 Linux swap / Solaris
I interpret this to mean that the problem lies in sda5.
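
The lookup can also be done mechanically. What follows is only a minimal sketch in Python, not something from the original post: the start and end sectors are copied from the fdisk listing above, and `partition_for_lba` is a hypothetical helper name. The extended container sda2 is left out so that only partitions that can hold a filesystem are reported.

```python
# Start/end sectors (512-byte units) copied from the fdisk listing above.
# sda2 is the extended container, so only the logical partitions inside it are listed.
PARTITIONS = [
    ("/dev/sda1", 63, 39070079),
    ("/dev/sda5", 39070143, 164071844),
    ("/dev/sda6", 164071908, 203141924),
    ("/dev/sda7", 203141988, 310568579),
    ("/dev/sda8", 310568643, 315259559),
    ("/dev/sda9", 315259623, 320159384),
]


def partition_for_lba(lba, partitions=PARTITIONS):
    """Return the partition whose sector range contains lba, or None."""
    for name, start, end in partitions:
        if start <= lba <= end:
            return name
    return None


# LBA_of_first_error from the self-test log
print(partition_for_lba(144701458))
```

The comparison has to be made in 512-byte sectors on both sides, because that is the unit used by both the self-test log and the fdisk Start/End columns.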

However, booting with Knoppix and running
Code:

e2fsck -c -f -k -p /dev/sda5
only results in a message "Updating inode table" (or something similar) and the next day I get the SMART warning again.

Should e2fsck have cured this?

Help appreciated.

Emerson 05-10-2014 01:35 PM

Your hard drive is toast; it is not passing the test. Order a new one NOW, and make sure your backups are current.

rknichols 05-10-2014 09:57 PM

Please post the output from "smartctl -A /dev/sda". The problem could be as simple as a single bad sector which just needs to be written to so that the drive can reallocate it to a spare sector. That is _only_ going to happen when a write to that sector occurs, unless at some point the drive does manage to get a correct read from that sector and so can reallocate it on its own. Bad sectors that are pending reallocation will cause some offline tests to fail.

Assuming that the problem is just some small number of bad sectors, the Bad Block HOWTO shows the procedure for finding them, determining what file they are (or are not) part of, and making the drive reallocate them. If there are just a small number of bad sectors and this number is not increasing with time, then the drive is OK to use. There are various events such as vibration or power supply glitches that can cause a sector to become bad without being a warning of impending doom.

Good backups are, of course, always important. Drives can and do fail without warning.

Emerson 05-10-2014 10:20 PM

Code:

Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed: read failure 60% 48224 144701458

Do not get confused: the test was not completed. It is a failure, and the drive is dead.
Code:

smartctl --all /dev/sda | grep -e "Reallocated_Sector_Ct" -e "Current_Pending_Sector" -e "Offline_Uncorrectable" -e "UDMA_CRC_Error_Count" -e "Hardware_ECC_Recovered"
Above is for sda, the info you should be looking at.

davcefai 05-11-2014 02:08 AM

Feedback as requested:

Quote:

Please post the output from "smartctl -A /dev/sda".
I am assuming that Item 5 is the problem which is why I am reluctant to just dump the drive on this basis. sda5 is /home which is backed up daily.

Code:

davcefai:/home/david# smartctl -A /dev/sda
smartctl 6.2 2013-07-26 r3841 [i686-linux-3.14-1-686-pae] (local build)                 
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org             
                                                                                               
Warning! SMART Attribute Thresholds Structure error: invalid SMART checksum.                           
=== START OF READ SMART DATA SECTION ===                                                               
SMART Attributes Data Structure revision number: 16                                                     
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  3 Spin_Up_Time            0x0027  207  205  063    Pre-fail  Always      -      10075
  4 Start_Stop_Count        0x0032  251  251  000    Old_age  Always      -      4903
  5 Reallocated_Sector_Ct  0x0033  253  253  063    Pre-fail  Always      -      1
  6 Read_Channel_Margin    0x0001  253  253  100    Pre-fail  Offline      -      0
  7 Seek_Error_Rate        0x000a  253  252  000    Old_age  Always      -      0
  8 Seek_Time_Performance  0x0027  247  232  187    Pre-fail  Always      -      54510
  9 Power_On_Minutes        0x0032  114  114  000    Old_age  Always      -      169h+38m
 10 Spin_Retry_Count        0x002b  253  252  157    Pre-fail  Always      -      0
 11 Calibration_Retry_Count 0x002b  253  252  223    Pre-fail  Always      -      0
 12 Power_Cycle_Count      0x0032  242  242  000    Old_age  Always      -      4733
192 Power-Off_Retract_Count 0x0032  253  253  000    Old_age  Always      -      0
193 Load_Cycle_Count        0x0032  253  253  000    Old_age  Always      -      0
194 Temperature_Celsius    0x0032  036  253  000    Old_age  Always      -      32
195 Hardware_ECC_Recovered  0x000a  253  252  000    Old_age  Always      -      7920
196 Reallocated_Event_Count 0x0008  253  253  000    Old_age  Offline      -      0
197 Current_Pending_Sector  0x0008  253  253  000    Old_age  Offline      -      1
198 Offline_Uncorrectable  0x0008  252  252  000    Old_age  Offline      -      1
199 UDMA_CRC_Error_Count    0x0008  199  199  000    Old_age  Offline      -      0
200 Multi_Zone_Error_Rate  0x000a  253  252  000    Old_age  Always      -      0
201 Soft_Read_Error_Rate    0x000a  253  252  000    Old_age  Always      -      0
202 Data_Address_Mark_Errs  0x000a  253  252  000    Old_age  Always      -      0
203 Run_Out_Cancel          0x000b  253  252  180    Pre-fail  Always      -      0
204 Soft_ECC_Correction    0x000a  253  252  000    Old_age  Always      -      0
205 Thermal_Asperity_Rate  0x000a  253  252  000    Old_age  Always      -      0
207 Spin_High_Current      0x002a  253  252  000    Old_age  Always      -      0
208 Spin_Buzz              0x002a  253  252  000    Old_age  Always      -      0
209 Offline_Seek_Performnce 0x0024  239  239  000    Old_age  Offline      -      171
210 Unknown_Attribute      0x0032  253  252  000    Old_age  Always      -      0
211 Unknown_Attribute      0x0032  253  252  000    Old_age  Always      -      0
212 Unknown_Attribute      0x0032  253  252  000    Old_age  Always      -      0

davcefai:/home/david#


davcefai 05-11-2014 03:33 AM

@ Emerson

This is the output of the command you suggested. Does it look so bad that the drive needs to be dumped? OK, good excuse to get a bigger drive, means I don't need to dump a lot of Beethoven to DVD :-)


Code:

davcefai:/home/david# smartctl --all /dev/sda | grep -e "Reallocated_Sector_Ct" -e "Current_Pending_Sector" -e "Offline_Uncorrectable" -e "UDMA_CRC_Error_Count" -e "Hardware_ECC_Recovered"
  5 Reallocated_Sector_Ct  0x0033  253  253  063    Pre-fail  Always      -      1
195 Hardware_ECC_Recovered  0x000a  253  252  000    Old_age  Always      -      8312
197 Current_Pending_Sector  0x0008  253  253  000    Old_age  Offline      -      1
198 Offline_Uncorrectable  0x0008  252  252  000    Old_age  Offline      -      1
199 UDMA_CRC_Error_Count    0x0008  199  199  000    Old_age  Offline      -      0
davcefai:/home/david#


rknichols 05-11-2014 08:06 AM

The problem is #197, Current_Pending_Sector. That is just one bad sector, and the drive otherwise looks fine. A bad sector that is pending reallocation is visible to the OS (will cause an I/O error if read) and will cause the offline test to fail at that location. Follow the steps in the Bad Block HOWTO to get that sector reallocated. Parameter #5, Reallocated_Sector_Ct, should then increase to 2, and the offline tests should then pass. That drive hasn't been used much, just under 170 power-on hours, and you should expect it to have a normal lifetime.
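
For watching these counters over time, here is a rough sketch in Python (not part of the original advice). It assumes the ten-column attribute layout shown in the smartctl output earlier in the thread; `raw_values` and `WATCHED` are hypothetical names, and the sample rows are copied from that output. Raw values are kept as strings because some attributes (such as Power_On_Minutes above) are not plain integers.

```python
# Attribute rows copied from the smartctl output posted earlier in the thread.
SAMPLE = """\
  5 Reallocated_Sector_Ct  0x0033  253  253  063    Pre-fail  Always      -      1
195 Hardware_ECC_Recovered  0x000a  253  252  000    Old_age  Always      -      8312
197 Current_Pending_Sector  0x0008  253  253  000    Old_age  Offline      -      1
198 Offline_Uncorrectable  0x0008  252  252  000    Old_age  Offline      -      1
199 UDMA_CRC_Error_Count    0x0008  199  199  000    Old_age  Offline      -      0
"""

# Attribute IDs that relate to bad or reallocated sectors.
WATCHED = {5, 197, 198}


def raw_values(text, ids=WATCHED):
    """Map attribute name -> RAW_VALUE (as a string) for the watched IDs."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        # A data row has at least 10 columns and starts with a numeric ID.
        if len(fields) >= 10 and fields[0].isdigit() and int(fields[0]) in ids:
            result[fields[1]] = fields[9]
    return result


for name, raw in raw_values(SAMPLE).items():
    print(name, raw)
```

If the reallocation works as described, a later run should show Current_Pending_Sector at 0 while Reallocated_Sector_Ct goes up by one.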

The steps in the HOWTO aren't as hard as they look (it covers several different cases -- you will be concerned with just one), but if you don't want to do that, the ham-fisted approach would be to back up the files on the affected partition, clear the partition with "dd if=/dev/zero of=/dev/sda5 bs=64k", then remake the filesystem and restore the backup.

Of course if you just want a bigger disk, by all means go ahead and get one.

BTW, when you post output please use [CODE]...[/CODE] tags and not [QUOTE]...[/QUOTE] tags so that formatting is preserved.

metaschima 05-11-2014 11:05 AM

The attributes look fine, except of course for the bad sector, which is not good. You could try zeroing the HDD as rknichols suggests, since this may repair soft errors. Obviously, back up before doing this.

TobiSGD 05-11-2014 10:00 PM

Quote:

Originally Posted by rknichols (Post 5168778)
BTW, when you post output please use [CODE]...[/CODE] tags and not [QUOTE]...[/QUOTE] tags so that formatting is preserved.

Indeed, this will make your posts much more readable. I have fixed that for now.

davcefai 05-13-2014 11:18 AM

Apologies and thanks for the format fix.

davcefai 05-13-2014 11:37 AM

I have tried following the Bad Block HOWTO but have run into a snag. Here follows a blow-by-blow account in the hope that somebody will point out where I strayed from the straight and narrow path.

Step 1: Find error:

Code:

davcefai:/home/david# smartctl -l selftest /dev/sda
smartctl 6.2 2013-07-26 r3841 [i686-linux-3.14-1-686-pae] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline      Completed: read failure      60%    48224        144701458
# 2  Short offline      Completed: read failure      60%    48224        144701458
# 3  Short offline      Completed: read failure      60%    48164        144701458
# 4  Short offline      Completed: read failure      60%    48163        144701458

Definitely at 144701458!

-----------------------------------------------------------------------------------------------
Step 2: Locate Partition where the error is:


Block number = 144701458 x 512 / 4096 = 18087682.25

Code:

davcefai:/home/david# fdisk -lu /dev/sda

Disk /dev/sda: 163.9 GB, 163928604672 bytes
255 heads, 63 sectors/track, 19929 cylinders, total 320173056 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x0005fe80

  Device Boot      Start        End      Blocks  Id  System
/dev/sda1  *          63    39070079    19535008+  83  Linux
/dev/sda2        39070080  320159384  140544652+  5  Extended
/dev/sda5        39070143  164071844    62500851  83  Linux
/dev/sda6      164071908  203141924    19535008+  83  Linux
/dev/sda7      203141988  310568579    53713296  83  Linux
/dev/sda8      310568643  315259559    2345458+  83  Linux
/dev/sda9      315259623  320159384    2449881  82  Linux swap / Solaris
davcefai:/home/david#

18087682 must be in sda1


Step 3: Find Mount Point and fs type

looking in /etc/fstab I find:

Code:

# /dev/sda1        = /
/dev/disk/by-uuid/c22032e6-9df4-4cc9-a1ff-9b2698b4a2b7 / ext3 nouser,defaults,errors=remount-ro,atime,auto,rw,dev,exec,suid 0 1

No surprises here (I think)

Step 4: Confirm the Block Size:

Code:

davcefai:/home/david#  tune2fs -l /dev/sda1 | grep Block
Block count:              4883752
Block size:              4096
Blocks per group:        32768

OK, 4096 as assumed earlier.

Step 5: Now to locate the inode:

Code:

davcefai:/home/david# debugfs
debugfs 1.42.9 (4-Feb-2014)
debugfs:  open /dev/sda1
debugfs:  testb 18087682
Illegal block number passed to ext2fs_test_block_bitmap #18087682 for block bitmap for /dev/sda1
Block 18087682 not in use
debugfs:  testb 18087683
Illegal block number passed to ext2fs_test_block_bitmap #18087683 for block bitmap for /dev/sda1
Block 18087683 not in use
debugfs:

I don't know what the error message means. I would appreciate being told what I am doing wrong!

rknichols 05-13-2014 12:48 PM

LBA is in 512-byte sectors. "fdisk -u" gives addresses in 512-byte sectors. (The "Blocks" column shows 1024-byte blocks.) So, your bad block is in sda5, as you first suspected.

(144701458-39070143)/8 = 13203914.375

Block 13203914 of the filesystem, sector offset 3 (the fourth 512-byte sector) within that 4K block.
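
The same arithmetic as a small Python sketch, for anyone repeating it with other numbers. This is an illustration only: `lba_to_fs_block` is a hypothetical helper, and the 4096-byte block size of sda5 is an assumption, since tune2fs was only run against sda1 in this thread.

```python
SECTOR_SIZE = 512           # bytes per LBA sector, from the fdisk output
BLOCK_SIZE = 4096           # ASSUMED block size of the filesystem on sda5
PARTITION_START = 39070143  # first sector of /dev/sda5, from fdisk
BAD_LBA = 144701458         # LBA_of_first_error from the self-test log


def lba_to_fs_block(lba, part_start, block_size=BLOCK_SIZE, sector_size=SECTOR_SIZE):
    """Return (filesystem block, sector offset within that block, counted from 0)."""
    sectors_per_block = block_size // sector_size
    # The partition start must be subtracted before dividing.
    return divmod(lba - part_start, sectors_per_block)


block, offset = lba_to_fs_block(BAD_LBA, PARTITION_START)
print(block, offset)
```

Dividing the raw LBA by 8 without subtracting the partition start gives 18087682, which is larger than the "Block count" of 4883752 that tune2fs reported for sda1. That is presumably why debugfs rejected it as an illegal block number.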

davcefai 05-13-2014 01:34 PM

Quote:

Block 13203914 of the filesystem, sector offset 3 (the fourth 512-byte sector) within that 4K block
Thanks for this. However, moving along, I get:
Code:

davcefai:/home/david# debugfs
debugfs 1.42.9 (4-Feb-2014)
debugfs:  open /dev/sda5
debugfs:  testb 13203914
Block 13203914 marked in use
debugfs:  icheck 13203914
Block  Inode number
13203914        <block not found>
debugfs:

Which rather puts a damper on the proceedings. icheck takes about half a minute to run, could it be timing out? I don't see how it can not find a block it has previously found with the testb command.

Could I trouble you a little longer?

Thanks.

rknichols 05-13-2014 04:08 PM

Quote:

Originally Posted by davcefai (Post 5170105)
Code:

davcefai:/home/david# debugfs
debugfs 1.42.9 (4-Feb-2014)
debugfs:  open /dev/sda5
debugfs:  testb 13203914
Block 13203914 marked in use
debugfs:  icheck 13203914
Block  Inode number
13203914        <block not found>
debugfs:


That means that the block is used by filesystem metadata, probably by some currently free inodes. The only way I know of for finding which one is to run
Code:

dumpe2fs /dev/sda5 | less
and page down through the listing until you see block numbers in that range, e.g.
Code:

Group 4: (Blocks 131072-163839)
  Block bitmap at 131072 (+0), Inode bitmap at 131073 (+1)
  Inode table at 131074-131584 (+2)
  19111 free blocks, 7493 free inodes, 88 directories
  Free blocks: 131597-131607, 132017-132020, 132057-132064, 132079, 132120, 132622, 137249, 137273, 137281-137364, 137657-137659, 137744, 137748-141311, 141313-142311, 142558, 146290-146327, 146589, 146649-146651, 146653-146705, 146745, 147384-147458, 147460-148648, 149383, 149971-154528, 154530-154921, 154926-154961, 154963-159743, 159747, 159751-159889, 159891-159903, 159905-160234, 160247, 160249-160260, 161033-163839
  Free inodes: 32717-32854, 32856-32936, 32938, 32962, 32966, 32982, 32988-33280, 33283, 33285-34542, 34544-34550, 34552-34554, 34556-34558, 34560-34682, 34687-34692, 35305-40880

Unfortunately, the program will probably die from an I/O error at that point, but hopefully you will be able to see the "Inode table at ..." line and can confirm that the bad sector is within that inode table. The inodes in that sector pretty much have to be free or else your e2fsck would have died with an I/O error, so it should be safe to zero them. First, to be absolutely certain you have the right sector run
Code:

hdparm --read-sector 144701458 /dev/sda
If you do get the expected I/O error from that, zero it by running
Code:

hdparm --yes-i-know-what-i-am-doing --write-sector 144701458 /dev/sda
That should make "smartctl -A /dev/sda" report "0" for the Current_Pending_Sector count, and the Reallocated_Sector_Ct will probably increase to "2". It would be best to run "e2fsck -f /dev/sda5" just to be sure you haven't stepped on something in use.
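
Paging through the dumpe2fs listing by eye is error-prone, so here is a hedged Python sketch of the check. It assumes the "Group N: (Blocks a-b)" and "Inode table at x-y" line formats shown in the excerpt above, and that each group's inode table is listed under its own group header. `classify_block` is a hypothetical helper, and the sample text is the Group 4 excerpt, not the group that holds the bad block.

```python
import re

# Excerpt in the format shown above (Group 4 is only an example group).
SAMPLE = """\
Group 4: (Blocks 131072-163839)
  Block bitmap at 131072 (+0), Inode bitmap at 131073 (+1)
  Inode table at 131074-131584 (+2)
"""

GROUP_RE = re.compile(r"^Group (\d+): \(Blocks (\d+)-(\d+)\)")
TABLE_RE = re.compile(r"Inode table at (\d+)-(\d+)")


def classify_block(dump_text, block):
    """Return (group, in_inode_table) for block, or None if no listed group holds it."""
    group = None
    for line in dump_text.splitlines():
        m = GROUP_RE.match(line)
        if m:
            first, last = int(m.group(2)), int(m.group(3))
            # Remember the group number only while we are inside the matching group.
            group = int(m.group(1)) if first <= block <= last else None
            continue
        m = TABLE_RE.search(line)
        if m and group is not None:
            return group, int(m.group(1)) <= block <= int(m.group(2))
    return None


print(classify_block(SAMPLE, 131100))  # a block inside the example inode table
```

To use it on the real filesystem, the dumpe2fs output would first have to be saved to a file and read in as text. If dumpe2fs dies with an I/O error partway through, the saved listing may stop before the group of interest, in which case the function returns None.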

You did say you had backups for this filesystem, right? ;)

davcefai 05-13-2014 04:30 PM

Quote:

You did say you had backups for this filesystem, right?
BackupPC, daily at 1500 :D


All times are GMT -5.