LinuxQuestions.org


goranbr 07-31-2014 06:30 AM

Harddisk failing - What measures to take
 
I just recently got this message on the root console:

Code:

WARNING: Your hard drive is failing
Device: /dev/sdc [SAT], FAILED SMART self-check. BACK UP DATA NOW!

I got really worried because this is my home PC and I only take rsync backups to a NAS.
And if I have taken backups from a faulty disk to my NAS, I may have overwritten good files there with bad files from my faulty disk, right?

Judging from the output below, can anyone tell whether data has already gone missing and I have corrupted files? How will I know which files are corrupted in that case?

OR, is this a warning that I will lose data soon? Can the disk reallocate sectors to repair itself?

I have already ordered a new disk. What I am worried about is if I have already corrupted data on my current backup. This is what I have to go on so far....

Code:

# smartctl -a /dev/sdc
=== START OF INFORMATION SECTION ===
Model Family:    Hitachi/HGST Deskstar 7K4000
Device Model:    Hitachi HDS724040ALE640
Serial Number:    PK2311PAG4P4MM
LU WWN Device Id: 5 000cca 22bc220e0
Firmware Version: MJAOA3B0
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:    512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:  ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Jul 31 12:46:36 2014 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate    0x000b  100  100  016    Pre-fail  Always      -      0
  2 Throughput_Performance  0x0005  137  137  054    Pre-fail  Offline      -      78
  3 Spin_Up_Time            0x0007  128  128  024    Pre-fail  Always      -      579 (Average 625)
  4 Start_Stop_Count        0x0012  100  100  000    Old_age  Always      -      91
  5 Reallocated_Sector_Ct  0x0033  001  001  005    Pre-fail  Always  FAILING_NOW 1712
  7 Seek_Error_Rate        0x000b  100  100  067    Pre-fail  Always      -      0
  8 Seek_Time_Performance  0x0005  112  112  020    Pre-fail  Offline      -      38
  9 Power_On_Hours          0x0012  098  098  000    Old_age  Always      -      16801
 10 Spin_Retry_Count        0x0013  100  100  060    Pre-fail  Always      -      0
 12 Power_Cycle_Count      0x0032  100  100  000    Old_age  Always      -      91
192 Power-Off_Retract_Count 0x0032  100  100  000    Old_age  Always      -      787
193 Load_Cycle_Count        0x0012  100  100  000    Old_age  Always      -      787
194 Temperature_Celsius    0x0002  157  157  000    Old_age  Always      -      38 (Min/Max 23/44)
196 Reallocated_Event_Count 0x0032  001  001  000    Old_age  Always      -      2897
197 Current_Pending_Sector  0x0022  001  001  000    Old_age  Always      -      3760


Ser Olmy 07-31-2014 06:55 AM

Quote:

Originally Posted by goranbr (Post 5212455)
And if I have taken backups from a faulty disk to my NAS I may have overridden good files there with bad files from my faulty disk , right?

Fortunately, you're wrong.

The drive may be failing, but every time a bad sector is encountered, the drive will attempt to reallocate it to a spare sector. If this procedure succeeds, no data are lost. If the bad sector is in use and repeated attempts to read it fail with an ECC error, a read error will be returned to the operating system.

In other words, there's no way the drive will hand you bad data and pretend it's good. The chance of a corrupted sector randomly producing a valid ECC code is next to none.

Quote:

Originally Posted by goranbr (Post 5212455)
Code:

  5 Reallocated_Sector_Ct  0x0033  001  001  005    Pre-fail  Always  FAILING_NOW 1712
Code:

197 Current_Pending_Sector  0x0022  001  001  000    Old_age  Always      -      3760

1712 sectors have been successfully reallocated, and 3760 sectors are marked as bad and are awaiting reallocation. If some of those 3760 sectors are completely unreadable and contain data, you will get a read error if you try to read a file with data stored in such a sector. On the other hand, if you're able to back up your data without incident, the backup will contain only good data.

You should back up your system as soon as possible, replace the drive, and perform a full restore.
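
If it helps, here's a rough sketch of the kind of backup run I mean (the source and NAS paths below are placeholders, adjust them for your setup). Capturing rsync's errors and checking its exit status tells you whether every file could actually be read:

Code:

# /data is the source and /mnt/nas/backup the mounted NAS share (both placeholders).
# --archive preserves ownership/permissions/timestamps, --itemize-changes lists what
# was transferred, and stderr is saved so any read/IO errors are kept for review.
rsync --archive --itemize-changes /data/ /mnt/nas/backup/ 2>/tmp/rsync-errors.log
echo "rsync exit status: $?"    # 0 = clean run; 23 means a partial transfer due to errors
cat /tmp/rsync-errors.log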

goranbr 07-31-2014 07:15 AM

You don't know how reassuring that was to hear... :-)

I will get my new drive today. But I have shut down my NAS and won't make any more backups until I have the new disk.

I think it is the summer heat that is destroying my disks. :-)

Anyway, thanks a lot for your input!

syg00 07-31-2014 07:28 AM

In which case ... see the label "Did you find this post helpful?" - I suggest you help enhance @Ser Olmy's reputation by clicking "YES"

goranbr 07-31-2014 07:42 AM

Quote:

Originally Posted by syg00 (Post 5212481)
In which case ... see the label "Did you find this post helpful?" - I suggest you help enhance @Ser Olmy's reputation by clicking "YES"

Of course, thanks for the tip! :-)

metaschima 07-31-2014 10:51 AM

I haven't really considered the possibility that using rsync to make backups regularly could in fact back up corrupt data. Possible solutions are to make incremental/differential backups, to make full backups to separate files, or to back up only after some checks are run locally to make sure you're not backing up corrupt data.
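
As a rough sketch of the incremental approach with rsync (all paths here are placeholders): --link-dest hard-links unchanged files against the previous snapshot, so each run creates a separate, dated backup without using much extra space, and a corrupt file can only end up in the newest snapshot:

Code:

#!/bin/bash
# Sketch only: dated snapshots on the NAS, hard-linked against the previous one.
SRC=/data/                   # source directory (placeholder)
DEST=/mnt/nas/backups        # backup root on the NAS (placeholder)
TODAY=$(date +%Y-%m-%d)

rsync --archive --link-dest="$DEST/latest" "$SRC" "$DEST/$TODAY"

# Only advance the "latest" pointer if rsync finished without errors.
if [ $? -eq 0 ]; then
    ln -sfn "$DEST/$TODAY" "$DEST/latest"
fi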

goranbr 07-31-2014 05:24 PM

Quote:

Originally Posted by metaschima (Post 5212576)
I haven't really considered the possibility that using rsync to make backups regularly could in fact back up corrupt data. Possible solutions are to make incremental/differential backups, to make full backups to separate files, or to back up only after some checks are run locally to make sure you're not backing up corrupt data.

Well, I interpreted the reply from "Ser Olmy" as meaning that rsync would at least report an error if it can't read a file properly from the source.
And if I don't get any errors, then at least that particular backup did not destroy any data.

However, I am still unsure what happens if rsync tries to back up a corrupt file (with data on sectors not readable at the time of backup).

Does rsync have any chance of detecting this in time to refrain from overwriting the target file?
That is, when rsync asks the OS for a file that it has chosen to transfer, will the OS check that the whole file is readable before it hands it over to rsync?
Or does the OS just hand rsync one sector at a time sequentially, and then say "Oops, this sector was actually unreadable!"?

As for making separate backups, this is a home setup on a home budget, with 8TB of disk on my PC and 8TB on my NAS. So I have already stretched my budget. :-)
I could use incremental backups, I guess, but that's a more complicated backup scheme for a home setting, I think.

rknichols 07-31-2014 05:46 PM

rsync normally creates a temporary file at the destination and, after doing that successfully, renames it over the old version. If an error occurred, the old version should be safe.
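
A couple of options are relevant to that behaviour (destination paths below are just examples): the default write-to-temp-then-rename is what protects the old copy, --inplace turns it off, and --backup keeps the previous version of anything that gets replaced:

Code:

# Default behaviour: rsync writes incoming data to a temporary dot-file in the
# destination directory and renames it over the old version only on success.
rsync -a /data/ /mnt/nas/backup/

# --inplace overwrites the existing destination file directly -- best avoided
# when the source disk may be failing.
#rsync -a --inplace /data/ /mnt/nas/backup/

# --backup keeps the old version of every replaced file (here with a .old suffix).
rsync -a --backup --suffix=.old /data/ /mnt/nas/backup/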

metaschima 07-31-2014 05:55 PM

Just because a file is readable does NOT mean it is not corrupt. I've gotten corrupt files after a power outage. They were readable, but full of garbage. Not sure what is best in your particular situation, but consider methods to prevent corrupt files from overwriting good ones. For sure do NOT back up after power outages or SMART failures until you are sure the files are good. Maybe checksums can help, but user input may be needed. I think at least keeping two backups and alternating which one is overwritten is a minimal way to prevent this from happening.
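
As a minimal sketch of the checksum idea (file locations are placeholders): record checksums while the data is known to be good, then verify against that list before letting a backup overwrite anything:

Code:

# Record checksums of everything under /data while it is known to be good.
find /data -type f -exec sha256sum {} + > /root/data.sha256

# After a SMART warning or power outage, verify before running the backup.
# Any file reported as FAILED should not be allowed to overwrite the backup copy.
sha256sum --check --quiet /root/data.sha256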

syg00 07-31-2014 11:39 PM

Quote:

Originally Posted by metaschima (Post 5212796)
I've gotten corrupt files after a power outage. They were readable, but full of garbage.

I'd suggest that you got corrupted files after the fsck after the power outage.
This is the elephant in the room - fsck is designed to fix the filesystem, not necessarily the files in it.

So an earlier backup should be ok, but after an fsck on a "normal" filesystem that throws messages (like after an outage), I always toss the filesystem and restore in toto. If you were to use a filesystem that has checksumming (like btrfs), you could have reasonable confidence the data read is (always) good. I use RAID5 under btrfs so it can go find a good (internal) copy when it gets a CRC mismatch on a data read.
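
For reference, this is roughly how you check a btrfs volume for checksum problems (the mount point is a placeholder):

Code:

# Read every block and verify its checksum; with a redundant profile (RAID1/5/6)
# scrub repairs a bad copy from a good one automatically.
btrfs scrub start /mnt/data
btrfs scrub status /mnt/data

# Per-device counters of read, write and corruption errors seen so far.
btrfs device stats /mnt/data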

goranbr 08-01-2014 07:59 AM

Yes, power outage is another problem which is even more disturbing....

And whether it is SMART reporting unreadable sectors or fsck "fixing" the file system, it is not exactly easy to figure out which files have been corrupted.

Is there any way to get this info in either situation that you know of?

rknichols 08-01-2014 09:55 AM

The Bad Block HOWTO shows how to identify the file (if any) associated with a detected bad block. Going through that procedure for more than a very small number of bad blocks is impractical. If your backup runs without encountering an I/O error, then it is safe to say that none of the files included in the backup are using any of the bad blocks.
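
The gist of that procedure for an ext2/3/4 filesystem, with made-up device names and block/inode numbers for illustration:

Code:

# 1. Run a long self-test, then read back the LBA of the first error (if any).
smartctl -t long /dev/sdc
smartctl -l selftest /dev/sdc     # reports "LBA_of_first_error"

# 2. Convert that LBA to a filesystem block number:
#    block = (LBA - partition_start_sector) * 512 / fs_block_size
#    (partition start from "fdisk -l", block size from "tune2fs -l /dev/sdc1")

# 3. Ask the filesystem which inode owns that block, then which file owns the inode.
debugfs -R "icheck 123456789" /dev/sdc1   # block number is illustrative
debugfs -R "ncheck 1234567" /dev/sdc1     # inode number from the icheck output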

