Harddisk failing

goranbr · 07-31-2014, 06:30 AM

I just recently got this message on the root console:

Code:

!$ WARNING: Your hard drive is failing
Device: /dev/sdc [SAT], FAILED SMART self-check. BACK UP DATA NOW!

I got really worried because this is my home PC and I only take rsync backups to a NAS.
And if I have taken backups from a faulty disk to my NAS, I may have overwritten good files there with bad files from my faulty disk, right?

Judging from the output below, can anyone tell if data has already gone missing and I have corrupted files? How will I know which files are corrupted in that case?

OR, is this a warning that I will lose data soon? Can the disk reallocate sectors to repair itself?

I have already ordered a new disk. What I am worried about is if I have already corrupted data on my current backup. This is what I have to go on so far....

Code:

# smartctl -a /dev/sdc
=== START OF INFORMATION SECTION ===
Model Family:     Hitachi/HGST Deskstar 7K4000
Device Model:     Hitachi HDS724040ALE640
Serial Number:    PK2311PAG4P4MM
LU WWN Device Id: 5 000cca 22bc220e0
Firmware Version: MJAOA3B0
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Jul 31 12:46:36 2014 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   137   137   054    Pre-fail  Offline      -       78
  3 Spin_Up_Time            0x0007   128   128   024    Pre-fail  Always       -       579 (Average 625)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       91
  5 Reallocated_Sector_Ct   0x0033   001   001   005    Pre-fail  Always   FAILING_NOW 1712
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   112   112   020    Pre-fail  Offline      -       38
  9 Power_On_Hours          0x0012   098   098   000    Old_age   Always       -       16801
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       91
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       787
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       787
194 Temperature_Celsius     0x0002   157   157   000    Old_age   Always       -       38 (Min/Max 23/44)
196 Reallocated_Event_Count 0x0032   001   001   000    Old_age   Always       -       2897
197 Current_Pending_Sector  0x0022   001   001   000    Old_age   Always       -       3760

Ser Olmy · 07-31-2014, 06:55 AM

Quote:

Originally Posted by goranbr

And if I have taken backups from a faulty disk to my NAS, I may have overwritten good files there with bad files from my faulty disk, right?

Fortunately, you're wrong.

The drive may be failing, but every time a bad sector is encountered, the drive will attempt to reallocate it to a spare sector. If this procedure succeeds, no data are lost. If the bad sector is in use and repeated attempts to read it fail with an ECC error, a read error will be returned to the operating system.

In other words, there's no way the drive will hand you bad data and pretend it's good. The chance of a corrupted sector randomly producing a valid ECC code is next to none.

Quote:

Originally Posted by goranbr

Code:

  5 Reallocated_Sector_Ct   0x0033   001   001   005    Pre-fail  Always   FAILING_NOW 1712

Code:

197 Current_Pending_Sector  0x0022   001   001   000    Old_age   Always       -       3760

1712 sectors have been successfully reallocated, and 3760 sectors are marked as bad and are awaiting reallocation. If some of those 3760 sectors are completely unreadable and contain data, you will get a read error if you try to read a file with data stored in such a sector. On the other hand, if you're able to back up your data without incident, the backup will contain only good data.

You should back up your system as soon as possible, replace the drive, and perform a full restore.
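
The two attributes quoted above can also be pulled out of the `smartctl -a` output programmatically. What follows is only a minimal sketch: the sample rows are copied from the output posted earlier in this thread, while the names `ROW`, `parse_attributes` and `failing` are made up for illustration and are not part of smartmontools. It assumes the ten-column attribute layout shown in that output.

```python
import re

# Sample rows copied from the smartctl output earlier in the thread.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   001   001   005    Pre-fail  Always   FAILING_NOW 1712
  9 Power_On_Hours          0x0012   098   098   000    Old_age   Always       -       16801
197 Current_Pending_Sector  0x0022   001   001   000    Old_age   Always       -       3760
"""

# One attribute row: ID, name, flag, VALUE, WORST, THRESH, type, updated,
# WHEN_FAILED, and the leading integer of RAW_VALUE.
ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+(?P<flag>0x[0-9a-fA-F]+)\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\d+)\s+"
    r"(?P<type>\S+)\s+(?P<updated>\S+)\s+(?P<when_failed>\S+)\s+(?P<raw>\d+)"
)


def parse_attributes(text):
    """Return {attribute id: details} for every row that matches ROW."""
    attrs = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            attrs[int(m["id"])] = {
                "name": m["name"],
                "value": int(m["value"]),
                "thresh": int(m["thresh"]),
                "when_failed": m["when_failed"],
                "raw": int(m["raw"]),
            }
    return attrs


def failing(attrs):
    """IDs of attributes that smartctl itself marks as FAILING_NOW."""
    return sorted(i for i, a in attrs.items() if a["when_failed"] == "FAILING_NOW")


if __name__ == "__main__":
    attrs = parse_attributes(SAMPLE)
    print("reallocated:", attrs[5]["raw"])
    print("pending:", attrs[197]["raw"])
    print("failing now:", failing(attrs))
```

Only attribute 5 is reported as failing here, because its normalized value (001) has dropped below its threshold (005). Attribute 197 has a threshold of 000, so its raw count of 3760 has to be read by a human.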

goranbr · 07-31-2014, 07:15 AM

You don't know how reassuring that was to hear... :-)

I will get my new drive today. But I have shut down my NAS and won't make any more backups until I have the new disk.

I think it is the summer heat that is destroying my disks. :-)

Anyway, thanks a lot for your input!

syg00 · 07-31-2014, 07:28 AM

In which case ... see the label "Did you find this post helpful?" - I suggest you help enhance @Ser Olmy's reputation by clicking "YES".

goranbr · 07-31-2014, 07:42 AM

Quote:

Originally Posted by syg00

In which case ... see the label "Did you find this post helpful?" - I suggest you help enhance @Ser Olmy's reputation by clicking "YES".

Of course, thanks for the tip! :-)

metaschima · 07-31-2014, 10:51 AM

I hadn't really considered the possibility that using rsync to make regular backups could in fact back up corrupt data. Possible solutions are to make incremental/differential backups, to make full backups to separate files, or to back up only after some checks have been run locally to make sure you're not backing up corrupt data.

goranbr · 07-31-2014, 05:24 PM

Quote:

Originally Posted by metaschima

I hadn't really considered the possibility that using rsync to make regular backups could in fact back up corrupt data. Possible solutions are to make incremental/differential backups, to make full backups to separate files, or to back up only after some checks have been run locally to make sure you're not backing up corrupt data.

Well, I interpreted the reply from "Ser Olmy" as meaning that rsync would at least report an error if it can't read a file properly from the source.
And if I don't get any errors, then at least that particular backup did not destroy any data.

However, I am still unsure what happens if rsync tries to back up a corrupt file (with data on sectors not readable at the time of backup).

Does rsync have any chance of detecting this in time to refrain from overwriting the target file?
That is, when rsync asks the OS for a file that it has chosen to transfer, will the OS check that the whole file is readable before it hands it over to rsync?
Or does the OS just hand rsync one sector at a time sequentially, and then say "Ooops, this sector was actually unreadable!"?

As for making separate backups, this is a home setup on a home budget, with 8TB of disk on my PC and 8TB on my NAS. So I have already stretched my budget. :-)
I could use incremental backups I guess, but I think that is a more complicated backup scheme for a home setting.

rknichols · 07-31-2014, 05:46 PM

rsync normally creates a temporary file at the destination and, after doing that successfully, renames it over the old version. If an error occurred, the old version should be safe.
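
The temporary-file-then-rename pattern described above can be sketched in a few lines of Python. This is not rsync's actual code, only an illustration of the same idea; `safe_copy` is a hypothetical name, and the sketch assumes the temporary file can be created in the same directory as the destination so that the final rename stays on one filesystem.

```python
import os
import tempfile


def safe_copy(src, dst, chunk_size=1024 * 1024):
    """Copy src to dst via a temporary file; dst is replaced only on success."""
    dst_dir = os.path.dirname(os.path.abspath(dst))
    fd, tmp = tempfile.mkstemp(dir=dst_dir, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as out, open(src, "rb") as inp:
            while True:
                # A sector the drive cannot read surfaces here as OSError.
                chunk = inp.read(chunk_size)
                if not chunk:
                    break
                out.write(chunk)
            out.flush()
            os.fsync(out.fileno())
        # Reached only if every read and write succeeded.
        os.replace(tmp, dst)
    except BaseException:
        # Any failure: remove the partial copy; the old dst is untouched.
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```

If the read fails halfway through the source file, the exception propagates, the partial temporary file is deleted, and the previous version at `dst` is still there. Note that the sketch does not preserve permissions or timestamps the way rsync can.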

metaschima · 07-31-2014, 05:55 PM

Just because a file is readable does NOT mean it is not corrupt. I've gotten corrupt files after a power outage. They were readable, but full of garbage. Not sure what is best in your particular situation, but consider methods to prevent corrupt files from overwriting good ones. For sure do NOT back up after power outages or SMART failures until you are sure the files are good. Maybe checksums can help, but user input may be needed. I think keeping at least two backups and alternating which one is overwritten is a minimal way to prevent this from happening.
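
One way the checksum idea could look, as a rough sketch only: record a SHA-256 digest and modification time for every file, then before the next backup flag files whose content changed although their modification time did not. The function names are hypothetical, and the rule itself is a heuristic assumption, not a guarantee: a program that rewrites a file and restores its timestamp would be flagged too, which is where the user input comes in.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1024 * 1024):
    """Hex SHA-256 digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root):
    """Map each relative path under root to (mtime in ns, SHA-256 digest)."""
    manifest = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root)
            manifest[rel] = (os.stat(path).st_mtime_ns, sha256_of(path))
    return manifest


def suspicious(old, new):
    """Paths whose content changed although the modification time did not."""
    return sorted(
        p for p, (mtime, digest) in new.items()
        if p in old and old[p][0] == mtime and old[p][1] != digest
    )
```

A manifest built after the last known-good backup would be saved somewhere (for example as JSON) and compared against a fresh one; anything returned by `suspicious` deserves a look before it is allowed to overwrite the copy on the NAS.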

syg00 · 07-31-2014, 11:39 PM

Quote:

Originally Posted by metaschima

I've gotten corrupt files after a power outage. They were readable, but full of garbage.

I'd suggest that you got corrupted files after the fsck after the power outage.
This is the elephant in the room - fsck is designed to fix filesystems, not necessarily the files in them.

So an earlier backup should be ok, but after a fsck on a "normal" filesystem that throws messages (like after an outage) I always toss the filesystem and restore in toto. If you were to use a filesystem that had checksumming (like btrfs) you could have reasonable confidence the data read is (always) good. I use RAID5 under btrfs so it can go find a good (internal) backup when it gets a CRC mismatch on data read.

goranbr · 08-01-2014, 07:59 AM

Yes, power outage is another problem which is even more disturbing....

And whether it is SMART reporting unreadable sectors or fsck "fixing" the file system, it is not exactly easy to figure out which files have been corrupted.

Is there any way to get this info in either situation that you know of?

rknichols · 08-01-2014, 09:55 AM

The Bad Block HOWTO shows how to identify the file (if any) associated with a detected bad block. Going through that procedure for more than a very small number of bad blocks is impractical. If your backup runs without encountering an I/O error, then it is safe to say that none of the files included in the backup are using any of the bad blocks.
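
The same reasoning works without running a backup: read every file from start to finish and note which ones return an error. A minimal sketch, assuming plain sequential reads are enough to hit every block a file uses; `find_unreadable` and its `opener` parameter are hypothetical names, and `opener` exists only so the function can be exercised without a failing disk.

```python
import errno
import os


def find_unreadable(root, opener=open, chunk_size=1024 * 1024):
    """Read every regular file under root; return (path, errno name) per failure."""
    bad = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if not os.path.isfile(path):
                continue  # skip FIFOs, sockets, broken symlinks
            try:
                with opener(path, "rb") as f:
                    while f.read(chunk_size):
                        pass  # discard the data; only errors matter
            except OSError as exc:
                bad.append((path, errno.errorcode.get(exc.errno, "?")))
    return sorted(bad)
```

On a disk with unreadable sectors, affected files would typically show up with `EIO`. Files that read cleanly can still contain garbage of the kind discussed earlier in the thread; this check only finds blocks the drive refuses to return.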