Based on the errors occurring at differing sectors and the age of the drive, I'd say it is due for replacement. If you post the output from "smartctl -A /dev/sdg" (wrapped in [CODE] ... [/CODE] tags, please, to preserve formatting), it will give a better picture of the drive's overall health.
At a minimum you will have to write zeros to the bad regions, which will cause the drive to reallocate the bad sectors to spares. If you continue using the drive, you will have to keep close watch on the bad sector counts. If bad sectors continue to develop, the drive will definitely need replacement.
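Something along these lines, as a rough sketch only (the device name and the LBA are placeholders; the dd write is destructive, so check the numbers against your own kernel log or self-test output first):
[CODE]
# Counters that matter for bad sectors
smartctl -A /dev/sdg | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'

# Overwrite ONE known-bad sector so the drive remaps it to a spare.
# 123456789 is a placeholder LBA; 512 is the logical sector size reported by smartctl.
# Whatever was stored in that sector is gone afterwards.
dd if=/dev/zero of=/dev/sdg bs=512 count=1 seek=123456789 conv=notrunc,fsync
[/CODE]
Something like "hdparm --write-sector 123456789 --yes-i-know-what-i-am-doing /dev/sdg" does the same job one sector at a time, if you prefer.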
Hard disks are commodity items - go get another one. You've done ok to get that long out of it. My data is more important than an old piece of hardware. You can play with working around shortcomings as suggested, but only if it doesn't compromise the data. I keep an old machine for this, never on my day-to-day system to avoid "finger-checks".
I've now managed to copy 164GB of data from the disk.
The cp command returned 115 error messages -
Quote:
cp: cannot stat 'slack_desktop_rdiff/home/alex/rdiff-backup-data/increments/.cache/mplab_ide/dev/v4.05/var/cnd/model/129/1-.2018-12-20T13:48:02Z.diff.gz': Input/output error
cp: cannot stat 'slack_desktop_rdiff/home/alex/rdiff-backup-data/increments/.cache/mplab_ide/dev/v4.05/var/cnd/model/129/2--.2018-12-20T13:48:02Z.diff.gz': Input/output error
cp: cannot stat 'slack_desktop_rdiff/home/alex/rdiff-backup-data/increments/.cache/mplab_ide/dev/v4.05/var/cnd/model/129/45-.2018-12-20T13:48:02Z.diff.gz': Input/output error
cp: cannot stat 'slack_desktop_rdiff/home/alex/rdiff-backup-data/increments/.cache/mplab_ide/dev/v4.05/var/cnd/model/129/46-.2018-12-20T13:48:02Z.diff.gz': Input/output error
cp: cannot stat 'slack_desktop_rdiff/home/alex/rdiff-backup-data/increments/.cache/mplab_ide/dev/v4.05/var/cnd/model/129/47-.2018-12-20T13:48:02Z.diff.gz': Input/output error
cp: cannot stat 'slack_desktop_rdiff/home/alex/rdiff-backup-data/increments/.cache/mplab_ide/dev/v4.05/var/index/s76/angular': Input/output error
cp: cannot access 'slack_desktop_rdiff/home/alex/rdiff-backup-data/increments/.cache/mplab_ide/dev/v4.05/var/index/s77/angular/3/1': Input/output error
cp: cannot stat 'slack_desktop_rdiff/home/alex/rdiff-backup-data/increments/.cache/mplab_ide/dev/v4.05/var/index/s77/cnd/1': Input/output error
cp: cannot access 'slack_desktop_rdiff/home/alex/rdiff-backup-data/increments/.cache/mplab_ide/dev/v4.05/var/index/s36/js/13/1': Input/output error
cp: cannot stat 'slack_desktop_rdiff/home/alex/rdiff-backup-data/increments/.cache/mplab_ide/dev/v4.05/var/index/s75/angular/3/1': Input/output error
cp: cannot stat 'slack_desktop_rdiff/home/alex/rdiff-backup-data/increments/.cache/mplab_ide/dev/v4.05/var/index/s75/cnd': Input/output error
cp: cannot stat 'slack_desktop_rdiff/home/alex/rdiff-backup-data/increments/.cache/mplab_ide/dev/v4.05/var/index/s75/css': Input/output error
cp: cannot access 'slack_desktop_rdiff/home/alex/rdiff-backup-data/increments/.mozilla.20180224/firefox/tta2kvtd.default/storage/default/https+++ir.ebaystatic.com/idb/12183338011.files': Input/output error
I take two backups -
rdiff-backup to this failed drive;
rsync to another drive, encrypting as I rsync. I then back up the encrypted data to the cloud. This disk appears to be error free, though it is a similar age to the failing drive.
As the failing disk is used solely for rdiff-backup, I'm not too concerned if I've lost rdiff-backup data.
I've tried smartctl, and depending on how I address the failing drive, smartctl returns this
=== START OF INFORMATION SECTION ===
Vendor: WD
Product: My Book 1110
Revision: 2003
User Capacity: 999,501,594,624 bytes [999 GB]
Logical block size: 512 bytes
Serial number: WCAV5F616488
Device type: disk
Local Time is: Fri Sep 20 07:36:16 2019 BST
SMART support is: Unavailable - device lacks SMART capability.
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Caviar Green
Device Model: WDC WD10EADS-11M2B3
Serial Number: WD-WCAV5F616488
LU WWN Device Id: 5 0014ee 204d9071c
Firmware Version: 80.00A80
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS (minor revision not indicated)
SATA Version is: SATA 2.6, 3.0 Gb/s
Local Time is: Fri Sep 20 07:36:35 2019 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
669 green: /home/alex $
I don't know why, when I address the drive as /dev/sdf, SMART support is Available, but as /dev/bristol, SMART support is Unavailable.
I've convinced myself it's the same drive, so I pulled the plug. The results are now
Smartctl open device: /dev/sdf failed: No such device
671 green: /home/alex $
So I proceeded to test /dev/sdf
Quote:
675 green: /home/alex $ sudo /usr/sbin/smartctl -t short /dev/sdf
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.4.186] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Fri Sep 20 07:50:53 2019
Use smartctl -X to abort test.
676 green: /home/alex $
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 20663 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
So I'm a bit confused, as smartctl seems to imply that the disk is error free.
You give no clue as to what "/dev/bristol" might be. Is that some decryption mapping of the underlying /dev/sdf device? smartctl needs the raw device. It can't reach through the encryption layer to find it.
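If you're not sure what "/dev/bristol" actually is, something like this should trace it back to the raw disk (guesses only, since we can't see your setup):
[CODE]
# If /dev/bristol is just a udev symlink, resolve it:
readlink -f /dev/bristol

# If it is a device-mapper (e.g. LUKS) mapping, walk up to the parent disk:
lsblk -s /dev/bristol
cryptsetup status bristol    # only meaningful if it is a LUKS mapping

# Then point smartctl at the raw disk; for a USB enclosure like the My Book,
# forcing the SAT pass-through usually gets the full ATA SMART data:
smartctl -a -d sat /dev/sdf
[/CODE]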
Note that an rdiff-backup archive is a fairly complex, and not terribly robust, database. Having various bits and pieces of it missing will cause history for those elements to be unavailable.
As for /dev/sdf: That shows 1455 bad sectors that will cause an I/O error when read. They don't show up in the test results simply because no long test has been run since the bad sectors developed. Running "smartctl -t long /dev/sdf" would almost certainly cause a failure to be logged. Rewriting those sectors would cause them to be reallocated to spare sectors, but a number that large is often a warning that the drive will continue to develop more bad sectors and could soon fail completely. Do not rely on the overall health statement from smartctl. That is generated by the firmware on the device, and bad sectors will not cause a health warning until the drive's supply of spare sectors is nearly exhausted. That is long past the point where the drive should have been replaced.
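A minimal sequence, assuming the drive is still showing up as /dev/sdf:
[CODE]
smartctl -t long /dev/sdf   # start the extended self-test; it runs inside the drive
smartctl -c /dev/sdf        # reports the recommended polling time for the long test
[/CODE]
Once it finishes, a read failure will show up in the drive's self-test log along with the LBA of the first bad sector.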
Not quite sure how to get the results, so I tried
Quote:
smartctl -l selftest /dev/sdg
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.4.186] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Interrupted (host reset) 90% 20707 -
# 2 Extended offline Interrupted (host reset) 90% 20700 -
# 3 Extended offline Interrupted (host reset) 90% 20698 -
# 4 Short offline Completed without error 00% 20663 -
Quote:
Originally Posted by rknichols
Note that an rdiff-backup archive is a fairly complex, and not terribly robust, database. Having various bits and pieces of it missing will cause history for those elements to be unavailable.
After transferring the data to a new drive and doing an "rdiff-backup --check-destination-dir" on the new drive, I did get some reasonable results back. Anyway, nothing can be trusted. So I will look, with urgency, to carry on as is, using a couple of USB drives plugged into my main machine, and then consider building a new machine to run a RAID backup solution. The RAID solution would be more of a long-term solution and might be too complex for my needs.
Quote:
... then consider building a new machine to run a RAID backup solution.
RAID is not a backup solution. It is for redundancy - say when a drive fails. If you issue "rm -rf" in a RAID environment your data is still gone.
You need a separate backup strategy. Always.
Quote:
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Interrupted (host reset) 90% 20707 -
# 2 Extended offline Interrupted (host reset) 90% 20700 -
# 3 Extended offline Interrupted (host reset) 90% 20698 -
# 4 Short offline Completed without error 00% 20663 -
Those tests got interrupted right at the start. You have to let the test run to completion, without powering-off or rebooting.
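One way to keep an eye on it without disturbing the drive (device name is an example; the "host reset" interruptions could equally be a reboot, a power-off, or the USB enclosure being reset or spun down):
[CODE]
# The test runs inside the drive; poll its status every 10 minutes and leave
# the machine and the enclosure alone until the remaining percentage hits 0%.
watch -n 600 'smartctl -a /dev/sdg | grep -A 2 "Self-test execution status"'
[/CODE]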
Quote:
After transferring the data to a new drive and doing an "rdiff-backup --check-destination-dir" on the new drive, I did get some reasonable results back.
All "--check-destination-dir" does is check whether the most recent backup session failed or was interrupted before completion, and rolls back that session if that was the case.
One of the shortcomings of rdiff-backup is that it does not provide any good way to test the overall integrity of the archive. The only way to do that is to run with the "--verify-at-time" option for every session in the backup history. You can run several of those sessions in parallel in about the same time as a single session (there is a lot of commonality of disk access, so the kernel's buffer cache is a big win here), but it still takes a long time.
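A rough sketch of that loop (the repository path is an example, and it assumes the classic rdiff-backup 1.x options discussed above):
[CODE]
REPO=/mnt/newdrive/slack_desktop_rdiff    # example path to the copied archive

# Roll back any half-finished session, then verify the current mirror
rdiff-backup --check-destination-dir "$REPO"
rdiff-backup --verify "$REPO"

# Then verify every increment; --parsable-output gives one epoch time per line
rdiff-backup --parsable-output --list-increments "$REPO" | awk '{print $1}' |
while read t; do
    echo "=== verifying session at $t ==="
    rdiff-backup --verify-at-time "$t" "$REPO"
done
[/CODE]
As noted above, several of those --verify-at-time runs can be started in parallel, since they mostly read the same blocks.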
Quote:
... then consider building a new machine to run a RAID backup solution.
RAID is not a backup solution. It is for redundancy - say when a drive fails. If you issue "rm -rf" in a RAID environment your data is still gone.
You need a separate backup strategy. Always.
My current backup strategy is
to rsync to an external drive called "southsea", encrypting as I go, then transfer the files from that external drive, using s3cmd, to the Amazon cloud. I don't back up all files, only those that I would miss if something went horribly wrong (a rough sketch of this leg follows below);
to rdiff-backup to an external drive called "bristol". This I use if I've changed a file and some days later want to back out the change to a specific date. Might be a few changes back.
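Roughly, the "southsea" leg looks something like the sketch below. This is simplified, with gocryptfs reverse mode standing in for the actual "encrypt as I rsync" step, and the mount points and bucket name are placeholders:
[CODE]
# One-time setup: create a reverse-mode gocryptfs config alongside the plaintext tree
gocryptfs -init -reverse /home/alex

# Mount an on-the-fly encrypted view of /home/alex (nothing extra stored on disk)
mkdir -p /tmp/cipher_view
gocryptfs -reverse /home/alex /tmp/cipher_view

# 1. rsync the encrypted view to the external drive
rsync -a --delete /tmp/cipher_view/ /mnt/southsea/backup/

# 2. push the encrypted copy to S3; s3cmd sync compares size/md5 and skips unchanged files
s3cmd sync /mnt/southsea/backup/ s3://example-backup-bucket/backup/

fusermount -u /tmp/cipher_view
[/CODE]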
It was "bristol" that failed the other day so no great loss. I will commission a new drive and start the rdiff-backup from day 1. I would lose the history of any changes I've made to files, usually scripts/programs in the past, at the moment that doesn't bother me.
If "southsea" had failed I will commission a new drive and start the rsync process to repopulate the drive. If I've got the S3cmd rules set correctly it would only transfer the changed files to Amazon as most of the files will have the same date/time stamp and md5sum.
If a hard drive on my production machine(s) failed I would get the files back from either "bristol", "southsea" or Amazon whatever was the most appropriate. i.e. hard drive failure or building burning down.
I've reread the Raid documentation and now consider it not to be more appropriate for my situation.
I'm going to put "bristol" and "southsea" out to pasture both are of a similar age, install new drives and see if the strategy I've described above works.