[SOLVED] Partition Errors and Remounts Read-Only when Accessing Specific File

derekpock · 05-11-2016, 05:03 PM

I have a pretty basic system running Ubuntu 16.04, 1 HDD, running a few partitions:

Code:

sda1 - EXT4 - 100G   - /
sda2 - EXT4 - 723.5G - /home
sda3 - NTFS - 100G   - (windows)
sda5 - SWAP - 8G

Whenever I try to access any of 3-4 files in one specific directory on the `/home` partition (`/home/path/to/broken/folder`), the partition reports errors and remounts read-only. `dmesg` shows the following errors:

Code:

EXT4-fs error (device sda2): ext4_ext_check_inode:497: inode #1415: comm rm: pblk 0 bad header/extent: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
Aborting journal on device sda2-8.
EXT4-fs (sda2): Remounting filesystem read-only
EXT4-fs error (device sda2): ext4_ext_check_inode:497: inode #1417: comm rm: pblk 0 bad header/extent: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
EXT4-fs error (device sda2): ext4_ext_check_inode:497: inode #1416: comm rm: pblk 0 bad header/extent: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)

So I understand what is going on...some bad block is causing an error and is remounting the drive read-only to prevent further corruption. I know it is these specific files because I can undo the error by

1. Logging in as root
2. Running `sync`
3. Stopping `lightdm` (and all sub-processes)
4. Stopping every process that still has files open on `/home`, found with `lsof | grep /home`
5. Unmounting `/home`
6. Running `fsck /home` (fixing the errors)
7. Remounting `/home`

Everything is fine again, read and write, *until I try to access the same files again*, then this entire process is repeated to fix it again.
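
The steps above can be sketched as a small script. This is only a rough outline under assumptions not stated in the post: that lightdm is managed by systemd, that `/home` has an `/etc/fstab` entry, and that a forced check (`-f`) is wanted. The `run` and `repair_home` names are made up for illustration, and the commands are only printed unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch of the manual recovery steps; run as root from a text console.
# DRY_RUN defaults to 1, so the commands are printed rather than executed.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

repair_home() {
    run sync
    run systemctl stop lightdm   # assumption: systemd manages the display manager
    run lsof /home               # lists what still holds files open; stop those by hand
    run umount /home
    run fsck -f /home            # /home is resolved to its device through /etc/fstab
    run mount /home
}
```

Calling `repair_home` with the default settings just prints the six commands in order, which makes it easy to review them before running anything for real.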

I've tried to access the files by running `ls /home/path/to/broken/folder` and `rm -r /home/path/to/broken/folder`, so it seems any kind of operation on that part of the drive triggers the error and throws the partition into read-only again.

I honestly don't care about the files, I just want them gone. I am willing to remove the entire `/home/path/to/broken/folder` folder, but every time I try this, it fails and the partition goes read-only.

I ran `badblocks -v /dev/sda2` on my hard drive, but it came out clean, no bad blocks. Any help would still be greatly appreciated.

Here is some information on the problem inode.

Code:

$ debugfs -R 'stat <1415>' /dev/sda2
debugfs 1.42.13 (17-May-2015)
Inode: 1415   Type: regular    Mode:  0644   Flags:  0x80000
Generation: 0    Version: 0x00000000
User:     0   Group:     0   Size: 0
File ACL: 0    Directory ACL: 0
Links: 1   Blockcount: 0
Fragment:  Address: 0    Number: 0    Size: 0
ctime: 0x5639ad86 -- Wed Nov  4 01:02:30 2015
atime: 0x5639ad86 -- Wed Nov  4 01:02:30 2015
mtime: 0x5639ad86 -- Wed Nov  4 01:02:30 2015
Size of extra inode fields: 0
EXTENTS:

Now I looked at this myself and compared it to what I suspect to be a non-corrupted inode:

Code:

$ debugfs -R 'stat <1410>' /dev/sda2
debugfs 1.42.13 (17-May-2015)
Inode: 1410   Type: regular    Mode:  0644   Flags:  0x80000
Generation: 0    Version: 0x00000000
User:     0   Group:     0   Size: 996
File ACL: 0    Directory ACL: 0
Links: 1   Blockcount: 0
Fragment:  Address: 0    Number: 0    Size: 0
ctime: 0x5639ad31 -- Wed Nov  4 01:01:05 2015
atime: 0x5639ad31 -- Wed Nov  4 01:01:05 2015
mtime: 0x5639ad31 -- Wed Nov  4 01:01:05 2015
Size of extra inode fields: 0
EXTENTS:
(0):46679378

The key differences, as far as I can tell, are the `Size` field and the `EXTENTS` list. I looked at other non-corrupted inodes and they display something similar to 1410, with a non-zero size and an extent.

Bad header/extent makes sense here...it has no extent...
How can I fix this without re-formatting the entire `/home` drive?

I really feel like I've handed this question to someone smarter than me on a silver platter, I just don't know what the meal (answer) is!

rknichols · 05-11-2016, 10:09 PM

Quote:

Originally Posted by derekpock

6. Running `fsck /home` (fixing the errors)

I trust you meant to say "fsck /dev/sda2" there, since "fsck /home" would never work.

What errors did fsck find and correct?

Quote:

Here is some information on the problem inode.

Code:

$ debugfs -R 'stat <1415>' /dev/sda2
debugfs 1.42.13 (17-May-2015)
Inode: 1415   Type: regular    Mode:  0644   Flags:  0x80000
Generation: 0    Version: 0x00000000
User:     0   Group:     0   Size: 0
File ACL: 0    Directory ACL: 0
Links: 1   Blockcount: 0
Fragment:  Address: 0    Number: 0    Size: 0
ctime: 0x5639ad86 -- Wed Nov  4 01:02:30 2015
atime: 0x5639ad86 -- Wed Nov  4 01:02:30 2015
mtime: 0x5639ad86 -- Wed Nov  4 01:02:30 2015
Size of extra inode fields: 0
EXTENTS:

That is all completely normal for a zero-length file. Nothing in the dmesg output you posted indicates a hardware error. That should have shown up as an "ata[n]" error prior to the first EXT4-fs error.

Quote:

How can I fix this without re-formatting the entire `/home` drive?

One way would be to use "debugfs -w /dev/sda2" and use its clri command to zero out the affected inodes. You would then need to run "fsck -f /dev/sda2" to clean up the resulting filesystem inconsistencies.
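
As a rough, non-interactive sketch of that suggestion: the helper below (the `clri_requests` name and the `/tmp/clri.cmds` path are made up for illustration) only prints the debugfs requests, which can then be fed to debugfs through its `-f` command-file option. The filesystem is assumed to be unmounted.

```shell
# Print one debugfs "clri" request per inode number given as an argument.
clri_requests() {
    for ino in "$@"; do
        printf 'clri <%s>\n' "$ino"
    done
}

# Hypothetical usage, as root with the filesystem unmounted:
#   clri_requests 1415 1416 1417 > /tmp/clri.cmds
#   debugfs -w -f /tmp/clri.cmds /dev/sda2
#   fsck -f /dev/sda2
```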

derekpock · 05-12-2016, 12:38 PM

First (though this may not hold for older versions of fsck), `fsck /home` is the same as `fsck /dev/sda2` as long as `/home` is listed in "/etc/fstab".

The only error fsck came up with was "Data contains a file system with errors, check forced."
Then it goes through the five passes and finishes. Any run after that without accessing the problem files will say the drive is clean.

Now to the best part - I did what you suggested:

Code:

debugfs -w /dev/sda2
:clri <1415>
:clri <1416>
:clri <1417>
:q
fsck -y /dev/sda2

...and it worked! The bad inodes are gone and my system is fixed! Thanks so much!
For anybody else having this issue, I found my bad inodes (1415-1417) by running `find` on the bad mounted partition and then reading `dmesg` for the errors on the bad inodes.
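
A minimal sketch of that last step, pulling the inode numbers out of the kernel log. The `bad_inodes` name is made up, and the `find` invocation in the usage comment is an assumption (`-ls` is used so that every inode is actually read).

```shell
# Read kernel log text on stdin and print the unique inode numbers
# named in "EXT4-fs error" lines, one per line, in numeric order.
bad_inodes() {
    grep 'EXT4-fs error' | sed -n 's/.*inode #\([0-9][0-9]*\).*/\1/p' | sort -n -u
}

# Hypothetical usage, as root:
#   find /home -xdev -ls > /dev/null 2>&1
#   dmesg | bad_inodes
```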

rknichols · 05-12-2016, 08:37 PM

Quote:

Originally Posted by derekpock

First (though this may not hold for older versions of fsck), `fsck /home` is the same as `fsck /dev/sda2` as long as `/home` is listed in "/etc/fstab".

Apparently it's been that way for quite a while. I just never knew about it.

Quote:

The only error fsck came up with was "Data contains a file system with errors, check forced."
Then it goes through the five passes and finishes. Any run after that without accessing the problem files will say the drive is clean.

Unless the flags in the super block indicate the filesystem was not cleanly unmounted, you need to use the "-f" option to make fsck actually do anything.
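
For what it's worth, the state recorded in the super block can be read with `dumpe2fs -h`. A minimal sketch (the `fs_state` helper name is made up for illustration):

```shell
# Read `dumpe2fs -h` output on stdin and print the superblock's "Filesystem state" value.
fs_state() {
    sed -n 's/^Filesystem state:[[:space:]]*//p'
}

# Hypothetical usage, as root:
#   dumpe2fs -h /dev/sda2 2>/dev/null | fs_state
#   fsck -f /dev/sda2    # -f forces a full check even when the state reads "clean"
```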

Quote:

Now to the best part - I did what you suggested:

Code:

debugfs -w /dev/sda2
:clri <1415>
:clri <1416>
:clri <1417>
:q
fsck -y /dev/sda2

...and it worked! The bad inodes are gone and my system is fixed!

A bit of a shame to lose that example of an error condition that fsck.ext4 fails to detect, really. The authors might have been interested to know just what was wrong. Collecting the data needed for the bug report would have been a bit of a problem, though.

Glad to hear it all worked out.

derekpock · 05-12-2016, 10:56 PM

I believe I might have tried the "-f" option before, it may have fixed some errors, but the issue still didn't resolve. I would have been grateful to send someone a disk image of the partition, but the machine it is on was high-use, and I needed a fix as soon as possible. Thanks again for the help!

derekpock · 05-13-2016, 12:18 AM

rknichols, I've experienced the same issue again with another file I didn't find before. Bad inodes again, here is the output of `fsck -fy /home`:

Code:

e2fsck 1.42.13 (17-May-2015)
fsck from util-linux 2.27.1
[/sbin/fsck.ext4 (1) -- /home] fsck.ext4 -fy /dev/sda2 
Data: recovering journal
Pass 1: Checking inodes, blocks, and sizes
Deleted inode 19136782 has zero dtime.  Fix? yes

Inodes that were part of a corrupted orphan linked list found.  Fix? yes

Inode 19137402 was part of the orphaned inode list.  FIXED.
Inode 19137647 was part of the orphaned inode list.  FIXED.
Inode 19137648 was part of the orphaned inode list.  FIXED.
Inode 19137907 was part of the orphaned inode list.  FIXED.
Inode 19138044 was part of the orphaned inode list.  FIXED.
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences:  -(76579640--76579644) -(76579787--76579796) -(78687758--78687762) -(78933505--78933509)
Fix? yes

Free blocks count wrong for group #2337 (4783, counted=4798).
Fix? yes

Free blocks count wrong for group #2401 (870, counted=875).
Fix? yes

Free blocks count wrong for group #2408 (1420, counted=1425).
Fix? yes

Inode bitmap differences:  -19136782 -19137402 -(19137647--19137648) -19137907 -19138044
Fix? yes

Free inodes count wrong for group #2336 (6631, counted=6637).
Fix? yes


Data: ***** FILE SYSTEM WAS MODIFIED *****
Data: 472231/47423488 files (0.5% non-contiguous), 122481385/189664000 blocks

Here's another interesting problem. I'm running into these files by running baobab (disk utilization tool) in the /home directory. I tried doing it again to see if the fsck output is different, and it is! Here is another output from fsck after trying baobab again, running into the same troublesome file:

Code:

fsck from util-linux 2.27.1
[/sbin/fsck.ext4 (1) -- /home] fsck.ext4 -fy /dev/sda2 
Data: recovering journal
Pass 1: Checking inodes, blocks, and sizes
Inodes that were part of a corrupted orphan linked list found.  Fix? yes

Inode 19136782 was part of the orphaned inode list.  FIXED.
Inode 19137402 was part of the orphaned inode list.  FIXED.
Inode 19137647 was part of the orphaned inode list.  FIXED.
Inode 19137648 was part of the orphaned inode list.  FIXED.
Deleted inode 19137895 has zero dtime.  Fix? yes

Inode 19137907 was part of the orphaned inode list.  FIXED.
Inode 19138044 was part of the orphaned inode list.  FIXED.
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences:  -(76579640--76579644) -(76579781--76579785) -(76579787--76579791) -(78758406--78758410) -(79299589--79299593) -(79506945--79506949)
Fix? yes

Free blocks count wrong for group #2337 (4784, counted=4799).
Fix? yes

Free blocks count wrong for group #2403 (2028, counted=2033).
Fix? yes

Free blocks count wrong for group #2420 (1668, counted=1673).
Fix? yes

Free blocks count wrong for group #2426 (1406, counted=1411).
Fix? yes

Inode bitmap differences:  -19136782 -19137402 -(19137647--19137648) -19137895 -19137907 -19138044
Fix? yes

Free inodes count wrong for group #2336 (6630, counted=6637).
Fix? yes


Data: ***** FILE SYSTEM WAS MODIFIED *****
Data: 472231/47423488 files (0.5% non-contiguous), 122481385/189664000 blocks

What is going on here, and is there any way I can reliably get you some info / maybe a copy of my disc for you to look at?

rknichols · 05-13-2016, 09:11 AM

Now, fsck is finding and correcting errors. That is quite different from the previous cases, where fsck just cleared the error flag in the super block and didn't find anything else wrong.

Did the filesystem spontaneously go read-only again? That "recovering journal" message indicates that the filesystem was still dirty. Assuming that the filesystem had always previously been unmounted properly, the most likely cause for repeated corruption like this would be hardware issues. Have you run an overnight memory test on this system recently? There don't seem to be any reported errors from the disk drive, but a "smartctl -t long" might be appropriate.
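
A rough sketch of running that test and reading back the result. It assumes the whole disk is `/dev/sda`; the `latest_selftest` helper name is made up for illustration.

```shell
# Read `smartctl -l selftest` output on stdin and print the most recent entry ("# 1").
latest_selftest() {
    grep -E '^# *1 +' | head -n 1
}

# Hypothetical usage, as root:
#   smartctl -t long /dev/sda                         # start the extended self-test
#   smartctl -l selftest /dev/sda | latest_selftest   # check once the test has finished
```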

Beyond that, this is way above my pay scale. I'm really not familiar with the internals of ext4. Even if a compressed QCOW2 e2image file of the metadata (see the manpage for e2image) were small enough to send to me, I doubt I could tell anything from it beyond what fsck already reported.
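
For anyone who does want to capture such an image, a minimal sketch based on the example in the e2image manpage. The `save_metadata` name and the output file name are made up, and the commands are only printed unless `DRY_RUN=0` is set.

```shell
# Sketch: save the filesystem metadata as a QCOW2 image, then compress it.
# Prints the commands by default; set DRY_RUN=0 to run them (as root, filesystem unmounted).
# Paths containing spaces are not handled.
save_metadata() {
    dev=$1
    out=$2
    for cmd in "e2image -Q $dev $out" "bzip2 -z $out"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "$cmd"
        else
            $cmd
        fi
    done
}

# Hypothetical usage:
#   save_metadata /dev/sda2 sda2.qcow2
```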

derekpock · 05-13-2016, 05:24 PM

Yes, the filesystem went read-only. I disabled "errors=remount-ro" for the time being, so I can continue working on the drive even when it complains about the bad inodes. I am running "smartctl -t long /dev/sda2" now.

Sounds good, though, I'll see what I can do. I may end up having to reformat the drive or even get a new one. I'll post the results of smartctl here when it is finished.

derekpock · 05-13-2016, 07:32 PM

Here are the results from smartctl:

Code:

smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.4.0-21-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.14 (AF)
Device Model:     ST1000DM003-9YN162
Serial Number:    W1D1RS8X
LU WWN Device Id: 5 000c50 05e2eb097
Firmware Version: CC62
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Fri May 13 19:30:07 2016 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(  584) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 118) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.
SCT capabilities: 	       (0x3085)	SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   117   099   006    Pre-fail  Always       -       139838856
  3 Spin_Up_Time            0x0003   097   097   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   099   099   020    Old_age   Always       -       1140
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   083   060   030    Pre-fail  Always       -       4499716643
  9 Power_On_Hours          0x0032   085   085   000    Old_age   Always       -       13896
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   099   099   020    Old_age   Always       -       1136
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       1 1 1
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   065   061   045    Old_age   Always       -       35 (Min/Max 26/36)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       149
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       1164
194 Temperature_Celsius     0x0022   035   040   000    Old_age   Always       -       35 (0 17 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       13964h+13m+06.315s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       133056486775971
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       78067852097742

SMART Error Log Version: 1
ATA Error Count: 1
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1 occurred at disk power-on lifetime: 13894 hours (578 days + 22 hours)
  When the command that caused the error occurred, the device was in an unknown state.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 51 00 00 00 00 00  Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  00 00 00 00 00 00 00 ff   1d+04:51:31.483  NOP [Abort queued commands]
  b0 d4 00 82 4f c2 00 00   1d+04:51:10.552  SMART EXECUTE OFF-LINE IMMEDIATE
  b0 d0 01 00 4f c2 00 00   1d+04:51:10.501  SMART READ DATA
  ec 00 01 00 00 00 00 00   1d+04:51:10.490  IDENTIFY DEVICE
  ec 00 01 00 00 00 00 00   1d+04:51:10.490  IDENTIFY DEVICE

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     13896         -
# 2  Extended offline    Aborted by host               90%     13894         -
# 3  Extended captive    Interrupted (host reset)      90%     13894         -
# 4  Extended offline    Aborted by host               90%     13894         -
# 5  Vendor (0x50)       Completed without error       00%         2         -
# 6  Vendor (0x50)       Completed without error       00%         0         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Emerson · 05-13-2016, 08:09 PM

I'd replace the SATA cable.

derekpock · 05-13-2016, 08:48 PM

Okay, I can do that. What alerts you that it might be the cable?

rknichols · 05-13-2016, 09:13 PM

The SMART report looks fine -- nothing apparently wrong with the drive internally.

SATA cable? Maybe. I don't see anything that points to that, but it's an easy thing to try.

Next thing I'd try is a good, long memory test -- at least overnight, long enough for several complete test cycles.

Emerson · 05-13-2016, 09:17 PM

Quote:

Originally Posted by derekpock

Okay, I can do that. What alerts you that it might be the cable?

Because there are errors (attributes 1 and 7), but the drive is OK. I'd say it is the cable, the SATA port, or a power supply that is out of spec.

rknichols · 05-13-2016, 09:33 PM

It's a Seagate drive, and the raw values in those parameters are not simple error counts. See http://www.users.on.net/~fzabkar/HDD..._RRER_HEC.html for more info. In any event, those are internal events in the drive, unrelated to the quality of the SATA cable.
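
As a rough illustration only: Seagate's 48-bit raw values are commonly described as an error count in the upper 16 bits and an operation count in the lower 32 bits. That layout is an assumption here, not something the SMART report itself states, and the `split_raw` name is made up.

```shell
# Split a SMART raw value into its upper 16 bits and lower 32 bits.
# Assumed layout: "<error count> <operation count>".
split_raw() {
    echo "$(( $1 >> 32 )) $(( $1 & 4294967295 ))"
}

split_raw 4499716643   # Seek_Error_Rate raw value from the report; prints "1 204749347"
split_raw 139838856    # Raw_Read_Error_Rate raw value; prints "0 139838856"
```

Read that way, the large raw numbers would mostly reflect how many operations the drive has performed rather than how many of them failed.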

derekpock · 05-13-2016, 10:32 PM

Yea, I changed the SATA cable, no difference. By memory test, do you mean the program you can boot into from grub before selecting Ubuntu? Or is there another program / method of memory testing I should try?