Weird hard drive behavior with kernel 4.4

ordealbyfire83 · 11-03-2017, 10:16 PM

I recently compiled kernel 4.4 for use with BLFS on a ThinkPad. I am trying this kernel branch because some issues with i915 graphics _might_ have been fixed well enough to make hibernate/resume workable when booting with an initrd. With earlier kernels, hibernate almost always failed (it only ever worked by the occasional accident) because of graphics problems: invalid ROM contents, where resetting anything on the GPU makes the resumed data fail the checksum of the swap/resume image. With this kernel I have resumed reliably, but I cannot run it long enough to test this properly because of other problems.

First, I am having serious problems using an external hard drive. This is a 2.5-inch SATA drive (traditional hard drive, not solid state) in a USB enclosure. When I connect the USB cable, I get errors such as these:

Code:

Nov  2 22:36:14 hostname kernel: [  170.563324] usb 1-1: new high-speed USB device number 3 using ehci-pci
Nov  2 22:36:14 hostname kernel: [  170.681163] usb 1-1: New USB device found, idVendor=13fd, idProduct=3940
Nov  2 22:36:14 hostname kernel: [  170.681177] usb 1-1: New USB device strings: Mfr=1, Product=2, SerialNumber=3
Nov  2 22:36:14 hostname kernel: [  170.681184] usb 1-1: Product: MK1665GSX
Nov  2 22:36:14 hostname kernel: [  170.681191] usb 1-1: Manufacturer: TOSHIBA
Nov  2 22:36:14 hostname kernel: [  170.681197] usb 1-1: SerialNumber: 30303030303030303030303030303030
Nov  2 22:36:14 hostname kernel: [  170.681780] usb-storage 1-1:1.0: USB Mass Storage device detected
Nov  2 22:36:14 hostname kernel: [  170.682714] scsi host4: usb-storage 1-1:1.0
Nov  2 22:36:15 hostname kernel: [  171.685542] scsi 4:0:0:0: Direct-Access     TOSHIBA  MK1665GSX        0204 PQ: 0 ANSI: 6
Nov  2 22:36:15 hostname kernel: [  171.686488] sd 4:0:0:0: Attached scsi generic sg2 type 0
Nov  2 22:36:15 hostname kernel: [  171.691693] sd 4:0:0:0: [sdb] Spinning up disk...
Nov  2 22:36:16 hostname kernel: [  172.692342] .ready
Nov  2 22:36:16 hostname kernel: [  172.693512] sd 4:0:0:0: [sdb] 312581807 512-byte logical blocks: (160 GB/149 GiB)
Nov  2 22:36:16 hostname kernel: [  172.694314] sd 4:0:0:0: [sdb] Write Protect is off
Nov  2 22:36:16 hostname kernel: [  172.694321] sd 4:0:0:0: [sdb] Mode Sense: 1f 00 10 08
Nov  2 22:36:16 hostname kernel: [  172.695347] sd 4:0:0:0: [sdb] Write cache: disabled, read cache: enabled, supports DPO and FUA
Nov  2 22:36:16 hostname kernel: [  172.738591]  sdb: sdb1
Nov  2 22:36:16 hostname kernel: [  172.742094] sd 4:0:0:0: [sdb] Attached SCSI disk
Nov  2 22:36:16 hostname kernel: [  173.010061] sd 4:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_SENSE
Nov  2 22:36:16 hostname kernel: [  173.010070] sd 4:0:0:0: [sdb] tag#0 Sense Key : Hardware Error [current] [descriptor]
Nov  2 22:36:16 hostname kernel: [  173.010074] sd 4:0:0:0: [sdb] tag#0 Add. Sense: No additional sense information
Nov  2 22:36:16 hostname kernel: [  173.010079] sd 4:0:0:0: [sdb] tag#0 CDB: ATA command pass through(12)/Blank a1 06 20 00 00 00 00 00 00 e5 00 00
Nov  2 22:36:17 hostname kernel: [  173.388546] sd 4:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_SENSE
Nov  2 22:36:17 hostname kernel: [  173.388554] sd 4:0:0:0: [sdb] tag#0 Sense Key : Hardware Error [current] [descriptor]
Nov  2 22:36:17 hostname kernel: [  173.388558] sd 4:0:0:0: [sdb] tag#0 Add. Sense: No additional sense information
Nov  2 22:36:17 hostname kernel: [  173.388564] sd 4:0:0:0: [sdb] tag#0 CDB: ATA command pass through(12)/Blank a1 06 20 da 00 00 4f c2 00 b0 00 00
Nov  2 22:36:37 hostname kernel: [  193.748218] EXT4-fs (dm-2): mounting ext3 file system using the ext4 subsystem
Nov  2 22:36:37 hostname kernel: [  193.807845] EXT4-fs (dm-2): mounted filesystem with ordered data mode. Opts: (null)

This would ordinarily signal a failing hard drive, but it only happens with kernel 4.4, and I even saw similar errors when I plugged in an ordinary flash drive. With kernel 3.2 or 3.14, the drive starts right up and all is well:

Code:

Nov  4 03:33:17 hostname kernel: [   27.695330] usb 1-1: new high-speed USB device number 3 using ehci_hcd
Nov  4 03:33:17 hostname kernel: [   27.812952] usb 1-1: New USB device found, idVendor=13fd, idProduct=3940
Nov  4 03:33:17 hostname kernel: [   27.813079] usb 1-1: New USB device strings: Mfr=1, Product=2, SerialNumber=3
Nov  4 03:33:17 hostname kernel: [   27.813170] usb 1-1: Product: MK1665GSX
Nov  4 03:33:17 hostname kernel: [   27.813232] usb 1-1: Manufacturer: TOSHIBA
Nov  4 03:33:17 hostname kernel: [   27.813286] usb 1-1: SerialNumber: 30303030303030303030303030303030
Nov  4 03:33:17 hostname kernel: [   27.814714] scsi4 : usb-storage 1-1:1.0
Nov  4 03:33:18 hostname kernel: [   28.817793] scsi 4:0:0:0: Direct-Access     TOSHIBA  MK1665GSX        0204 PQ: 0 ANSI: 6
Nov  4 03:33:18 hostname kernel: [   28.818625] sd 4:0:0:0: Attached scsi generic sg2 type 0
Nov  4 03:33:19 hostname kernel: [   28.822108] sd 4:0:0:0: [sdb] Spinning up disk....ready
Nov  4 03:33:19 hostname kernel: [   29.824408] sd 4:0:0:0: [sdb] 312581807 512-byte logical blocks: (160 GB/149 GiB)
Nov  4 03:33:19 hostname kernel: [   29.825424] sd 4:0:0:0: [sdb] Write Protect is off
Nov  4 03:33:19 hostname kernel: [   29.825497] sd 4:0:0:0: [sdb] Mode Sense: 1f 00 10 08
Nov  4 03:33:19 hostname kernel: [   29.826252] sd 4:0:0:0: [sdb] Write cache: disabled, read cache: enabled, supports DPO and FUA
Nov  4 03:33:19 hostname kernel: [   29.866420]  sdb: sdb1
Nov  4 03:33:19 hostname kernel: [   29.870627] sd 4:0:0:0: [sdb] Attached SCSI disk

It seems to me that kernel 4.4 isn't getting the "sense" information right. This is dangerous: shortly after plugging in this drive and doing some routine read/write activity, my entire desktop locked up. After seeing these errors I booted from a rescue CD (with an older kernel) and ran fsck to recover the journal (ext3).
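For what it is worth, the two failing CDBs in the 4.4 log above can be decoded by hand. Here is a minimal Python sketch, assuming the usual ATA PASS-THROUGH (12) byte layout from the SCSI/ATA Translation (SAT) specification. The lookup tables are deliberately partial and the function name is made up for illustration:

```python
# Decode the two ATA PASS-THROUGH (12) CDBs from the failing 4.4 log.
# Assumed layout (SAT): byte 0 opcode, byte 1 protocol (bits 4:1),
# byte 2 flags (bit 5 = CK_COND), byte 3 features, byte 6 LBA mid,
# byte 7 LBA high, byte 9 ATA command.
# The name tables only cover the values that appear in the log.

ATA_COMMANDS = {0xE5: "CHECK POWER MODE", 0xB0: "SMART"}
SMART_FEATURES = {0xDA: "SMART RETURN STATUS"}
PROTOCOLS = {3: "non-data"}


def decode_ata_pt12(cdb_hex):
    """Describe a 12-byte ATA PASS-THROUGH CDB given as hex text."""
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 12 or cdb[0] != 0xA1:
        raise ValueError("not an ATA PASS-THROUGH (12) CDB")
    protocol = (cdb[1] >> 1) & 0x0F
    feature = cdb[3]
    command = cdb[9]
    info = {
        "protocol": PROTOCOLS.get(protocol, "protocol %d" % protocol),
        "ck_cond": bool(cdb[2] & 0x20),
        "feature": feature,
        "lba_mid": cdb[6],
        "lba_high": cdb[7],
        "command": ATA_COMMANDS.get(command, "0x%02x" % command),
    }
    if command == 0xB0:
        # For SMART, the features register selects the subcommand.
        info["smart_subcommand"] = SMART_FEATURES.get(feature, "0x%02x" % feature)
    return info


for cdb_text in ("a1 06 20 00 00 00 00 00 00 e5 00 00",
                 "a1 06 20 da 00 00 4f c2 00 b0 00 00"):
    print(decode_ata_pt12(cdb_text))
```

Read this way, both are non-data commands with CK_COND set: a power-mode check and a SMART status query. CK_COND asks the bridge to hand the ATA registers back as sense data even when the command succeeds, so these look more like probes from a disk-monitoring tool than like failed reads or writes. That is only an interpretation of the bytes, not a diagnosis.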

Might anyone know what part of the kernel is responsible for this sort of thing, and be aware of any patch or any mandatory configure option that wasn't required in earlier versions?

And now for another issue: with this same kernel, i.e. vanilla kernel 4.4.14 (same version as used in Slackware) I get lockups within a minute or two after suspend/resume or hibernate/resume, and see stuff like this in the kernel log:

Code:

Nov  3 16:20:38 hostname kernel: [25082.316755] EXT4-fs (dm-0): re-mounted. Opts: commit=0
Nov  3 16:20:38 hostname kernel: [25082.319792] EXT4-fs (dm-2): re-mounted. Opts: data=ordered,commit=0
Nov  3 16:20:38 hostname kernel: [25082.322521] EXT4-fs (loop0): re-mounted. Opts: data=ordered,commit=0
Nov  3 16:20:38 hostname kernel: [25082.325421] EXT4-fs (dm-0): re-mounted. Opts: data=ordered,commit=0
Nov  3 16:20:38 hostname kernel: [25082.327396] EXT4-fs (dm-0): re-mounted. Opts: data=ordered,commit=0
Nov  3 16:20:38 hostname kernel: [25082.329399] EXT4-fs (dm-2): re-mounted. Opts: data=ordered,commit=0
Nov  3 16:20:38 hostname kernel: [25082.331298] EXT4-fs (dm-2): re-mounted. Opts: data=ordered,commit=0
Nov  3 16:22:30 hostname kernel: [25193.752804] general protection fault: 0000 [#2] SMP

where dm-0 is my root (/) partition and dm-2 is my external hard drive. Both are formatted as ext3. I know this kernel handles ext3 partitions through the ext4 driver but come on, it should know not to remount a live root ext3 partition so many times and not expect tears to fall.
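The pattern in the log above is easier to see as a tally. Here is a minimal Python sketch that counts the re-mount messages per device. The sample lines are copied from the log above, and the function name is only illustrative:

```python
import re
from collections import Counter

# Matches e.g. "EXT4-fs (dm-0): re-mounted. Opts: commit=0"
REMOUNT = re.compile(r"EXT4-fs \(([^)]+)\): re-mounted\.")


def count_remounts(lines):
    """Count 're-mounted' kernel messages per block device."""
    counts = Counter()
    for line in lines:
        match = REMOUNT.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = """\
Nov  3 16:20:38 hostname kernel: [25082.316755] EXT4-fs (dm-0): re-mounted. Opts: commit=0
Nov  3 16:20:38 hostname kernel: [25082.319792] EXT4-fs (dm-2): re-mounted. Opts: data=ordered,commit=0
Nov  3 16:20:38 hostname kernel: [25082.322521] EXT4-fs (loop0): re-mounted. Opts: data=ordered,commit=0
Nov  3 16:20:38 hostname kernel: [25082.325421] EXT4-fs (dm-0): re-mounted. Opts: data=ordered,commit=0
Nov  3 16:20:38 hostname kernel: [25082.327396] EXT4-fs (dm-0): re-mounted. Opts: data=ordered,commit=0
Nov  3 16:20:38 hostname kernel: [25082.329399] EXT4-fs (dm-2): re-mounted. Opts: data=ordered,commit=0
Nov  3 16:20:38 hostname kernel: [25082.331298] EXT4-fs (dm-2): re-mounted. Opts: data=ordered,commit=0
Nov  3 16:22:30 hostname kernel: [25193.752804] general protection fault: 0000 [#2] SMP
"""

for device, n in sorted(count_remounts(sample.splitlines()).items()):
    print(device, n)
```

On this excerpt that comes to three re-mounts each for dm-0 and dm-2 and one for loop0, all logged within the same second.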

I reverted commit e31fb9e00543e5d3c5b686747d3c862bc09b59f3 (i.e. the commit that purged ext3) and rebuilt, and this problem went away as I suspected. But I still cannot use my external hard drive.

ordealbyfire83 · 11-04-2017, 11:02 AM

It turns out these "sense" errors are false positives caused by a small kernel patch. Reversing this commit seems to fix the problem. Setting CONFIG_USB_UAS to 'y' may also have helped (though I did not have this set in the earlier 3.14 kernel). In any case, I can now access my external drive with neither kernel errors nor lockups.
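To compare a setting like this between two kernel builds, the option can be read straight from each tree's .config file. Here is a small Python sketch. The sample text is invented, and CONFIG_USB_STORAGE is there only as an illustrative second option, not something changed in this thread:

```python
import re


def config_value(config_text, option):
    """Return the value of a kernel .config option ('y', 'm', ...).

    Returns None when the option is absent or appears only as
    '# CONFIG_FOO is not set'.
    """
    pattern = re.compile(r"^%s=(.*)$" % re.escape(option), re.MULTILINE)
    match = pattern.search(config_text)
    return match.group(1) if match else None


# Hypothetical excerpt of a .config file, for illustration only.
sample = """\
CONFIG_USB_STORAGE=m
# CONFIG_USB_UAS is not set
"""

for option in ("CONFIG_USB_UAS", "CONFIG_USB_STORAGE"):
    print(option, config_value(sample, option) or "not set")
```

For a real comparison, pass in the contents of each kernel's own .config instead of the sample string.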

In addition, I found the ext4 driver to be less than reliable for ext3 root filesystems when hibernating/resuming. Maybe ext4 partitions can be unmounted/remounted after every resume, but doing so hosed my ext3 root partition badly enough that it needed restoring from a backup. Now I am using the ext3 driver and everything seems to work as expected.

For those who may be interested: reversing the (long) commit that removed the ext3 driver does work. Four hunks fail when the patch is reversed, but they are easy to apply by hand by looking at the relevant sections of the patch. Some lines have simply moved around too much for the patch program to handle them on its own.
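When hunks fail, the patch program leaves them behind in .rej files next to the files they were meant for. Here is a rough Python sketch for listing them, assuming the rejects were written in unified-diff format (hunk headers starting with "@@"). The function name is only illustrative:

```python
import os


def rejected_hunks(tree):
    """Map each *.rej file under `tree` to the number of unified hunks in it."""
    result = {}
    for dirpath, _dirnames, filenames in os.walk(tree):
        for name in filenames:
            if not name.endswith(".rej"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, errors="replace") as handle:
                # Each unified-diff hunk begins with a line like "@@ -1,2 +1,3 @@".
                result[path] = sum(1 for line in handle if line.startswith("@@ "))
    return result
```

Calling rejected_hunks() on the top of the kernel source tree after the reverse patch has run should show which files still need attention. Rejects written in the older context-diff format would be found but counted as zero hunks.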

I'll mark this as [Solved] if and when this kernel proves reliable enough for that purpose.