[SOLVED] USB based root filesystem getting corrupted
I have a problem concerning a Linux installation on a USB stick. I installed Debian on an 8GB Intenso USB stick to free all the IDE connectors for data drives (RAID 5). The stick is bootable and has three partitions (boot, root, swap). Both boot and root are ext3 and are equally affected by the problem.
The system runs without problems after startup. It takes about 3 hours before the following appears in the logs:
[65218.818785] attempt to access beyond end of device
[65218.818810] sde2: rw=1217, want=21569394000, limit=11718656
[65218.818830] Buffer I/O error on device sde2, logical block 2696174249
[65218.818849] lost page write due to I/O error on sde2
[65218.818868] Aborting journal on device sde2.
[65218.820095] ext3_abort called.
[65218.820110] EXT3-fs error (device sde2): ext3_journal_start_sb: Detected aborted journal
[65218.820142] Remounting filesystem read-only
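To put the numbers in the log above into perspective, here is a minimal sketch that parses the `want=`/`limit=` line. It assumes the kernel reports both values in 512-byte sectors and that the filesystem uses 4 KiB blocks (8 sectors per block); the function name `parse_beyond_end` is only illustrative.

```python
import re

SECTOR_BYTES = 512  # assumption: want/limit are counted in 512-byte sectors


def parse_beyond_end(line):
    """Extract (device, want, limit) from a 'want=..., limit=...' kernel line."""
    m = re.search(r"(\w+): rw=\d+, want=(\d+), limit=(\d+)", line)
    if m is None:
        return None
    return m.group(1), int(m.group(2)), int(m.group(3))


line = "[65218.818810] sde2: rw=1217, want=21569394000, limit=11718656"
dev, want, limit = parse_beyond_end(line)

print(dev, "partition size in bytes:", limit * SECTOR_BYTES)
print(dev, "requested end offset in bytes:", want * SECTOR_BYTES)
print("overshoot factor:", want / limit)
# Assumption: 4 KiB filesystem blocks, i.e. 8 sectors per block.
print("last block touched:", want // 8 - 1)
```

Under these assumptions the partition is about 6 GB, while the request ends at roughly 11 TB, and the computed block number matches the `logical block 2696174249` reported in the next log line. A request that far out of range looks more like a garbage block number than an off-by-a-few partition boundary problem.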
However, problems tend to arise before that. The filesystem gets massively corrupted, mostly beyond what fsck can repair. I am well aware that USB flash devices tend to come with some problems regarding persistence, but this behavior is well beyond what is to be expected. The system cannot be rebooted; fsck can be forced to fix things but ends up corrupting the filesystem beyond recognition (/lib/ld-linux.so getting mangled, init being unreadable, binaries and directories getting lost without ending up in lost+found). I checked the stick for bad sectors, but have not found anything that would justify this behavior.
The filesystem was originally ext4. After the problem arose I downgraded it to ext3 (mkfs -t ext3, then copied a sound backup of the root filesystem back to the stick), but that did not solve the problem. In any case I doubt that the error is related to the filesystem itself, or that switching to JFS or something similar would solve the problem if a well tested FS like ext3 fails so massively.
I have searched for a reason why a page write would extend beyond the end of the device and found the following problem in the partition tables.
Partition 2 has different physical/logical beginnings (non-Linux?):
phys=(15, 140, 62) logical=(16, 77, 59)
Partition 2 has different physical/logical endings:
phys=(745, 1, 24) logical=(781, 133, 32)
Partition 2 does not end on cylinder boundary.
Partition 2: previous sectors 11968511 disagrees with total 11414612
All partitions on the USB stick have this problem, even after a fresh fdisk. I assume that is because the device is not a physical disk to which the concept of a physical sector applies (?), but it would explain why the filesystem screws up page writes by writing beyond the flash device's extents.
Please do not suggest switching the filesystem or using a "more capable" partitioner. I have full trust in both fdisk and ext3, having used these tools for more than a decade (ext2 before ext3). Other suggestions on how this problem can be overcome are more than welcome.
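As a side note on the physical/logical mismatch: a common reason for such fdisk warnings is that the CHS tuples stored in the partition table were computed with a different disk geometry than the one fdisk assumes. Whether that is what happened here cannot be told from the output alone. The sketch below uses the standard CHS/LBA conversion with two hypothetical geometries and a hypothetical start sector to show that one and the same sector maps to different tuples.

```python
def lba_to_chs(lba, heads, sectors_per_track):
    """Convert a linear sector number to a (cylinder, head, sector) tuple."""
    cylinder, rem = divmod(lba, heads * sectors_per_track)
    head, sector0 = divmod(rem, sectors_per_track)
    return cylinder, head, sector0 + 1  # CHS sector numbers are 1-based


def chs_to_lba(cylinder, head, sector, heads, sectors_per_track):
    """Inverse of lba_to_chs for the same geometry."""
    return (cylinder * heads + head) * sectors_per_track + (sector - 1)


lba = 250000  # hypothetical partition start, not taken from the thread
for heads, spt in [(255, 63), (16, 63)]:  # two hypothetical geometries
    print(f"heads={heads} spt={spt} ->", lba_to_chs(lba, heads, spt))
```

Since Linux addresses the device by linear sector number, a mismatch of this kind by itself does not move the partition boundaries the kernel uses.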
Hello jefro and RockDoctor, thank you both for your replies.
I grabbed a new USB flash drive (this time from Transcend). I created a new partition table, which verified OK. Then I disabled the swap by declaring the partition as Linux (83), formatting it as ext3, and removing the entry from fstab. Just to be sure, I added noswap to the extlinux boot parameters.
I also rechecked the new drive for bad sectors/blocks. fsck -c did not find any.
I ran into the very same error condition. Even more interestingly, it did not even take 3 hours this time; it happened at boot. Just after "Waiting for root filesystem" came an entire batch of "attempt to access beyond end of device". Of course the system did not boot after that.
The chipset of the computer is VIA, which I have had very bad experiences with in general. There is no support for USB 3.0, so I guess that aspect is taken care of. However, I will try to run in USB 1.1 mode and see if this fixes the problem. I do suspect that the USB controller has a problem with USB HighSpeed devices in some sense. It is quite strange that the USB mode is configurable in the BIOS... I will come back to you after I have played around with the USB settings a bit.
The distro is Debian (6). I am sorry for not being clear about this: I am not forcing USB 1.1 support via Linux (neither do I plan to enforce uhci drivers and the like). The chipset configuration (BIOS) has more or less detailed configuration options concerning USB. It can enable/disable USB support, enable USB 2.0 support, and has an option for choosing between Full and HighSpeed USB 2.0 (12 Mbit/s and 480 Mbit/s respectively). This distinction between Full and HighSpeed is something I have only seen implemented on VIA chipsets so far, so I am not really sure what to make of it. The system was using HighSpeed until now.
I do not believe that Linux and the associated IO/FS modules have a problem. The kernel in question is 2.6.32-5 by the way, but I agree that this is a hardware or chipset issue. I have for now taken the following actions:
(1) I have restored a sound backup of partition tables, bootloader and partition contents on the USB flash drive, including a newly created ext3 fs.
(2) I have removed all swap partitions from fstab and blkid.
(3) I have not initialized the swap partition at all. My OpenSUSE 11.3 at least was unable to use it as swap, so attempts by Debian to enable that partition as swap should fail as well.
(4) I have disabled USB1.1 support in the BIOS altogether.
So far I have not seen any of the related error messages on the running system. The uptime is however just above 6 hours. I have written a script that writes a 100MB large file every couple of hours on the flash drive and then deletes it again, just to produce some heavy IO activity. As I said, no problem so far.
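The original stress script was not posted. A minimal sketch of the idea, assuming a plain write, fsync and delete cycle is enough to generate the IO load (the function name `write_and_delete` and the file name `io_stress.tmp` are made up for illustration), might look like this:

```python
import os
import tempfile


def write_and_delete(directory, size_bytes, chunk_bytes=1024 * 1024):
    """Write size_bytes of zeros to a scratch file, force it to disk, delete it."""
    path = os.path.join(directory, "io_stress.tmp")  # hypothetical file name
    chunk = b"\0" * chunk_bytes
    written = 0
    with open(path, "wb") as f:
        while written < size_bytes:
            n = min(chunk_bytes, size_bytes - written)
            f.write(chunk[:n])
            written += n
        f.flush()
        os.fsync(f.fileno())  # make sure the data really reaches the device
    os.remove(path)
    return written


# Small demonstration in a temporary directory (2 MiB instead of 100 MB).
with tempfile.TemporaryDirectory() as scratch:
    print("bytes written:", write_and_delete(scratch, 2 * 1024 * 1024))
```

To mimic the test described above, the function would be called with a directory on the flash drive and `100 * 1024 * 1024` bytes, repeated from cron or a loop with `time.sleep` every couple of hours.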
If by tomorrow there are no filesystem errors, I will test USB2.0 using FullSpeed.
As I said, I suspect the USB2.0 BIOS settings to be the problem by now. Thank you jefro for the valuable hint concerning the chipset.
I have seen many BIOSes with the choice to enable 2.0. It does seem to affect some odd programs in Windows, dunno why. The slower USB speed will take forever to load and run. It may solve timing issues.
So 14 hours of runtime should probably be attributed to both the FullSpeed USB mode and the disabled swap.
And obviously the swap is related to the problem... But I have to admit that I do not know why... RockDoctor addressed this issue pretty straightforward. Is there a history of swap on flash drives corrupting neighboring partitions?
I've not known swap to corrupt another filesystem. I was just running out of ideas. I have a tendency to do full installs on my fastest 4GB flash drive. Without a swap, I actually have room for some personal files, so I just run without a swap partition or file. In my case, running without a swap partition doesn't seem to slow things down significantly - obviously, YMMV.
My only problem with data corruption on flash drives is with persistence files when using live CD images, and it's only been the persistence file that gets clobbered.
I believe I might have a reasonable explanation for the problem. Obviously there is no point in blaming Linux or the fs modules for the corruption. I believe both have been well tested and have existed for way too long to have such a major bug. I could blame the VIA chipset, but enabling swap and having the adjacent partition's filesystem fail cannot really be explained that way.
So I started looking at what is using the swap at the given time. It's actually a bit painstaking, given that the system fails shortly after that error, but I traced the problem to the VMware Server. The corruption occurs when the hypervisor attempts to move running virtual machines to swap memory. In particular, the error occurs once the hostd process begins allocating swap memory (seen in /proc/<pid>/smaps).
This makes somewhat more sense, as VMware does use low-level access to manage memory allocation. An error in that kernel module would explain the catastrophic effects of the filesystem error. It also explains why the partition sde2 keeps getting corrupted worse and worse over time, even when sde2 is automatically remounted read-only, as the kernel module handling the fs access is practically not in charge of the erroneous IO requests anymore.
Though turning off swapping of machine memory is an option, there are two reasons I do not want to go that way:
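For anyone wanting to repeat that check, here is a minimal sketch that sums the `Swap:` fields of a smaps dump. The sample text is made up for illustration; on a live system the input would be the contents of `/proc/<pid>/smaps` for the process in question.

```python
def swapped_kib(smaps_text):
    """Sum all 'Swap:' entries (reported in kB) of a /proc/<pid>/smaps dump."""
    total = 0
    for line in smaps_text.splitlines():
        if line.startswith("Swap:"):
            total += int(line.split()[1])
    return total


# Made-up excerpt in the smaps layout, not taken from the affected machine.
sample = """\
00400000-0040b000 r-xp 00000000 08:42 1234   /usr/bin/example
Size:                 44 kB
Rss:                  20 kB
Swap:                 12 kB
7f0000000000-7f0000021000 rw-p 00000000 00:00 0
Size:                132 kB
Rss:                   0 kB
Swap:                100 kB
"""
print("swapped out:", swapped_kib(sample), "kB")
```

Polling this value for a suspect process and logging it with a timestamp makes it easier to line up the first swap allocation with the first kernel error.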
- For one, I only have 2GB of RAM. I do want the machine being most active to be able to allocate physical memory and not confine the allocation of the host system.
- I don't really feel comfortable using linux without a swap memory.
I will try the following workaround and see if it fixes things:
I will reformat the sde3 partition to ext3 and create a 1GB image file. I will set up that file as swap using a loop device. Since it is not a real device, it should not be affected by a process trying to access it beyond its "physical" extents.
Get a new system. Too many people have a USB drive running native Linux. I have had a few USB drives that stunk but not one system that stunk. There are only so many things to test. I assume you have checked the md5/sha1 of this iso.
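The exact commands for this workaround were not posted. As a rough sketch, the usual sequence would be `dd`, `losetup`, `mkswap` and `swapon`; the snippet below only prints the commands instead of running them, and the image path `/mnt/sde3/swap.img` and the loop device `/dev/loop0` are assumptions.

```python
import shlex


def loop_swap_commands(image, size_mib, loop_dev):
    """Return the command lines for a file-backed swap area on a loop device."""
    return [
        ["dd", "if=/dev/zero", f"of={image}", "bs=1M", f"count={size_mib}"],
        ["losetup", loop_dev, image],  # attach the image file to the loop device
        ["mkswap", loop_dev],          # write the swap signature
        ["swapon", loop_dev],          # enable the new swap area
    ]


# Hypothetical paths; adjust to where sde3 is actually mounted.
for argv in loop_swap_commands("/mnt/sde3/swap.img", 1024, "/dev/loop0"):
    print(" ".join(shlex.quote(arg) for arg in argv))
```

All four commands need root privileges when run for real, and the image file must live on a filesystem that is mounted before swap is enabled.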
It's not the system either. Debian is ok, the installation was network based and the installer iso was verified.
I'm giving up on the stick and moving the installation onto the raid array. The stick will be used for booting and backups.
I'm flagging this thread as closed/solved. Here's a quick recap in case anyone should stumble in here:
USB Flash drive based Debian 6.0.6 installation using 3 partitions (boot/ext3,root/ext3,swap) kept corrupting the filesystem of the root partition after/during large IO operations. The stick was mainly used to run a VMWare Server Hypervisor. After running for 3 hours, the filesystem would be completely unusable (not restorable using fsck). No unreadable sectors could be found on the usb flash drive and it passed all read/write tests.
Tried without success:
- changing VIA chipset parameters for the USB host (USB1.1, USB2 Full/HighSpeed);
- mounting ext3 using nobarrier and sync;
- disabling the VMware hypervisor and associated kernel modules;
- disabling the swap partition and containing swap in a loop device.
Solution: none found. Workaround:
The Debian root partition was simply copied from a healthy backup to a free/new 6G RAID partition, and the UUIDs of the root partition were changed in the bootloader and fstab. The USB drive now serves as boot partition for extlinux. No further problems detected after that.
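The UUID change in fstab is a small text edit. A minimal sketch, assuming the root entry is referenced as `UUID=...` in the first field (the two UUIDs below are placeholders, not the real ones from this system):

```python
def replace_uuid(fstab_text, old_uuid, new_uuid):
    """Swap UUID=old for UUID=new in the device field of non-comment lines."""
    out = []
    for line in fstab_text.splitlines():
        fields = line.split()
        is_comment = line.lstrip().startswith("#")
        if fields and not is_comment and fields[0] == f"UUID={old_uuid}":
            line = line.replace(f"UUID={old_uuid}", f"UUID={new_uuid}", 1)
        out.append(line)
    return "\n".join(out) + "\n"


OLD = "11111111-1111-1111-1111-111111111111"  # placeholder: old root on the stick
NEW = "22222222-2222-2222-2222-222222222222"  # placeholder: new root on the RAID

fstab = f"# <fs> <mount> <type> <opts> <dump> <pass>\nUUID={OLD} / ext3 errors=remount-ro 0 1\n"
print(replace_uuid(fstab, OLD, NEW), end="")
```

The UUID of the new partition can be read with `blkid`; the same value also has to go into the `root=` parameter of the extlinux configuration.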
Thank you again for your help jefro and RockDoctor.