RAID1+LVM: data keeps getting corrupted
After a recent HD failure, I decided to start using software RAID-1 (md). I only had one disk at the time, but a second, smaller disk would become available once I had migrated my fileserver to its own software RAID-1 on two new disks.
So I created a partition on my first disk, the same size as the second disk was going to be, and configured it as a degraded RAID-1. On that RAID device I created an LVM2 volume group, and in it 2 LVM2 volumes (home and data). That worked perfectly for a few months. A few days ago, I added the other disk to the still-degraded software RAID-1. It started syncing and the RAID became clean. Everything seemed all right, until somewhat later I found that my home partition had suddenly been remounted read-only because of some trouble with the ext3 journal. After an fsck it turned out I had a lot of errors on that partition, with lots of inodes I had to clear or fix.. :-(
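For anyone following along, the "degraded" versus "clean" state described above can be read from /proc/mdstat. Here is a minimal sketch in Python that parses text in the usual /proc/mdstat layout. The sample text, the device names and the helper names (parse_mdstat, is_degraded) are made up for illustration and are not taken from the poster's machine.

```python
import re

# Hypothetical sample in the usual /proc/mdstat layout; device names are made up.
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md0 : active raid1 hdb1[1] hda3[0]
      78148096 blocks [2/2] [UU]

md1 : active raid1 hda5[0]
      19534912 blocks [2/1] [U_]

unused devices: <none>
"""


def parse_mdstat(text):
    """Return {array: {"expected": n, "active": m, "resyncing": bool}}."""
    arrays = {}
    current = None
    for line in text.splitlines():
        header = re.match(r"(md\d+)\s*:", line)
        if header:
            # Start of a new array stanza, e.g. "md0 : active raid1 ..."
            current = header.group(1)
            arrays[current] = {"expected": None, "active": None, "resyncing": False}
            continue
        if current is None:
            continue
        # "[2/1]" means 2 member devices configured, 1 currently active.
        counts = re.search(r"\[(\d+)/(\d+)\]", line)
        if counts:
            arrays[current]["expected"] = int(counts.group(1))
            arrays[current]["active"] = int(counts.group(2))
        # A progress line such as "recovery = 12.6% ..." appears while syncing.
        if re.search(r"\b(recovery|resync)\s*=", line):
            arrays[current]["resyncing"] = True
    return arrays


def is_degraded(info):
    """True when fewer members are active than the array is configured for."""
    return info["active"] is not None and info["active"] < info["expected"]


if __name__ == "__main__":
    for name, info in parse_mdstat(SAMPLE_MDSTAT).items():
        state = "degraded" if is_degraded(info) else "all members active"
        print(name, state, "(resyncing)" if info["resyncing"] else "")
```

On a real system you would pass in the contents of /proc/mdstat instead of the sample string. Note that "all members active" only says the mirror finished syncing; it says nothing about whether the filesystem on top is intact.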
But the system came up again, with no more errors.. Until, while deleting a few big files from the data partition: read-only filesystem. The logs again showed the ext3 journal giving up and the partition being remounted read-only. Again lots of data corruption and a lot of files lost..
After another reboot, everything seemed OK again. But after a while the home partition went read-only again.
Neither disk ever gave any problems before, and the problems started when I hot-added one disk to the initially degraded software RAID-1. There are no messages about DriveReadySeek errors or anything like that. It's always the ext3 journaling code that seems to find something wrong, causing the filesystem to be remounted read-only. There are no other errors in the logs that could point to a hardware failure.
I decided to remove that other disk from the RAID again, since the problems started with that disk.
But even after removing it, the corruption keeps happening on the LVM2 volumes.
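The distinction made above, filesystem-level complaints versus block-device errors, can be checked mechanically when the logs are long. Below is a rough sketch that sorts kernel log lines into those two groups. The match patterns and the sample lines are assumptions based on commonly seen kernel messages, not the poster's actual logs, and the exact wording differs between kernel versions.

```python
import re

# Assumed patterns; adjust them to the messages your kernel actually prints.
DISK_PATTERNS = [r"DriveReady", r"SeekComplete", r"I/O error"]
FS_PATTERNS = [r"EXT3-fs error", r"Aborting journal", r"Remounting filesystem read-only"]

# Illustrative lines only; device names and numbers are made up.
SAMPLE_LOG = [
    "kernel: EXT3-fs error (device dm-1) in start_transaction: Journal has aborted",
    "kernel: Remounting filesystem read-only",
    "kernel: hdb: dma_intr: status=0x51 { DriveReady SeekComplete Error }",
    "kernel: end_request: I/O error, dev hdb, sector 1234",
    "kernel: md: md0: sync done.",
]


def classify(line):
    """Label a log line as 'disk', 'filesystem' or 'other'."""
    if any(re.search(p, line) for p in DISK_PATTERNS):
        return "disk"
    if any(re.search(p, line) for p in FS_PATTERNS):
        return "filesystem"
    return "other"


def summarize(lines):
    """Count how many lines fall into each category."""
    counts = {"disk": 0, "filesystem": 0, "other": 0}
    for line in lines:
        counts[classify(line)] += 1
    return counts


if __name__ == "__main__":
    print(summarize(SAMPLE_LOG))
```

A log with only "filesystem" hits and no "disk" hits matches what is described above, but it does not rule out hardware: a disk or controller can return wrong data without reporting any error.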
What could have gone wrong? And how should I fix this?
Do I have to conclude that I should not put an LVM2 on a software RAID(1)?
In the meantime, the data corruption keeps happening.. However, it seems that without the second disk, fsck is now always able to repair the filesystem errors on its own, so no fatal errors anymore. But every few hours my partitions get remounted read-only, and I have to fsck them before I can use them again...
But this isn't such a special configuration, is it? It should work without problems, shouldn't it? What could have gone wrong? What should I watch out for when I recreate the RAID/LVM configuration? Or should I try EVMS?
Am I going to face the same problems with my fileserver when a disk crashes -> the RAID gets degraded -> I add a new disk to replace the old one -> data corruption?
The only thing that comes to mind would be a stick of memory having intermittent problems. I had one go bad on an older system. It would run memtest fine for half an hour, but if I let it run for a few hours the problem would show up. Stuff like that is a real PITA to diagnose.
As suggested, I tried the memtest86 utility and let it run for nearly 24 hours. It did 61 passes without any errors. Since the corruption always occurs within 24 hours, I think my memory is not the cause.
Something else that is strange: since I degraded the RAID again so that it runs on one disk only, only my home partition seems to get 'infected' by minor filesystem corruption. My data partition doesn't seem to have any problems anymore. While the RAID was running on 2 disks, I had severe corruption on both partitions... Still wondering what the problem can be.
I seem to have found the source of the problem.
I tried using that second disk on its own, without RAID or LVM, and now I get a lot of errors like this:
I will now check this disk thoroughly to find out what is going wrong with it...
It is also strange that my home partition within the still-degraded RAID-1 kept getting corrupted.. I have now reformatted that partition as reiserfs, hoping the problem goes away there too..
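One simple way to start such a check is a read-only pass over the whole device that records where reads fail. The sketch below is only an illustration of that idea, not what was actually run here; scan_readable is a hypothetical helper name, and dedicated tools such as badblocks or smartctl do a far more thorough job.

```python
import os


def scan_readable(path, chunk_size=1024 * 1024):
    """Read `path` from start to end without writing anything.

    Returns (bytes_read, bad_offsets), where bad_offsets lists the byte
    offsets of chunks whose read raised an OS-level error.
    """
    bad_offsets = []
    bytes_read = 0
    offset = 0
    with open(path, "rb", buffering=0) as dev:
        # Seeking to the end gives the size of a regular file or block device.
        size = dev.seek(0, os.SEEK_END)
        while offset < size:
            dev.seek(offset)
            try:
                data = dev.read(min(chunk_size, size - offset))
            except OSError:
                # Remember the failing chunk and skip past it.
                bad_offsets.append(offset)
                offset += chunk_size
                continue
            if not data:
                break
            bytes_read += len(data)
            offset += len(data)
    return bytes_read, bad_offsets


# Example (hypothetical device name, needs root):
#   total, bad = scan_readable("/dev/hdb")
```

A clean pass only shows that every sector could be read. It cannot tell whether the data returned is the data that was written, which is the kind of silent corruption that would fit the symptoms in this thread.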