newbie raid question

rusl · 07-17-2008, 01:39 PM

Hi,

I'm trying out some raid5 and raid1 partitions for the first time. All has been successful except for some mounting issues and fsck. I've been searching all around and I find people with very specific problems and advice but I think there is something basic I am not understanding about raid filesystems.

My basic questions is this: How do you most properly fsck (for regular maintenance) a RAID5 or RAID1 filesystem?

Here are some specific details: I'm using mdadm. I'm using Ubuntu Hardy. I have dmraid installed (even though I don't know what it does but it seems to make the RAID mount automatically on boot rather than manually as I had to before I installed it.) I am running LVM2 on top of the RAID. All the partitions at all levels are ext3 because I wanted to play it safe. The LVM actually spans several RAID devices. This is because my drives aren't all identical and I'm taking advantage of the awesomeness of mdadm/LVM by combining partitions.

The more specific problems I have is this: On boot I was getting errors because the fsck wasn't working in the premount stage for the RAID and LVM partitions in the fstab. So, I removed the fsck check for those partitions by setting the <dump> field to 0 for the raid partitions. Fine, no more errors but I've removed the safety check for my RAID filesystems. And I don't know how to run it otherwise.

My understanding of fsck is that you run it on an unmounted filesystem only. I also have read from Googling that fsck on a raid /dev/md0 partition can screw things up. Also with LVM? I think this is because you might mistakenly have the fsck run on the wrong level - there is the ext3 filesystem but the /dev/md0 is also a filesystem... well it has it's own UUID that is different from the UUID of the component partitions and also different from the UUID of the usuable filesystem within the RAID or in my case the LVM partition level. I find this all very confusing.

Most of the Google hits I have found for this are about people with big problems where they have run fsck on the RAID in the incorrect way and somehow wrecked things or erased the superblock?

So I once again return to the basic question. What is the proper way to fsck a LVM and RAID? Even better would be how get the default fstab settings to be error free but not skipping the safety check. I would rather not get to the point where I have a big problem. But right now I can't even figure our a safe way to just fsck manually after an unmount command - what device am I supposed to be checking and with what fsck options?

Thanks for reading and any help you can offer.

pruneau · 07-17-2008, 04:16 PM

Before helping you, would you mind to tell us about your setup.
Please run those commands when your RAID setup is working normally:
- use "pvscan -v": this is going to give infos on your "physical" device, your logical volumes and your volume groups
- use "cat /proc/mdstat": this is going tell all about your raid volumes
- give us the output of "mount", and this is going to tell what is mounted where.

Now, of course, it's _not_recommended_ to run fsck on a random device, because if fsck mistakenly recognize the device for something else that what it is, havoc will ensue. As well, running fsck on a mounted file system is also unwise, since both the kernel and fsck are concurrently updating control structures, and anything can happen.

rusl · 07-17-2008, 05:20 PM

Thanks for looking at this pruneau!

Here is the info:
$sudo pvscan -v

Code:

   Wiping cache of LVM-capable devices
    Wiping internal VG cache
    Walking through all physical volumes
  PV /dev/md0   VG raid5end   lvm2 [234.37 GB / 0    free]
  PV /dev/md1   VG raid5end   lvm2 [263.68 GB / 0    free]
  PV /dev/md2   VG raid5end   lvm2 [275.20 GB / 0    free]
  Total: 3 [773.25 GB] / in use: 3 [773.25 GB] / in no VG: 0 [0   ]

This only gives info under sudo, is that normal? I mostly use the redhat GUI to configure the LVM so far

$ cat /proc/mdstat

Code:

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md6 : active raid1 sdd3[0] sdb8[1]
      21125376 blocks [2/2] [UU]
      
md5 : active raid1 sdd2[0] sdb7[1]
      21125376 blocks [2/2] [UU]
      
md4 : active raid1 sdd1[0] sdb6[1]
      40064 blocks [2/2] [UU]
      
md2 : active raid5 sda5[0] sdc7[3] sdb2[2] sdd5[1]
      288567168 blocks level 5, 128k chunk, algorithm 2 [4/4] [UUUU]
      
md1 : active raid5 sdc5[0] sda6[3] sdb3[2] sdd6[1]
      276493824 blocks level 5, 256k chunk, algorithm 2 [4/4] [UUUU]
      
md0 : active raid5 sda7[0] sdb4[3] sdd7[2] sdc6[1]
      245760000 blocks level 5, 128k chunk, algorithm 2 [4/4] [UUUU]

unused devices: <none>

I have 3 raid5 partitions where my data is going - these are combined using LVM. I may add one more set and name it /dev/md3 because there is a small bit of space I realised I could also add to the big LVM-RAID5. I have 3 RAID1 partitions for the OS. Actually these will be 2 OSs (ubuntu 32bit and Ubuntu64bit) and then a common boot partition. I have put stuff in these partitions and they seem to work but I am not actively using them to run the OS because I don't want to do that until I know things are safe and I figure out the fsck situation. So right now they are just experimental. The RAID5 on the other hand is full of data already (media files).

$ mount

Code:

/dev/sdc3 on / type ext3 (rw,errors=remount-ro)

proc on /proc type proc (rw,noexec,nosuid,nodev)
/sys on /sys type sysfs (rw,noexec,nosuid,nodev)
varrun on /var/run type tmpfs (rw,noexec,nosuid,nodev,mode=0755)
varlock on /var/lock type tmpfs (rw,noexec,nosuid,nodev,mode=1777)
udev on /dev type tmpfs (rw,mode=0755)
devshm on /dev/shm type tmpfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
lrm on /lib/modules/2.6.24-19-generic/volatile type tmpfs (rw)

/dev/sda2 on /gruboot type ext3 (rw)
/dev/md4 on /mnt/grubboot type ext3 (rw)

/dev/md6 on /mnt/ubuntu32 type ext3 (rw,noexec,nosuid,nodev,noatime)
/dev/md5 on /backup type ext3 (rw,noatime)

/dev/mapper/raid5end-raid5lvm on /now type ext3 (rw)
securityfs on /sys/kernel/security type securityfs (rw)
binfmt_misc on /proc/sys/fs/binfmt_misc type binfmt_misc (rw,noexec,nosuid,nodev)

Hope that helps, I cleared out a few of the mountpoints that aren't relevant so it is easier to read (partitions for other stuff such as backups)

Thanks again!

pruneau · 07-19-2008, 10:37 AM

OK, that'a an impressive rig you have access to, or you did gather a lot of hardware, did you ?
One of the first thing that strikes me is that your setup is somewhat convoluted/complicated.
But more on this later.

So I'll start first by describing what exists, I'll try and answer your questions, and then I'll explain how it could be modified to be easier to manage/understand.

As well, I forgot to ask you about the output of "fdisk -l", just to get an idea about the disk partitioning, but I think I got you drift.

Tell me if I'm wrong, but you appear to have 4 disks, either SCSI or SATA (probably the latest). As well, they are fairly big (even by nowadays standard), around 750/800 G, unless you used 4 1Gb disks and kept some spare free space.
You system boots from /dev/sda2, one of the non-raid boot partitions. Its root partition is /dev/sdc3.

Right there, its possible to have an explanation about your fsck failure: if you are using the original distribution kernel, it's highly possible that the modules requires to access your lvm and raid partition are not loaded at this time, and thus trying to pre-fsck those is going to fail because the device to access them simply do not exist.

If you want to access all the fancy stuff that not in a basic partition, you have to make sure that your kernel initrd (initial ram disk) contains all the driver for md and lvm. Have a look at the mkinitrd manual.

Now for the rest: you experimented and that's OK, but I'm afraid that you complicated things while experimenting, and I hope your RAID is not too full enough to be modified.

You see, RAID 1 or 5 exists to cater for failures of hard drives: so it's quite unnecessary to have _3_ RAID 5 arrays scattered on the same 4 hard drives: if one of them fails, you'll end up with three arrays to reconstruct _anyway_, exactly the same work involved as reconstructing a single raid 5 array.
Having stuff scattered in 3 RAID 5 arrays and grouping them back with LMV is a band-aid solution to a non-existent problem. It overly complicates the back-end and makes it difficult to understand an manage, because the raid partition are scattered all over the place in an effort to dilute the risks.
What's cool is that intuitively, you probably understand what's RAID is about, but you tried to do yourself what RAID5 should be doing for you. Don't worry, that a very common problem with people and computer, even experienced people.

As for the boot partition, you got it OK: having a mirrored set-up for grub is what you need. You just have to understand that for this to be redundant, you will have to install grub on both hard drive of the mirror, even if one is only going to be used in case of failure of the other.

Now, in you shoes, that's how I would configure you system.

Each disk would look like this:

for each X=a,b,c,d
#(sorry, could not resist

Device Boot Blocks Id System
/dev/sdX1 1Gb fd Linux raid autodetect # /dev/md0 RAID 5 through (optionally) LVM
/dev/sdX2 * 100M fd Linux raid autodetect # /dev/md1 with sda2 and sdb2: use sdc3 and sdd3 like you want
/dev/sdX3 * TheREST fd Linux raid autodetect # /dev/md2 RAID 5 though LMV to allow for a / and data partition

Explanation:
* SWAP: Yes, you need to put the SWAP in a raid partition, because if something goes wrong with one of the swap disks, you system is no less hosed than with a damaged root disk.
Its setup would go like this:
/dev/sda1\
/dev/sdb1| --> RAID5 /dev/md0 --> 1 volume group with 1 Logical volume using all the space.
/dev/sdc1|
/dev/sdd1/
* BOOT: well, the purpose of RAID 1 mirror is _NOT_ to have redundant fail-safe RAID access at run-time, because the boot partition is not accessed out of boot sequences and kernel/boot loader updates. The purpose of the RAID1 is to maintain an identical copy of the first partition, so if it gets damaged, you still have a working copy, and can boot from the other OK hard drive. For example, if you are really paranoid, you can mirror it on all 4 disks.
It's setup would go like this:
/dev/sda2\
/dev/sdb2| --> RAID1 /dev/md1 No need for LVM here: we just need mirrors.
/dev/sdc2|
/dev/sdd2/
*TheREST: well, you've got all the rest to use for your root and data partition.
/dev/sda3\
/dev/sdb3| --> RAID5 /dev/md2 one Volume Group with as much Logical volume as you need (maybe one, maybe more)
/dev/sdc3|
/dev/sdd3/

You see, sometime it's useful to split the available space in a lot of different partitions/file system for resiliency, because if one of them gets damaged, then the other still works, which is of somewhat importance when it comes to the root partition (and file system full condition, for example). But it's usually relevant for high-use server, so I would not bother going further that a root and a data partition. It's up to you

Of course, the big non-trivial question would be: how to migrate from you situation to the "ideal" one ?
That's another story depending on the quantity of data you have. The only thing I would recommend is to keep a backup and a boot disk handy, because some error in the installation process can easily make you system unbootable, but recoverable (I'm talking from experience here).

Now it was a long post, but I swear the next answers are going to be shorter if you have more questions.

rusl · 07-24-2008, 02:56 PM

wow, thanks again for all that pruneau!

Why the RAID5 is all over the place: I had an even more scattered system before this - but it was mostly full. I have a terrible time deleting things (I'm a pack rat at home too). So, the only way I could figure out how to move almost than a terabyte of data without resorting to buying an expensive (though the price is coming down) hard drive and waiting forever for it to shuffle around was this scheme of splitting the raid5 into 3 stripes. It's actually not as confusing as it seems as I placed them all in roughly the same layout and since they are each unique sizes I can tell the differance even if things get reordered.

Here is #fdisk -l

Code:

Disk /dev/sda: 300.0 GB, 300069052416 bytes
255 heads, 63 sectors/track, 36481 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x0001decc

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1        2828    22715878+  83  Linux
/dev/sda2            2829        2833       40162+  83  Linux
/dev/sda3            2834       36481   270277560    5  Extended
/dev/sda5            2834       14808    96189124+  fd  Linux raid autodetect
/dev/sda6           14809       26282    92164873+  fd  Linux raid autodetect
/dev/sda7           26283       36481    81923436   fd  Linux raid autodetect

Disk /dev/sdb: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x742ddd49

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1       27153   218106441    5  Extended
/dev/sdb2           27154       39128    96189187+  fd  Linux raid autodetect
/dev/sdb3           39129       50602    92164905   fd  Linux raid autodetect
/dev/sdb4           50603       60801    81923467+  fd  Linux raid autodetect
/dev/sdb5               1         788     6329515+   b  W95 FAT32
/dev/sdb6   *         789         793       40131   fd  Linux raid autodetect
/dev/sdb7             794        3423    21125443+  fd  Linux raid autodetect
/dev/sdb8            3424        6053    21125443+  fd  Linux raid autodetect
/dev/sdb9            6054        8886    22756041   fd  Linux raid autodetect
/dev/sdb10           8887        9396     4096543+  82  Linux swap / Solaris
/dev/sdb11           9397        9778     3068383+  82  Linux swap / Solaris
/dev/sdb12           9779       21893    97313706   83  Linux
/dev/sdb13          21894       24320    19494846   83  Linux
/dev/sdb14          24321       27153    22756041   83  Linux

Disk /dev/sdc: 300.0 GB, 300069052416 bytes
255 heads, 63 sectors/track, 36481 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x000121c3

   Device Boot      Start         End      Blocks   Id  System
/dev/sdc3               1        2833    22756041   83  Linux
/dev/sdc4            2834       36481   270277560    5  Extended
/dev/sdc5           14809       26282    92164873+  fd  Linux raid autodetect
/dev/sdc6           26283       36481    81923436   fd  Linux raid autodetect
/dev/sdc7            2834       14808    96189124+  fd  Linux raid autodetect

Partition table entries are not in disk order

Disk /dev/sdd: 320.0 GB, 320072933376 bytes
255 heads, 63 sectors/track, 38913 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00019d80

   Device Boot      Start         End      Blocks   Id  System
/dev/sdd1   *           1           5       40131   fd  Linux raid autodetect
/dev/sdd2               6        2635    21125475   fd  Linux raid autodetect
/dev/sdd3            2636        5265    21125475   fd  Linux raid autodetect
/dev/sdd4            5266       38913   270277560    5  Extended
/dev/sdd5            5266       17240    96189124+  fd  Linux raid autodetect
/dev/sdd6           17241       28714    92164842   fd  Linux raid autodetect
/dev/sdd7           28715       38913    81923436   fd  Linux raid autodetect

Disk /dev/md0: 251.6 GB, 251658240000 bytes
2 heads, 4 sectors/track, 61440000 cylinders
Units = cylinders of 8 * 512 = 4096 bytes
Disk identifier: 0x00000000

Disk /dev/md0 doesn't contain a valid partition table

Disk /dev/md1: 283.1 GB, 283129675776 bytes
2 heads, 4 sectors/track, 69123456 cylinders
Units = cylinders of 8 * 512 = 4096 bytes
Disk identifier: 0x00000000

Disk /dev/md1 doesn't contain a valid partition table

Disk /dev/md2: 295.4 GB, 295492780032 bytes
2 heads, 4 sectors/track, 72141792 cylinders
Units = cylinders of 8 * 512 = 4096 bytes
Disk identifier: 0x00000000

Disk /dev/md2 doesn't contain a valid partition table

Disk /dev/md4: 41 MB, 41025536 bytes
2 heads, 4 sectors/track, 10016 cylinders
Units = cylinders of 8 * 512 = 4096 bytes
Disk identifier: 0x00000000

Disk /dev/md4 doesn't contain a valid partition table

Disk /dev/md5: 21.6 GB, 21632385024 bytes
2 heads, 4 sectors/track, 5281344 cylinders
Units = cylinders of 8 * 512 = 4096 bytes
Disk identifier: 0x00000000

Disk /dev/md5 doesn't contain a valid partition table

Disk /dev/md6: 21.6 GB, 21632385024 bytes
2 heads, 4 sectors/track, 5281344 cylinders
Units = cylinders of 8 * 512 = 4096 bytes
Disk identifier: 0x00000000

Disk /dev/md6 doesn't contain a valid partition table

I have 2 x 300G, a 320G, and a 500G. You are right they are all SATA.

That strange group of sizes (each was cheapest at the time of purchase) is also the reason I have things so weirdly setup. I basically have been trying to make lemonade out of lemons (though actually it all works pretty good - just a bit convoluted - and unfortunately it is 99% full:-()

I've basically striped the RAID5 in 3 segments making up almost all the 300G, Then I have about 40G in RAID1 between the 320G and the 500G. Then there is extra space on the 500G drive for swap and anything else less crucial. Right now however, there is still about 20G not yet put into the RAID5 because that is where the old OS's reside and I need to be sure before I move them.

I've put quite a bit of thought into this disk layout, trying to make the best of what is avaliable. I was hoping for it to be an ideal setup given what I have. Is there really any harm in the band aid RAID5 stripe setup if i can keep a handle on where the disks are? It has been useful in building things (because finding 1/3TB free space is a lot easier than an entire TB). Or does it add more overhead to the disk I/O having 3 instead of one?

I'll have to look into the initrd stuff in the kernel. I find that stuff a bit intimidating. I just use the normal kernel ubuntu sends me. I thought that the LVM, mdadm and dmraid packages would install all you needed. It does seem to work. But I'll look at that.

Back to the fsck which is really the issue at the centre of all this... putting aside the pre-mount attempts. How do I run fsck manually? I would unmount the LVM. Then do I need to stop the raid /dev/md0? What device name do I tell fsck to read and any special switches?

To be more specific lets say I've got my md0 : active raid5 made from sda7,sdb4,sdd7,sdc6 how do I fsck that?

My understanding of the fsck problem is that is comes from the /dev/md# not showing up as having a valid partition table in fdisk. But isn't that normal for fdisk and gparted to say that for LVM or RAID? Correct me if I'm wrong.

As you can tell I'm pretty tolerant of the band-aid solutions (with plans to do it properly for next time) and running fsck from a crontab would work for me if it was simplest.

Thanks tremendously for all your patience in reading this pruneau, and for taking a look at this. Nothing is more frustrating that having do something the hard way and then ask how to fix it from there! Sorry for going on and on too.

Cheers! :-)

rusl

PS Another problem I have that is only sort of related but prevents me from naming the disks more simply is that they get reordered on bootup. The 500G is sdd and half the time sdb... So I have to use a lot of disklabels and UUID in the fstab... which was confusing for a while but I eventually figured it out by reading the udevd manual. It's because the 500G is on a promise fastrax SATA PCI card due to my motherboard SATA slots are full and also can't for some reason read the 500G drive itself (because I think the BIOS firmware needs to be upgraded but I don't have a working copy of windoze and all the bios update programs seem to require that - but that's another story) Just saying that in case you wonder why I'm not naming things better.

PPS part of the reason I have tons of partitions is that I've been shuffling data around on this system for a while and I've found it faster to move data using the mounted filesystems then delete or create a small ext3 partition rather than performing 2 hour long gparted filesystem shrink/grow/move commands.

rusl · 08-07-2008, 05:47 PM

Well, I guess pruneau is busy.

I finally found something useful on Google to address this problem by searching under the fsck error that I'm getting:

Code:

The filesystem size (according to the superblock) is xxx
The physical size of the device is xxx
Either the superblock or the partiion table is likely to be corrupt!

Pith.org - How I Fixed My Raid-1 Partition Size Error

I've been following the steps. Here is a summary:

the site references another howto on RAID creation:
Step-11 - resize filesystem

When we created the raid device, the physical partion became slightly smaller because a second superblock is stored at the end of the partition. If you reboot the system now, the reboot will fail with an error indicating the superblock is corrupt.

Resize them prior to the reboot, ensure that the all md based filesystems are unmounted except root, and remount root read-only.

(rescue)# mount / -o remount,ro

You will be required to fsck each of the md devices. This is the reason for remounting root read-only. The -f flag is required to force fsck to check a clean filesystem.

(rescue)# e2fsck -f /dev/md0

This will generate the same error about inconsistent sizes and possibly corrupted superblock.Say N to 'Abort?'.

(rescue)# resize2fs /dev/md0

Repeat for all /dev/md devices.
http://tldp.org/HOWTO/Software-RAID-HOWTO-7.html
So basically the filesystem wasn't created properly by myself (it's my first time with RAID). So this may cause bad block errors!

The solution:
Code:
1. Unmount all partitions (at least all RAID /dev/md#)
2. Repair the partitions
3. Resize the partitions
There are also instructions for moving your / to /tmp if you have to fix the / as well. Fortunately I haven't yet put my / into RAID1 yet so I don't need to worry about that. But the instructions are ther if you need it. Basically, you need to be in Read Only mode to repair the /.

After you have unmounted the RAID partitions /dev/md# the,
Code:
e2fsck -cc /dev/md#
#will take a long, long time, for my 750G RAID5 over 33 hours total
e2fsck -f /dev/md#
resize2fs /dev/md#
e2fsck -f /dev/md#
That's it. Seems to work now. Actually I'm only about 1/2 way through the fsck -cc /dev/md# but the other smaller RAID1 partitions are done and seem to work. Also, I used fcsk rather than e2fsck simply because I use ext3 for everything and fsck defaults to whatever you are using anyway. Also, somehow you don't have to specify the size to resize to with resize2fs. That's nice. Apparently the whole process is non-destructive even though it takes forever

So that's it. I'll update if anything goes wrong but so far (even though the RAID is only half done) it looks peachy!

rusl · 08-08-2008, 01:26 PM

Ok, well, I guess I wasn't right, not yet solved (sigh...). The fsck -cc of my 750G RAID5 completed in only 29 hours (not the estimated 33hrs). However, I can't run the fsck on the /dev/md# because it has LVM2 partitions within it. So now I am stuck again. Many things mention not to run fsck on the partitions that make up LVM. So where does that leave me? I've fscked the partition inside the LVM (/dev/mapper/logical-volume) sucessfully so is that the end of it? Should I be fscking the /dev/sd## partitions that make up the RAID that contains the LVM? Am I done if I can fsck the LV fine?

kenoshi · 08-08-2008, 03:27 PM

If you fsck'd the logical volume, you shouldn't have to do anything else.

BTW there's nothing wrong with 3 RAID5 devices on the same drives. In the old days, this was a common practice, especially for different applications that used the drives differently and didn't require a ton of space.

This is essentially the same as carving up disk raid groups into smaller volumes on modern SAN arrays anyhoo.

pruneau · 08-11-2008, 04:59 AM

Well, sorry for the long silence, but yes, I was busy, and still is ;-(

As for the fsck, I'm not so sure what else than your file system you want to verify ? If the LMV/MD-raid structure is somehow corrupt, fsck is not the tool to verify those.

Basically, to get back to your first question (at last ;-), you should fsck whatever device is indicated into your /etc/fstab file, or the equivalent device, no more, no less.

The other device are _not_ going to be file systems, so there is no point trying to check them.

Here is a simple example to illustrate my point: if you have an usb key, say /dev/sda, you can create a file system on the _whole_ disk, with something like:

Quote:

mkfs.ext2 /dev/sda

In this case, your fstab entry is going to look like:

Quote:

/dev/sda /my/usbkey vfat user,noauto 0 0

Or, you can use fdisk to create a partition named /dev/sda1, and the equivalent mkfs operation and fstab entry are going to be:

Quote:

mkfs.ext2 /dev/sda1
/dev/sda1 /my/usbkey vfat user,noauto 0 0

The difference between the two situations ? Well, by using the partition table, you gave yourself the opportunity to have multiple partitions on the same media, but you paid the price of losing a small part of your media for the control structure, the partition table.
Using LVM and RAID is the same: you add control structures on your media, but you gain a _lot_ more flexibility and features. Hence a bit more complicated fsck.