LinuxQuestions.org
Old 12-30-2009, 10:54 PM   #1
chadwick
Member
 
Registered: Apr 2005
Location: At the 100th Meridian where the great plains begin
Distribution: Debian Testing on T60 laptop
Posts: 105

Rep: Reputation: 17
Trouble updating grub after copying Linux to a RAID1


I'm trying to learn how to copy a Linux installation from a single disk to a RAID1. Right now this is a test to teach myself the procedure, since when I get to the actual computer I'll be doing it on, I may not have as much time to figure it out. So the desired end product may look a little pointless, but that's only because it's a test setup. Unfortunately, it doesn't quite work.

I'm starting with the Debian installation on the local drive on my laptop (/dev/sda) which I'm copying to an external USB drive (/dev/sdb). I've already successfully copied it to some partitions on /dev/sdb:
Quote:
/dev/sdb2 /boot
/dev/sdb3 /
/dev/sdb5 swap
/dev/sdb6 /home
/dev/sdb7 /opt
and can boot into that without a problem. That one isn't a RAID setup, so I'm okay with copying a Linux installation from one disk to another.

My goal is to repeat what I've just done, but copying to a RAID1 instead of to regular /dev/sdb* partitions. However, I can't get grub to work. It might be because of the RAID1, it might be something wrong with grub, or it might be something else entirely. I have little experience with either RAID or grub, so it's hard for me to tell.

So here's what I do. I have the following partitions and the following planned setup:
Quote:
/dev/sdb8 and /dev/sdb12 to mirror each other as /boot
/dev/sdb9 and /dev/sdb13 to mirror each other as /
/dev/sdb10 and /dev/sdb14 to mirror each other as swap
/dev/sdb11 and /dev/sdb15 to mirror each other as /home
While booted into the laptop's local hard drive (/dev/sda), with the USB drive connected, I do the following to the USB drive.

Set up the RAID1 array:
Code:
root-prompt# mdadm --create /dev/md0 -n 2 -l 1 /dev/sdb8 /dev/sdb12
root-prompt# mdadm --create /dev/md1 -n 2 -l 1 /dev/sdb9 /dev/sdb13
root-prompt# mdadm --create /dev/md2 -n 2 -l 1 /dev/sdb10 /dev/sdb14
root-prompt# mdadm --create /dev/md3 -n 2 -l 1 /dev/sdb11 /dev/sdb15
Format the array:
Code:
root-prompt# mkfs.ext3 /dev/md0
root-prompt# mkfs.ext3 /dev/md1
root-prompt# mkswap /dev/md2
root-prompt# mkfs.ext3 /dev/md3
Then I copy all of the contents from the /dev/sda installation to my newly-created RAID1:
Code:
root-prompt# mount /dev/md1 /mnt/mountpoint/

root-prompt# cp -av /bin /etc /initrd /initrd.img /lib \
/lib64 /root /sbin /selinux /srv /usr /var /vmlinuz \
/tmp /cdrom /media /mnt/mountpoint/

#cp left out /boot /dev /home /mnt /opt /proc /sys
#Everything else at / was included.
#Now make the appropriate directories and copy the rest 
#of the stuff over:

lucky:/home/chad# mkdir /mnt/mountpoint/boot
lucky:/home/chad# mkdir /mnt/mountpoint/dev
lucky:/home/chad# mkdir /mnt/mountpoint/home
lucky:/home/chad# mkdir /mnt/mountpoint/mnt
lucky:/home/chad# mkdir /mnt/mountpoint/opt
lucky:/home/chad# mkdir /mnt/mountpoint/proc
lucky:/home/chad# mkdir /mnt/mountpoint/sys

lucky:/home/chad# mount /dev/md0 /mnt/mountpoint/boot/
lucky:/home/chad# cp -av /boot/* /mnt/mountpoint/boot/

lucky:/home/chad# mount /dev/md3 /mnt/mountpoint/home/
lucky:/home/chad# cp -av /home/* /mnt/mountpoint/home/
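(As an aside, the cp list plus the mkdir calls could probably be collapsed into a single rsync with excludes. I haven't tried that on the real tree, so here's just a sketch of the pattern on scratch directories standing in for / and /mnt/mountpoint:)

```shell
# Sketch: copy-with-exclusions via rsync on scratch dirs (stand-ins for
# / and /mnt/mountpoint; untried on the real filesystems).
SRC=$(mktemp -d); DST=$(mktemp -d)
mkdir -p "$SRC/bin" "$SRC/etc" "$SRC/proc"
echo hello > "$SRC/bin/prog"
echo runtime-junk > "$SRC/proc/stuff"   # stands in for pseudo-fs contents
# Copy everything except /proc, then recreate it as an empty mount point:
rsync -a --exclude=/proc "$SRC/" "$DST/"
mkdir -p "$DST/proc"
cat "$DST/bin/prog"
```

The leading slash in --exclude=/proc anchors the exclusion at the source root, which is what you want when copying a whole system.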
Then edit /etc/fstab accordingly:
Code:
proc           /proc           proc    defaults        0       0
/dev/md1       /               ext3    defaults,errors=remount-ro 0       1
/dev/md0       /boot           ext3    defaults        0       2
/dev/md2        none           swap    sw              0       0
/dev/md3       /home           ext3    defaults        0       2
/dev/hda        /media/cdrom0   udf,iso9660 user,noauto     0       0
Then I tried a few ways of updating grub. (I'm using GNU GRUB 1.97, which I understand is, confusingly, also called Grub 2.)

I do update-grub while booted into the non-RAID copy on the external usb drive (/dev/sdb3 mounted on /). This tells me:
Code:
root-prompt# update-grub
Generating grub.cfg ...
Found background image: moreblue-orbit-grub.png
Found linux image: /boot/vmlinuz-2.6.30-2-686
Found initrd image: /boot/initrd.img-2.6.30-2-686
Found Windows XP Home Edition on /dev/sda1
Found Debian GNU/Linux (squeeze/sid) on /dev/sda5
Found Debian GNU/Linux (squeeze/sid) on /dev/sdb9
done
So it finds the image for the kernel that I'm running, plus the Windows partition, /dev/sda5 (the one Linux installation on my laptop), and /dev/sdb9.

My hope is that when I boot into the external drive, grub will give me the option of booting into the /dev/sdb9 (or /dev/md1) installation.

But when I boot into the external drive, it only gives me the option of /dev/sda5 or /dev/sdb3. There's no option for /dev/sdb9 or /dev/md1.

If I do update-grub while booted into the laptop's local drive /dev/sda, then it doesn't mention ever finding the installation at /dev/sdb9. It just says:
Code:
root-prompt# update-grub
Generating grub.cfg ...
Found background image: moreblue-orbit-grub.png
Found linux image: /boot/vmlinuz-2.6.30-2-686
Found initrd image: /boot/initrd.img-2.6.30-2-686
Found Windows XP Home Edition on /dev/sda1
Found Debian GNU/Linux (squeeze/sid) on /dev/sdb3
done
So I'm wondering if either 1) grub isn't working since it seems to be giving me inconsistent results; or 2) if the problem is due to my lack of experience with RAID; or 3) if it's something else altogether.

Any ideas?

Last edited by chadwick; 12-30-2009 at 10:58 PM.
 
Old 12-31-2009, 01:51 AM   #2
RobertP
Member
 
Registered: Jan 2004
Location: Manitoba, Canada
Distribution: Debian
Posts: 454

Rep: Reputation: 32
AAAchh! Too much complexity. It looks like you have a good command of software RAID and of copying filesystems, but the boot process is getting derailed. Grub is sophisticated, but it is still only a glorified bootloader.

Grub 2 is still rather new. If you have the option, stick with legacy grub: it is simpler, and you can edit /boot/grub/menu.lst to fix things. You may also find that a separate /boot partition makes things more complicated. Remember, the booting kernel runs from its initrd and does not see your fstab until it mounts the / filesystem.

Use UUID= instead of /dev/mdx in your fstab, because the devices are often named differently by different kernel/udev setups. The UUID you want is the UUID of the filesystem, not of the mdadm-created device. blkid will give you the UUID attached to the filesystem when mkfs creates it. You may also get it from ls -l /dev/disk/by-uuid.
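For example (the UUID values below are placeholders, not real ones; substitute whatever blkid prints for your filesystems):

```
root-prompt# blkid /dev/md1
/dev/md1: UUID="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" TYPE="ext3"

# then in /etc/fstab:
UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /  ext3  defaults,errors=remount-ro  0  1
```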

To simplify your life, when you want to change something on the copied filesystem, use chroot so that you run the actual software in that filesystem. It may make no difference when you are using a copy, but in general it could be a different version.
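The usual sequence is something like this (a sketch; run as root, and adjust the mount point to taste):

```
root-prompt# mount /dev/md1 /mnt/mountpoint
root-prompt# mount --bind /dev  /mnt/mountpoint/dev
root-prompt# mount --bind /proc /mnt/mountpoint/proc
root-prompt# mount --bind /sys  /mnt/mountpoint/sys
root-prompt# chroot /mnt/mountpoint /bin/bash
# inside the chroot: run update-initramfs, grub tools, etc.,
# then exit and umount everything in reverse order
```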

Another approach to doing such experiments is to use a virtual machine such as VirtualBox.

Another approach is to back up the system files, do a bare installation on the actual RAID and then restore to the RAID.

Another approach is to mount the copies on your file system. Change /etc/fstab appropriately and reboot. If your present system boots, the copies will then be in place. This avoids having to tinker with the bootloader.
Good luck and have fun as a computer brain surgeon...
 
Old 12-31-2009, 02:26 AM   #3
jay73
LQ Guru
 
Registered: Nov 2006
Location: Belgium
Distribution: Ubuntu 11.04, Debian testing
Posts: 5,019

Rep: Reputation: 133Reputation: 133
I don't know about GRUB2, but GRUB was pretty much incompatible with RAID and needed to be installed on a single partition.
 
Old 12-31-2009, 05:08 AM   #4
phil.d.g
Senior Member
 
Registered: Oct 2004
Posts: 1,272

Rep: Reputation: 154Reputation: 154
You're very nearly there, actually.

There are two steps missing. You need to rebuild your initramfs image to include /etc/mdadm/mdadm.conf and the mdadm kernel modules, which is just a case of:

Code:
# update-initramfs -k all -u
And you need to update your grub configuration. Unfortunately I've never used GRUB2; however, have a look at the configuration files in /etc/grub.d/ and /etc/default/grub. It should be straightforward to see what you need to do.

https://help.ubuntu.com/community/Grub2


I hate to be contrary, but thanks to mdadm.conf, mdadm device names will always point to the same entity, and because mdadm is compiled as a module you need to explicitly define all the arrays in that file anyway, so there is no need to specify mdadm devices by UUID or LABEL.

Also, while grub doesn't understand mdadm, RAID1 is a special case: there is nothing wrong with having /boot on RAID1.

PS, just a personal preference - use LABELs rather than UUIDs; LABELs can be meaningful, while UUIDs are just strings of gibberish.

Last edited by phil.d.g; 12-31-2009 at 05:10 AM.
 
Old 12-31-2009, 08:39 AM   #5
RobertP
Member
 
Registered: Jan 2004
Location: Manitoba, Canada
Distribution: Debian
Posts: 454

Rep: Reputation: 32
Quote:
Originally Posted by phil.d.g View Post
I hate to be contrary, but due to mdadm.conf mdadm device names will always point to the same entity, and because mdadm is compiled as a module you need to explicitly define all the entities in that file, so there is no need to use specify mdadm devices by UUID or LABEL.

Also, while grub doesn't understand mdadm, RAID1 is a special case, there is nothing wrong with having /boot on RAID1
On an upgrade of hardware or software, the definitions of devices can change from one boot to another, so it is important to use something better than /dev/sdx in the configuration. Current mdadm configs use UUIDs, but older ones did not. Mdadm generates its own UUIDs, so they will not be the same as the ones generated by mkfs.
Code:
# definitions of existing MD arrays
ARRAY /dev/md0 level=raid1 num-devices=4 UUID=e7078763:b1ee072f:63bbf015:d9e929b6
ARRAY /dev/md1 level=raid1 num-devices=4 UUID=d20ea976:84a2898e:869fcee0:0e7db3e7
ARRAY /dev/md2 level=raid1 num-devices=4 UUID=ed01ecb1:9373f6f6:4e728bbb:3bbed1c9
Without the initramfs rebuild, you could have a system that cannot mount /. If you boot from a partition in a RAID 1, be sure that the boot parameter is ro so that the RAID does not have to resync. All this is automatic on a fresh install, but it can be a problem for a manually created RAID or for an apt-get dist-upgrade, where the kernel and its modules and drivers are likely to change.

BTW, software RAID is a wonderful tool of GNU/Linux. Redundancy, of course, is very useful, but RAID 1 also permits the system to service several reads at once from different disks. It is possible to install the bootloader on each drive of the RAID 1 array so that a single-drive failure leaves the system bootable. You need one entry in the grub menu for each possible boot drive, and you need to run the grub-install command for each drive.
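With legacy grub, that looks something like this (a sketch; the device names are examples):

```
root-prompt# grub-install /dev/sda
root-prompt# grub-install /dev/sdb
```

Each extra menu.lst entry then points root at the copy of /boot on the corresponding disk.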
 
Old 12-31-2009, 09:28 AM   #6
phil.d.g
Senior Member
 
Registered: Oct 2004
Posts: 1,272

Rep: Reputation: 154Reputation: 154
RobertP, I agree wholeheartedly with what you just said, but you've just changed scope (and I don't think I was particularly clear).

mdadm members should be referred to by UUID rather than listing each device by its /dev/{h,s}d* name. However, in /etc/fstab there is no reason not to use /dev/md*; mdadm entity names (md0, md1, etc.) will always refer to the same RAID volume.

The mdadm members, i.e. the physical devices /dev/sda1, /dev/sdb1, etc., do not have a defined order, and as you say the device called /dev/sda one day might be called something else the next.
 
Old 12-31-2009, 10:40 AM   #7
chadwick
Member
 
Registered: Apr 2005
Location: At the 100th Meridian where the great plains begin
Distribution: Debian Testing on T60 laptop
Posts: 105

Original Poster
Rep: Reputation: 17
Thanks, you guys, for all your helpful comments. I can tell from your ideas that I'll be learning a lot of good stuff as I do this. There are plenty of terms there that I've heard before but that refer to things I've never yet dealt with myself.

I hope to let you guys know how it progresses. Oh and happy New Year!
 
Old 01-16-2010, 03:17 PM   #8
chadwick
Member
 
Registered: Apr 2005
Location: At the 100th Meridian where the great plains begin
Distribution: Debian Testing on T60 laptop
Posts: 105

Original Poster
Rep: Reputation: 17
I just got back to working on this issue.

I installed legacy grub and got rid of grub2 on that drive.
I think RobertP might be right that it's better to stick with the legacy grub for the time-being.

I have done the "# update-initramfs -k all -u", but I haven't yet been able to get far enough for it to have mattered.

What I've done is the following:

1) /dev/sdb2 is flagged with the boot flag

2) I have the following entry in grub/menu.lst on /dev/sdb2:
Code:
title		Debian GNU/Linux, external single drive
root		(hd1,2)
kernel		/vmlinuz-2.6.30-2-686 root=/dev/sdb3 ro 
initrd		/initrd.img-2.6.30-2-686
This is the single unmirrored installation that exists on that drive, i.e. the one that I was able to copy successfully. If I boot into the external drive and choose "external single drive" then I boot into that installation no problem.

3) I also have the following entry in grub/menu.lst on /dev/sdb2:
Code:
title		Debian GNU/Linux, external array
root		(hd1,8)
kernel		/vmlinuz-2.6.30-2-686 root=/dev/md1 ro 
initrd		/initrd.img-2.6.30-2-686
This one gives me trouble.

Grub says:
Quote:
Booting external array

root (hd1,8)
Filesystem is ext2fs, partition type 0x83
kernel /vmlinuz-2.6.30-2-686 root=/dev/md1 ro

Error 15: File not found

Press any key to continue...
I also try typing in the commands to grub by hand but get the same problem. I get the "File not found" problem immediately after typing the "kernel /vmlinuz-2.6.30-2-686 root=/dev/md1 ro" command.

The files that I mention in the "external array" menu.lst entry exist, and I can verify that by typing (while booted into the "external single drive" installation):
Code:
#mount -t ext3 /dev/md0 /mnt/md0/

# ls /mnt/md0/vmlinuz-2.6.30-2-686
/mnt/md0/vmlinuz-2.6.30-2-686

# ls /mnt/md0/initrd.img-2.6.30-2-686
/mnt/md0/initrd.img-2.6.30-2-686
Clearly I'm still doing something wrong, but I'm stuck right now as to what it could be.

Last edited by chadwick; 01-16-2010 at 03:18 PM.
 
Old 01-16-2010, 08:08 PM   #9
RobertP
Member
 
Registered: Jan 2004
Location: Manitoba, Canada
Distribution: Debian
Posts: 454

Rep: Reputation: 32
"title Debian GNU/Linux, external array
root (hd1,8)
kernel /vmlinuz-2.6.30-2-686 root=/dev/md1 ro
initrd /initrd.img-2.6.30-2-686
"

is telling the kernel to use /dev/md1 for root and it may not have an md driver. Point it at one partition in the RAID 1 array (with ro!!!) and it will be able to load the partition and start md and mount as in /etc/fstab.

If you are not getting that far, it probably means grub is seeing the wrong devices. Try looking for your kernel by poking around with grub:

Code:
root (hd1,8)
Filesystem is ext2fs, partition type 0x83
kernel /vmlinuz-2.6.30-2-686 root=/dev/md1 ro
Vary (hd1,8) to be (hd0,8), (hd0,7) or (hd1,7), and see whether the kernel is found. Counting is an inexact science when the BIOS does it one way, grub another, and the kernel another... This is an "off by one" error, I suspect. Grub counts from 0, so /dev/sda8 is likely (hd0,7).
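The arithmetic itself is mechanical: drop /dev/sd, map the letter a,b,c to 0,1,2, and subtract one from the partition number. A throwaway shell function to illustrate (a hypothetical helper, not a real grub tool, and only valid when grub's disk order happens to match the kernel's, which, as above, it often doesn't when booting from USB):

```shell
# Map a kernel device name like /dev/sdb9 to the grub name it would get
# IF grub's disk order matched the kernel's (exactly the assumption that
# fails here - grub made the boot drive hd0).
to_grub() {
    dev=${1#/dev/}                            # sdb9
    disk=${dev%%[0-9]*}                       # sdb
    part=${dev#"$disk"}                       # 9
    letter=${disk#sd}                         # b
    idx=$(( $(printf '%d' "'$letter") - 97 )) # ASCII: a->0, b->1, ...
    echo "(hd$idx,$((part - 1)))"             # grub counts partitions from 0
}
to_grub /dev/sdb9   # -> (hd1,8)
to_grub /dev/sda8   # -> (hd0,7)
```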
 
Old 01-16-2010, 09:09 PM   #10
Quakeboy02
Senior Member
 
Registered: Nov 2006
Distribution: Debian Linux 11 (Bullseye)
Posts: 3,407

Rep: Reputation: 141Reputation: 141
I don't want to make you completely lose focus on the fact that you're trying to install grub, but if you get completely stuck, google "Super Grub Disk" and give that a try. It can usually figure out what grub can see and set up the config file so that it sees it.
 
Old 01-16-2010, 11:31 PM   #11
chadwick
Member
 
Registered: Apr 2005
Location: At the 100th Meridian where the great plains begin
Distribution: Debian Testing on T60 laptop
Posts: 105

Original Poster
Rep: Reputation: 17
Thanks Quakeboy02. Grub works now, but that sounds like a useful thing to try. Chances are that will come in handy in the future.

You were right RobertP. I needed to use (hd0,7).
The correct entry in menu.lst is:
Code:
title		Debian GNU/Linux, external array
root		(hd0,7)
kernel		/vmlinuz-2.6.30-2-686 root=/dev/sdb9 ro 
#kernel		/vmlinuz-2.6.30-2-686 root=/dev/md1 ro 
initrd		/initrd.img-2.6.30-2-686
I had convinced myself earlier today that (hd1,8) was correct since (hd1,2) worked for the entry
Code:
title		Debian GNU/Linux, external single drive
root		(hd1,2)
kernel		/vmlinuz-2.6.30-2-686 root=/dev/sdb3 ro 
initrd		/initrd.img-2.6.30-2-686
that's shown in my earlier post from today. It turns out I was completely oblivious to the fact that (hd1,2) actually refers to the /dev/sda3 partition instead of the /dev/sdb2 partition. So I was accidentally loading the kernel and initrd from my laptop's local drive /dev/sda (/dev/sda3 is what gets mounted on /boot on the laptop). I wasn't noticing, since the appropriate partition /dev/sdb3 was still being mounted on /.

I had thought the correspondence between the a,b,c of /dev/sd?? and the 0,1,2 directly following hd in (hd?,?) would not depend on which drive you boot from. Apparently I assumed wrong: the correspondence does depend on which drive you boot from.

Also I had to change
Code:
kernel		/vmlinuz-2.6.30-2-686 root=/dev/md1 ro
to
Code:
kernel		/vmlinuz-2.6.30-2-686 root=/dev/sdb9 ro
Let me explain why I think I had to do that. With root=/dev/md1 it says at bootup:
Quote:
Begin: Assembling all MD arrays...
Failure: failed to assemble all arrays.
and then shortly afterwards it says:
Quote:
[sdb] Attached SCSI disk
and then it hangs and cannot proceed.

With root=/dev/sdb9, it still fails assembling the arrays at the same point. However, it's able to proceed after attaching the external disk, then later on you get:
Code:
[   15.409485] md: md0 stopped.
[   15.683701] md: bind<sdb12>
[   15.684329] md: bind<sdb8>
[   15.698620] md0: WARNING: sdb8 appears to be on the same physical disk as sdb12.
[   15.698766] True protection against single-disk failure might be compromised.
[   15.699306] raid1: raid set md0 active with 2 out of 2 mirrors
and similarly for md1, md2, and md3.

So it seems the array still gets created properly, and indeed it seems to be working fine. The warning is not important for my test case, since in reality I would do it differently anyway.

Then something bad happens. After booting into the array a couple of times, it reports "read-only file system" during bootup, and then anything that requires writing to the filesystem fails and hangs. Maybe something altered one of the partitions during bootup before the RAID array was assembled, but that wouldn't make sense if I'm right in thinking that the ro in the menu.lst line means read-only.

Anyway, I might try e2fsck'ing the array and trying booting again to see what happens.

Last edited by chadwick; 01-16-2010 at 11:39 PM.
 
Old 01-17-2010, 11:37 AM   #12
RobertP
Member
 
Registered: Jan 2004
Location: Manitoba, Canada
Distribution: Debian
Posts: 454

Rep: Reputation: 32
"it reports "read-only file system" during bootup, "

This may indicate some I/O error on the disc. Some distros mount the / filesystem with an error option to switch to readonly.

Here's an entry in my fstab:
"/dev/sda2 / jfs errors=remount-ro"

This options allows one to take corrective action like making a backup while you still can...

Check the log to look for I/O errors. The smartmontools package may also help diagnose problems before they kill your system.
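For example (a sketch; substitute your actual device):

```
root-prompt# smartctl -H /dev/sdb        # overall health self-assessment
root-prompt# smartctl -l error /dev/sdb  # the drive's own error log
root-prompt# dmesg | grep -i 'I/O error'
```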
 
Old 01-17-2010, 04:02 PM   #13
chadwick
Member
 
Registered: Apr 2005
Location: At the 100th Meridian where the great plains begin
Distribution: Debian Testing on T60 laptop
Posts: 105

Original Poster
Rep: Reputation: 17
You're right that there's definitely an error occurring with the file system on /dev/md1.

I rebooted into the laptop's local drive and did:
Code:
# umount /dev/md0 /dev/md1 /dev/md3
# e2fsck -fy /dev/md1; e2fsck -fy /dev/md0; e2fsck -fy /dev/md3;
There were a lot of errors on /dev/md1, but the other two were fine.

When booted into the laptop's local drive (but with the external drive plugged in), /proc/mdstat reports no issues:
Code:
$ cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb8[0] sdb12[1]
      104320 blocks [2/2] [UU]
      
md1 : active raid1 sdb9[0] sdb13[1]
      9213120 blocks [2/2] [UU]
      
md2 : active (auto-read-only) raid1 sdb10[0] sdb14[1]
      610368 blocks [2/2] [UU]
      
md3 : active raid1 sdb11[0] sdb15[1]
      1020032 blocks [2/2] [UU]
      
unused devices: <none>
but when booted into the array itself, /proc/mdstat indicates a problem:

Code:
# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb8[0] sdb12[1]
      104320 blocks [2/2] [UU]
      
md1 : active (auto-read-only) raid1 sdb13[1]
      9213120 blocks [2/1] [_U]
      
md2 : active (auto-read-only) raid1 sdb10[0] sdb14[1]
      610368 blocks [2/2] [UU]
      
md3 : active raid1 sdb11[0] sdb15[1]
      1020032 blocks [2/2] [UU]
      
unused devices: <none>
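(As an aside: a quick way to spot the degraded array without eyeballing the whole file is to grep for an underscore inside the status brackets. A sketch, fed a saved copy of the output above rather than the live /proc/mdstat:)

```shell
# Flag degraded md arrays: an '_' inside the [..] status brackets means a
# missing mirror. The sample text is the mdstat output above, saved to a string.
mdstat='md0 : active raid1 sdb8[0] sdb12[1]
      104320 blocks [2/2] [UU]
md1 : active (auto-read-only) raid1 sdb13[1]
      9213120 blocks [2/1] [_U]
md2 : active (auto-read-only) raid1 sdb10[0] sdb14[1]
      610368 blocks [2/2] [UU]
md3 : active raid1 sdb11[0] sdb15[1]
      1020032 blocks [2/2] [UU]'
echo "$mdstat" | grep -B1 '\[[U_]*_[U_]*\]' | grep '^md' | cut -d' ' -f1
# prints: md1
```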
When booted into the array, I try to add /dev/sdb9 to /dev/md1 but arrive at a problem:
Code:
#  mdadm --manage /dev/md1 --add /dev/sdb9
mdadm: Cannot open /dev/sdb9: Device or resource busy
Note that I don't think /dev/sdb9 is supposed to be mounted anywhere:
Code:
# cat /etc/mtab
/dev/md1 / ext3 rw,errors=remount-ro 0 0
tmpfs /lib/init/rw tmpfs rw,nosuid,mode=0755 0 0
proc /proc proc rw,noexec,nosuid,nodev 0 0
sysfs /sys sysfs rw,noexec,nosuid,nodev 0 0
udev /dev tmpfs rw,mode=0755 0 0
tmpfs /dev/shm tmpfs rw,nosuid,nodev 0 0
devpts /dev/pts devpts rw,noexec,nosuid,gid=5,mode=620 0 0
fusectl /sys/fs/fuse/connections fusectl rw 0 0
/dev/md0 /boot ext3 rw 0 0
/dev/md3 /home ext3 rw 0 0
binfmt_misc /proc/sys/fs/binfmt_misc binfmt_misc rw,noexec,nosuid,nodev 0 0
and I can't think of what program could possibly be using it.

Then I happen to open gparted and look at the line corresponding to /dev/sdb9. It says:
Code:
Partition    File System    Mount Point    Size        Used        Unused         Flags
/dev/sdb9         ext3            /        8.82 GiB    6.09 GiB    2.73 GiB
/dev/sdb13        ext3                     8.79 GiB    6.05 GiB    2.73 GiB
with a little key icon next to /dev/sdb9, probably indicating that it is in use or locked. There's no such icon next to /dev/sdb13.

Out of curiosity, I try booting into /dev/sdb13 instead of /dev/sdb9. I have an entry in menu.lst, similar to the one for /dev/sdb9, that allows me to do that. First I need to repair /dev/md1 since it's broken again. I boot into the laptop's local drive and do:

Code:
root# umount /dev/md0  /dev/md1 /dev/md3

root# e2fsck -fy /dev/md1; e2fsck -fy /dev/md0; e2fsck -fy /dev/md3

# a lot of errors again on /dev/md1 but none on the other two.

root#  mdadm /dev/md1 --fail /dev/sdb9
mdadm: set /dev/sdb9 faulty in /dev/md1

root# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb8[0] sdb12[1]
      104320 blocks [2/2] [UU]
     
md1 : active raid1 sdb9[2](F) sdb13[1]
      9213120 blocks [2/1] [_U]
     
md2 : active (auto-read-only) raid1 sdb10[0] sdb14[1]
      610368 blocks [2/2] [UU]
     
md3 : active raid1 sdb11[0] sdb15[1]
      1020032 blocks [2/2] [UU]
     
unused devices: <none>

root# mdadm /dev/md1 --remove /dev/sdb9
mdadm: hot removed /dev/sdb9

root# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb8[0] sdb12[1]
      104320 blocks [2/2] [UU]
     
md1 : active raid1 sdb13[1]
      9213120 blocks [2/1] [_U]
     
md2 : active (auto-read-only) raid1 sdb10[0] sdb14[1]
      610368 blocks [2/2] [UU]
     
md3 : active raid1 sdb11[0] sdb15[1]
      1020032 blocks [2/2] [UU]
     
unused devices: <none>

root# mdadm --manage /dev/md1 --add /dev/sdb9
mdadm: re-added /dev/sdb9

#wait however long it takes for the drive to no longer be busy
#then to be certain check the file systems again

root# umount /dev/md0 /dev/md1 /dev/md3
root# e2fsck -fy /dev/md1; e2fsck -fy /dev/md0; e2fsck -fy /dev/md3;

#again with several errors in /dev/md1 but none in the other two

#make sure /etc/fstab on /dev/md1 is still okay.

root# mount -t ext3 /dev/md1 /mnt/md1/
root# cat /mnt/md1/etc/fstab
# <file system> <mount point>   <type>  <options>       <dump>  <pass>
proc            /proc           proc    defaults        0       0
/dev/md1        /               ext3    defaults,errors=remount-ro 0       1
/dev/md0        /boot           ext3    defaults        0       2
/dev/md3        /home           ext3    defaults        0       2
/dev/md2        none            swap    sw              0       0
/dev/hda        /media/cdrom0   udf,iso9660 user,noauto     0       0
Then boot into /dev/sdb13 instead of /dev/sdb9.
Again look at /proc/mdstat:
Code:
$ cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb8[0] sdb12[1]
      104320 blocks [2/2] [UU]
     
md1 : active (auto-read-only) raid1 sdb9[0]
      9213120 blocks [2/1] [U_]
     
md2 : active (auto-read-only) raid1 sdb10[0] sdb14[1]
      610368 blocks [2/2] [UU]
     
md3 : active raid1 sdb11[0] sdb15[1]
      1020032 blocks [2/2] [UU]
     
unused devices: <none>
This time try adding /dev/sdb13:
Code:
# mdadm --manage /dev/md1 --add /dev/sdb13
mdadm: Cannot open /dev/sdb13: Device or resource busy
Again I look at what gparted says:
Code:
Partition    File System    Mount Point    Size        Used        Unused         Flags
/dev/sdb9     ext3                         8.82 GiB    6.07 GiB    2.75 GiB
/dev/sdb13    ext3                /        8.79 GiB    6.04 GiB    2.74 GiB
So it's the same as when I booted into /dev/sdb9, except with sdb9 and sdb13 reversed: now the little key icon is next to /dev/sdb13 instead of /dev/sdb9.

Maybe it's an issue with the fact that I say /dev/sdb9 or /dev/sdb13 in menu.lst but then say /dev/md1 in /etc/fstab. Maybe it mounts / before reading /etc/fstab. But if that were the case, then why would there even be an entry in /etc/fstab for /? In the real-life situation that this applies to, I might or might not have the same problem, since I plan to work with a pair of regular internal hard drives rather than a USB drive. But even so, it would be interesting to know exactly what's going on.

Last edited by chadwick; 01-17-2010 at 04:21 PM.
 
Old 01-17-2010, 09:04 PM   #14
RobertP
Member
 
Registered: Jan 2004
Location: Manitoba, Canada
Distribution: Debian
Posts: 454

Rep: Reputation: 32
Are you running a GUI? It could be that your desktop is grabbing the external USB drive to show an icon on the desktop somewhere... I had forgotten that your first post mentioned USB. Booting from USB could be problematic as well because of the order in which md and the other services start. USB devices can take longer to get started, and they might miss the show.

Some folks have used USB thumbdrives for RAID, like http://linuxgazette.net/151/weiner.html but he did not boot from them and they were all the same. md could have some problems with devices with a variety of speeds.
 
Old 01-17-2010, 11:11 PM   #15
chadwick
Member
 
Registered: Apr 2005
Location: At the 100th Meridian where the great plains begin
Distribution: Debian Testing on T60 laptop
Posts: 105

Original Poster
Rep: Reputation: 17
Quote:
Originally Posted by RobertP View Post
Are you running a GUI? It could be that your desktop is grabbing the external USB drive to show an icon on the desktop somewhere...
Thanks again for your comments and analysis. I have a Gnome desktop installed, and that's what I log into. It's true that a lot of icons come up for external drives that are automatically mounted, but none of them are /dev/sdb9 or /dev/sdb13 (the two partitions that are supposed to make up /dev/md1, which is supposed to be mounted on /).

I'm trying to summarize the clues that I have so far to explain what's going on. I've numbered them below.

(1) If I specify to grub to boot using
Code:
kernel		/vmlinuz-2.6.30-2-686 root=/dev/md1 ro
as the kernel line, then an error occurs when the array is supposed to be assembled: it tries to assemble the array before the USB disk is "attached". I don't know precisely what "attach" means here, but trying to assemble the array before the disk is attached presumably leads to an error, and this in turn leads to an error when something tries to use /dev/md1 later, once the disk finally does get attached. This situation leaves the system unbootable.

(2) This leaves the option of specifying to grub to boot using either
Code:
kernel		/vmlinuz-2.6.30-2-686 root=/dev/sdb9 ro
or
Code:
kernel		/vmlinuz-2.6.30-2-686 root=/dev/sdb13 ro
where /dev/sdb9 and /dev/sdb13 are the two partitions that are supposed to be used to make /dev/md1. I'm currently booted into the RAID array using the root=/dev/sdb13 line, so everything that follows corresponds to that possibility.

(3) In either case (1) or (2) above, /etc/fstab specifies that /dev/md1 is to be mounted at /:
Code:
/dev/md1        /               ext3    defaults,errors=remount-ro 0       1
(4) Proceeding with option (2), since that's the only way I know how to proceed: assembly of the array still fails at the same point, but later the following occurs (copied from dmesg):
Code:
[   15.401600] md: md3 stopped.
[   15.694998] md: bind<sdb15>
[   15.695628] md: bind<sdb11>
[   15.712177] md3: WARNING: sdb11 appears to be on the same physical disk as sdb15.
[   15.712316] True protection against single-disk failure might be compromised.
[   15.712524] raid1: raid set md3 active with 2 out of 2 mirrors
[   15.714306] md3: detected capacity change from 0 to 1044512768
[   15.714439]  md3: unknown partition table
[   16.131049] md: md2 stopped.
[   16.133733] md: bind<sdb14>
[   16.134486] md: bind<sdb10>
[   16.136564] md2: WARNING: sdb10 appears to be on the same physical disk as sdb14.
[   16.138384] True protection against single-disk failure might be compromised.
[   16.138939] raid1: raid set md2 active with 2 out of 2 mirrors
[   16.139131] md2: detected capacity change from 0 to 625016832
[   16.139272]  md2: unknown partition table
[   16.382413] md: md1 stopped.
[   16.384609] md: bind<sdb9>
[   16.386206] raid1: raid set md1 active with 1 out of 2 mirrors
[   16.386477] md1: detected capacity change from 0 to 9434234880
[   16.386617]  md1: unknown partition table
[   16.621810] md: md0 stopped.
[   16.664601] md: bind<sdb12>
[   16.665225] md: bind<sdb8>
[   16.666715] md0: WARNING: sdb8 appears to be on the same physical disk as sdb12.
[   16.666862] True protection against single-disk failure might be compromised.
[   16.667400] raid1: raid set md0 active with 2 out of 2 mirrors
[   16.667634] md0: detected capacity change from 0 to 106823680
[   16.667764]  md0: unknown partition table
Note the difference between md1 and the others: the entries for md3, md2, and md0 all have bind statements for two different devices, whereas the entry for md1 only mentions bind<sdb9>, with no corresponding bind<sdb13>. This could be related to the fact that I specified root=/dev/sdb13 in the kernel line of /boot/grub/menu.lst; possibly /dev/sdb13 is already in use and therefore gets left out.

Because of this, I don't think it's the GUI, since the odd behaviour of /dev/md1 already appears before the GUI even exists.

(5) Once I'm booted into the system and logged into my desktop, /etc/mtab says that /dev/md1 is mounted at / and does not mention anything about /dev/sdb13 being mounted:
Code:
# cat /etc/mtab
/dev/md1 / ext3 rw,errors=remount-ro 0 0
tmpfs /lib/init/rw tmpfs rw,nosuid,mode=0755 0 0
proc /proc proc rw,noexec,nosuid,nodev 0 0
sysfs /sys sysfs rw,noexec,nosuid,nodev 0 0
udev /dev tmpfs rw,mode=0755 0 0
tmpfs /dev/shm tmpfs rw,nosuid,nodev 0 0
devpts /dev/pts devpts rw,noexec,nosuid,gid=5,mode=620 0 0
fusectl /sys/fs/fuse/connections fusectl rw 0 0
/dev/md0 /boot ext3 rw 0 0
/dev/md3 /home ext3 rw 0 0
binfmt_misc /proc/sys/fs/binfmt_misc binfmt_misc rw,noexec,nosuid,nodev 0 0
This is after I have unmounted some extraneous partitions that were automatically mounted, none of which were /dev/sdb9 or /dev/sdb13.

(6) According to /proc/mdstat, /dev/sdb13 is not being used: only /dev/sdb9 is being used for /dev/md1:
Code:
# cat /proc/mdstat 
Personalities : [raid1] 
md0 : active raid1 sdb8[0] sdb12[1]
      104320 blocks [2/2] [UU]
      
md1 : active (auto-read-only) raid1 sdb9[0]
      9213120 blocks [2/1] [U_]
      
md2 : active (auto-read-only) raid1 sdb10[0] sdb14[1]
      610368 blocks [2/2] [UU]
      
md3 : active raid1 sdb11[0] sdb15[1]
      1020032 blocks [2/2] [UU]
      
unused devices: <none>
(7) gparted claims that /dev/sdb13 is mounted at / but does not show any mount point for /dev/sdb9.
Remember, this corresponds to the situation where I selected to boot into the /dev/sdb13 partition from grub.

(8) If I try to unmount /dev/sdb13 as root then I get the following:
Code:
# umount /dev/sdb13 
umount: /dev/sdb13: not mounted
(9) If I try to add /dev/sdb13 to /dev/md1, mdadm refuses, saying /dev/sdb13 is busy.

(10) If I do "lsof /dev/sdb13" at the command line, I get a long list of processes, including GNOME components such as gnome-session, gnome-keyring-daemon, nm-applet and gnome-screensaver, plus applications that I started myself such as firefox, gparted and bash.

(11) If I do "lsof /dev/sdb9" at the command line, then nothing gets printed to the screen at all.

(12) If I do "lsof /dev/md1" at the command line, then again nothing gets printed to the screen at all.

So it seems that /dev/sdb13 is the one that's actually being used, even though /etc/mtab doesn't think that it's mounted.

(13) If I boot into something other than this array on the USB drive, such as the laptop's local drive, then there's no problem. The array behaves normally: /dev/sdb9 and /dev/sdb13 are both part of /dev/md1, and no error appears until I run e2fsck on /dev/md1, which then reports errors in the filesystem.

(14) If I want to reboot into one of the partitions of the array, it isn't sufficient just to do an e2fsck of /dev/md1. I have to mark /dev/sdb13 as failed in /dev/md1, remove it from the array, re-add it, allow the two partitions in the array to resync, then do an e2fsck of /dev/md1, make sure /etc/fstab and perhaps other important files weren't destroyed, and only then can I reboot.
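For reference, the recovery sequence in step (14) can be written out with mdadm. This is a dry run that only prints the commands, using the device names from this thread; drop the leading "echo" on each line to actually execute them:

```shell
# Dry run: print the rebuild sequence for md1 (sdb13 is the stale half).
# Remove the leading "echo" on each line to run the commands for real.
md=/dev/md1
part=/dev/sdb13
echo mdadm "$md" --fail "$part"     # mark the stale mirror as failed
echo mdadm "$md" --remove "$part"   # pull it out of the array
echo mdadm "$md" --add "$part"      # re-add it; the kernel resyncs from sdb9
echo "cat /proc/mdstat"             # repeat until it shows [2/2] [UU]
echo e2fsck -f "$md"                # then check the filesystem
```

Resync progress shows up in /proc/mdstat; running e2fsck before the resync finishes would check a half-rebuilt mirror.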

I've repeated some things that I already mentioned in previous posts, but that's because I'm trying to summarize what I know.

As you say, RobertP, maybe I'm reaching the limits of what I can do with a USB drive, so chances are good that I won't run into this problem in an actual practical scenario. But there's something interesting about what's going on here. It makes me want to understand why it's working the way it is, or why it works at all for that matter, and I think if I could understand it I'd have a chance of understanding Linux better.

An example of some questions on my mind:

Q1. Why do /etc/mtab and umount not know that /dev/sdb13 is mounted, yet gparted does? If /etc/mtab and umount are right and /dev/sdb13 isn't mounted, then how is /dev/sdb13 still being used? If gparted is right and /dev/sdb13 is mounted, then why doesn't /etc/mtab know about it? Did /etc/mtab come too late in the game or something?

Q2. How does gparted determine what's mounted and where, since it seems to be aware of something that /etc/mtab doesn't know about?
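On Q2: as far as I know (worth verifying against your parted version), gparted asks libparted, which consults the kernel's own mount table in /proc/mounts rather than /etc/mtab. /etc/mtab is just an ordinary file that mount(8) updates in userspace, so the two can disagree in exactly this kind of situation. A quick comparison for the root mount:

```shell
# The kernel's mount table vs. the userspace copy, for the root mount.
awk '$2 == "/"' /proc/mounts   # what the kernel believes is mounted at /
awk '$2 == "/"' /etc/mtab      # what mount(8) has recorded
```

If the first line names /dev/sdb13 while the second names /dev/md1, that would match what gparted and /etc/mtab are each reporting.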

Q3. Am I right in thinking that the discrepancy arises from the difference between the partition that /boot/grub/menu.lst specifies as root on the kernel line and the partition that /etc/fstab specifies as being mounted on /?

Q4. What else can I do or look at in order to get clues about what's going on?
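On Q4, one more thing worth checking (a sketch using standard coreutils; the major/minor values given in the comments are my assumptions, to be verified against your system): the device number of the root mount tells you which block device really backs /, independently of what /etc/mtab claims.

```shell
# Hex device number (major/minor combined) of whatever is mounted at /.
# md1 is normally major 9, minor 1, i.e. 901 in hex; sdb partitions have
# major 8. Compare against the major/minor columns in /proc/partitions.
stat -c 'root is on device 0x%D' /
grep -wE 'md1|sdb9|sdb13' /proc/partitions || echo 'no md/sdb devices here'
```

If stat reports the device number of sdb13 rather than 0x901, that would confirm the raw partition, not the array, is backing /.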

Last edited by chadwick; 01-17-2010 at 11:13 PM.