Trouble with Slack 10.2 RAID 0 install on SATA drives

vonst · 05-21-2006, 12:35 PM

re: http://www.linuxquestions.org/linux/..._on_Slack_10_2

<<I have some problems with my setup. Sorry that I needed to crosspost. I should have written my thread here first, I think.>>

This article is great! It just so happened that I needed to set up a 4-disk RAID 0 system using the Slackware 10.2 install disks. It took me a couple of tries, but I presently have /dev/md0 running. Setup installed the 10.2 build on /dev/md0, and my boot directory is on /dev/sda3 (mount /dev/sda3 /boot). After several tries, LILO happily added itself to my MBR on /dev/sda.

/dev/md0 consists of: /dev/sda1 + /dev/sdb1 + /dev/sdc3 + /dev/sdd1

I cannot boot.

Per the article, I made my raidtab file and set persistent-superblock 1. The 4 RAID partitions (all marked "FD", Linux RAID Autodetect) are supposed to be recognized at bootup and assembled into a RAID array for me. RAID0 in my case.
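One way to double-check the partition types before blaming the superblocks is to filter `fdisk -l` output for partitions whose Id column reads `fd`. This is only a sketch: the helper name `list_raid_autodetect` is made up for illustration, and it assumes the classic `fdisk -l` table layout with an Id column.

```shell
# Print every partition whose type Id is "fd" (Linux raid autodetect).
# Reads `fdisk -l` style output on stdin; list_raid_autodetect is a
# hypothetical helper name, not a standard tool.
list_raid_autodetect() {
    awk '/^\/dev\// {
        # The optional "*" boot flag shifts the columns, so scan for the Id.
        for (i = 2; i <= NF; i++) {
            if ($i == "fd") { print $1; break }
        }
    }'
}
```

Run as root, something like `fdisk -l /dev/sda | list_raid_autodetect` should list each partition on that disk that is meant to join the array.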

=-=-=-=-=-= THE PROBLEM =-=-=-=-=-=-=-=
Boot up screen consistently shows that it cannot find the superblocks and complains
"invalid raid superblock on"...
=-=-=-=-=-= =-=-=-=-=-=-=-=

I've googled this to death. I guess my problem is pretty rare.

Here are a couple other points to note:

1. I have 4 SATA hard drives. I selected "sata.i" as my startup kernel, as it was the only one that would work: bare.i and raid.s don't see SATA drives, test26.s doesn't see RAID.

2. When booting the sata.i kernel, it makes the same complaints about not seeing superblocks for my FD'd partitions.

3. This system is going to be dual-boot with WinXP. I have 4 partitions lying in wait to become XP RAID, but I don't want to talk about them. They are not set up at all, just defined as "NTFS". I have 1 partition as my XP "C:" drive for now. It works, but LILO won't load it either right now. I have one FAT32 partition that bridges Linux and WinXP. 12 partitions total: 5 NTFS, 1 FAT32, 1 ext3, 4 FD, and of course 1 swap.

4. Maybe this matters... each of the FD partitions is 120 GB. I want a 480 GB RAID 0 drive! (You know, give or take a gig or 10.)

Is there an answer to my problem? I've been looking for some switch, or options that I can put in LILO so that LILO will suddenly start functioning right. I don't know what to do at this point!

Aerich Strobel
Alexandria, VA

meetscott · 05-22-2006, 09:30 AM

Here's my software raid 1 lilo.conf:
boot = /dev/md2
raid-extra-boot="/dev/hda,/dev/hdb"
message = /boot/boot_message.txt
prompt
timeout = 1200
default = Linux-2.6.16
vga = 791

# Linux bootable partition config begins
image = /boot/vmlinuz
root = /dev/md2
label = 2.4.31-bare.i
read-only
# Linux bootable partition config ends

# Linux bootable partition config begins
image = /boot/bzImage-hi-mem
root = /dev/md2
label = 2.4.31-hi-mem
read-only
# Linux bootable partition config ends

# Linux bootable partition config begins
image = /boot/bzImage-2.6.16
root = /dev/md2
label = Linux-2.6.16
read-only
# Linux bootable partition config ends

As it stands now, either partition will boot the system. You will be hard pressed to find someone who is striping across 4 disks. The failure exposure is too high. You're completely dead if only one disk fails. I personally have approximately one disk failure every year to year and a half. If you have the same luck as me, that's one every 3 to 4 1/2 months for you!
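The arithmetic behind that estimate, as a rough sketch. It assumes each disk fails independently at the same average rate, so a stripe of N disks hits a failure about N times as often as a single disk. The function name `stripe_failure_interval` is only for illustration.

```shell
# Rough expected months between failures for an N-disk stripe, assuming
# independent disks that each fail about once every per_disk_months.
stripe_failure_interval() {
    per_disk_months=$1
    disks=$2
    awk -v t="$per_disk_months" -v n="$disks" 'BEGIN { printf "%.1f\n", t / n }'
}

stripe_failure_interval 12 4   # one failure a year per disk, 4 disks
stripe_failure_interval 18 4   # one failure every year and a half per disk
```

The two calls print 3.0 and 4.5, which is where the "3 to 4 1/2 months" figure comes from.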

As for fstab:
/dev/md1 swap swap defaults 0 0
/dev/md2 / reiserfs defaults 1 1
/dev/md3 /home reiserfs defaults 1 2

The system is handling raid transparently. I provided this because it may (or may not) be useful to you. Actually, I only boot the 2.6 kernel. The others were for testing different compilations. The 2.4 kernel is a little faster but doesn't have some of the features I want.

Although my home system is EIDE, I have a similar SATA system at work. This should not matter. I run 2.6 on my SATA drive at work too... all Slackware. Your core issue is lilo and possibly kernel configuration/recompilation. Start with lilo man pages and then look at rolling your own kernel for features you may (or may not?) be missing. Hope this helps.

vonst · 05-22-2006, 12:35 PM

I was successfully able to pull LILO off of my disks... It's a mess to put it on /dev/sdc and then say, oh, maybe /dev/sda... and then forget to remove the one. Interestingly, Linux's /dev/sdc is WinXP's C:. /dev/sda is disk #3... (Which gets LILO in the MBR?)

I see that a lot of your variables are different from mine, and they show up differently. Firstly, you're using RAID1. When I tried to set "boot=/dev/md0" in lilo.conf, it said "only for RAID1!" You're also using reiserfs. I know nothing about it, having chosen to move to ext2 a long time ago, when ReiserFS was having problems and people were saying "it's going away".

I don't know how to roll a kernel... I think I'm going to try compiling a kernel on this machine and then copying it over into my /dev/md0 on my other machine. I don't know that it'll work, but hey why not? (I'll be passing it the long way: FAT32 from this box, thru thumbdrive to WinXP on my other box, to FAT32 on that box to my /boot, when I carefully find /dev/md0 on that box! Complicated.)

If all else fails, there's another HOWTO for moving a running system from one disk to a RAID partition. I'll just start over, install on my /dev/sda3 disk, get it updated a bit and running with a tweaked kernel, and then copy the whole thing over to /dev/md0.

--vonSt
PS: Thanks for the failure stats for RAID. So far, the posts I've read speak to the possibility of failure, but never say how often it happens... One question (since this is my first try): If a disk "fails" is that complete? $100 out the window and buy another one? Or is it one disk "fails" in the RAID, the partition is jacked and you need to fdisk your partitions back into shape and start all over?

meetscott · 05-22-2006, 03:05 PM

Wow lots of stuff! Here goes...

Lilo conf:
You made a key point here and you may not even realize it. I once had a system that would not allow me to set the Master/Slave settings the way I wanted... cable select didn't work either. So I had to lie to the system and put lilo on the disk that the system insisted on seeing first (remember, this was regardless of the hardware configuration inside the physical computer). Lilo has a way of switching the master/slave order after it takes over. I believe this will apply to SATA in the same way. I can't remember what I did or I would give you an example. You may not need this, but the point is you can configure it if you need to. I think your system is not going to the MBR of the disk you think it is. You have 2 options here: change the BIOS settings so that the system first checks the disk that has the MBR on it (which may not be an available option), or install lilo on the disk the system is checking first. Stay with me... I'll tell you how you can force that at the end here.

File System:
The file system should not matter. Reiser is going to be around for a long time. It's up for debate but I personally feel it will become the "winning" standard. I've been using it for years with no problems. It performs very well! It seems others have reported problems with ReiserFS. I would be inclined to give them some credit, but my experience with it has been so positive, I'm leaning toward some sort of misuse/misconfiguration on other people's part. I don't know what to say about those testimonies. The only other thing I would say for you is to use a journaling file system and not ext2. I used it for a long time. Ext2 served me well, but it can easily corrupt and leave an inconsistent file system state when there is a failure of some kind. You don't have to choose Reiser, but choose some journaling file system.

Kernel:
For now, forget about recompiling the kernel until you get these other things working. Unless you can get some help, you're asking for more problems. There's a lot to it. You may have no choice though because what you need may not be in the stock kernel.

Disk Failures:
These disk failures I've had have been complete failures... like buy a new disk failures. I think I probably use mine more than most people (I run servers with commodity PC hardware 'cause I'm a cheapo) but that's my experience going back to 1996. Every 12 to 18 months, I lose one. On a mirrored system, you buy a new disk, stick it in, configure it and the system will sync it up automagically. Takes a few hours but then it's ready for the next failure! You lose one on a striped system (raid 0) and you're not only buying a new disk (you only need to replace the failed one), you're starting over from scratch. Striping will store parts of a single file on multiple disks. Better speed/space for the money but you lose one and I don't think you'll be able to recover anything. It's up to you if this is okay with you. There are other raid configurations that can make this easier to swallow. You always give up some space for data safety, even with raid 5, though you lose less than mirroring.

Hacking lilo from a rescue boot:
Boot your rescue media: CDs, floppies, whatever. Mount the root file system you are trying to fix (so in this case you are going to have to make sure you use a rescue kernel with support for your raid system). Change directory into the file system you just mounted. Issue "chroot ." where the period means the current directory. After editing lilo.conf on the host system (not the temporary rescue system), issue something like: /sbin/lilo -C /etc/lilo.conf (Note: this is the default config file location, but I gave it to you in case you have to force something different).
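The rescue steps above can be sketched as a small dry-run helper. Everything specific here is an assumption: the function name `rescue_lilo` is hypothetical, /mnt is an arbitrary mount point, and the defaults (root on /dev/md0, /boot on /dev/sda3) are taken from the layout described in the first post. It also assumes the array has already been started under the rescue kernel.

```shell
# Dry-run sketch of the rescue steps: prints the commands unless RUN=1.
# Device names and the /mnt mount point are assumptions, not fixed names.
rescue_lilo() {
    root_dev="${1:-/dev/md0}"
    boot_dev="${2:-/dev/sda3}"
    mnt="${3:-/mnt}"
    if [ "${RUN:-0}" = "1" ]; then run=""; else run="echo"; fi
    # Mount the installed system, then run lilo inside it in one step;
    # "chroot DIR COMMAND" is the one-shot form of "cd DIR; chroot .".
    $run mount "$root_dev" "$mnt" &&
    $run mount "$boot_dev" "$mnt/boot" &&
    $run chroot "$mnt" /sbin/lilo -C /etc/lilo.conf
}
```

Calling `rescue_lilo` with no arguments only prints the three commands. Setting `RUN=1` as root would actually execute them, so check the printed device names first.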

Lots of stuff... let me know if you need clarification on anything. This is a huge dump of information.

vonst · 05-22-2006, 06:50 PM

Excellent information. Right now, I want to RAID 0 w/ 4 disks for fun. I want to see just how screaming fast it can get. I bought 4 WD's (3 WD2500's and 1 WD3200) from Newegg.com, so I hope that they are quality. Also, I'm only running a desktop. Sure, it's overkill, but it's for fun! (And all this frustrating configuration is my brand of hack Slack fun, too.)

This is what I did between my last post and now... I read "man lilo" and "man lilo.conf" very, very carefully. That allowed me to put "boot=/dev/sda" in lilo.conf and "lilo -b /dev/sdc" on the command line. It worked like a champ, even tho it complained about the different disks.

My problem is not fixed ...

Ha! Caught the little bugger!

May 22 19:29:50 pluto sshd[16645]: Did not receive identification string from 210.87.160.194
May 22 19:30:30 pluto sshd[16648]: Failed password for root from 210.87.160.194 port 56387 ssh2
May 22 19:30:30 pluto sshd[16651]: Failed password for root from 210.87.160.194 port 56387 ssh2

I dropped carrier two seconds after he started...

Anyway, as I was saying, I compiled the latest kernel on my old machine and handcarried it over to my new machine and installed it on /boot. I'm still stuck. I believe that the problem is NOT my install, but has something to do with ... I don't know, actually.

In the initial post, I put 2 lines on the screen and called it "THE PROBLEM". It's still the problem. All my RAID boots have been reporting this. The installer sata.i and sata.i (in /boot) AND my new "I thought I did everything right" kernel that I copied over (and put in /boot). I got a copy of the dmesg from the installer. I've never understood most of the data there, but "md: could not import sda1!" is pretty clear to me. BUT I CAN STILL MKRAID AND GET /DEV/MD0 TO WORK RIGHT!

So, I think I'm going to move this post up a level and change it to "Here's a copy of my dmesg. How do I fix it?"

Thanks for your help so far!!!

--vonSt

meetscott · 05-23-2006, 09:41 AM

Sorry we weren't able to get any further with it. Good idea to repost and see if you can get someone else to look at it. I hadn't looked at your other post, so I ended up repeating some information. I was too lazy to follow up on it before.

I was just wondering... what's the output of /proc/mdstat ? Here's mine. Stuff is working properly here.

root@webhost1:/proc# cat mdstat
Personalities : [raid1] [multipath]
md1 : active raid1 hdb2[1] hda2[0]
2008000 blocks [2/2] [UU]

md2 : active raid1 hdb5[1] hda5[0]
20008832 blocks [2/2] [UU]

md3 : active raid1 hdb6[1] hda6[0]
222138688 blocks [2/2] [UU]

md0 : active raid1 hdb1[1] hda1[0]
40064 blocks [2/2] [UU]

unused devices: <none>

When you say you have stuff installing/working okay, I just wonder if the system IS working properly. Yours will be different because you're striping. Time for some fresh eyes and ideas.

vonst · 05-23-2006, 03:41 PM

Typing the response, instead of forwarding it to this machine...

Quote:

cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid5]
read_ahead 1024 sectors
md0: active raid0 sdd1[3] sdc3[2] sdb1[1] sda1[0]
468744512 blocks 16k chunks

unused devices: <none>

Mind you, this is what comes up after I load up the sata.i kernel and see all my "invalid raid superblock" errors, AND AFTER I mkraid the partitions from the command line.

To me the weird thing is that on the motherboard, there are 4 SATA ports in a line. I graphed them. From left to right, WinXP sees them as "C: D: E: F:", but Linux sees them as "sdc sdd sda sdb". I really don't know what the BIOS sees them as...

As you can see in the /proc/mdstat output, Linux is happy as spit to have sda[0] sdb[1] sdc[2] sdd[3], but LILO and the kernel don't seem to be happy about it or something...
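To read the slot order straight out of /proc/mdstat rather than by eye, a small filter along these lines could help. It is a sketch only: `md_members` is a hypothetical helper name, and it only recognizes member fields of the plain `name[slot]` form shown in this thread.

```shell
# List the members of one md array in slot order from /proc/mdstat-style
# text on stdin. md_members is a hypothetical helper name.
md_members() {
    awk -v md="$1" '
    {
        name = $1
        sub(/:$/, "", name)          # accept both "md0 :" and "md0:"
    }
    name == md {
        for (i = 2; i <= NF; i++) {
            p = index($i, "[")
            # Member fields look like sda1[0]; skip everything else.
            if (p > 1 && substr($i, length($i)) == "]") {
                dev  = substr($i, 1, p - 1)
                slot = substr($i, p + 1, length($i) - p - 1)
                print slot, "/dev/" dev
            }
        }
    }' | sort -n
}
```

Used as `md_members md0 < /proc/mdstat`, the output quoted above would come back as slot 0 /dev/sda1, slot 1 /dev/sdb1, slot 2 /dev/sdc3, slot 3 /dev/sdd1, which is the order the raidtab has to match.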

Sigh, and I took a look, it's been 24 hours and nobody responded to the post. Given the turn around time here, that's not too good.

--vonSt

meetscott · 05-24-2006, 11:58 AM

Let's not give up. I don't like to lose and I hope you keep trying. This looks good. It appears to be working properly. Like you say, lilo seems to be the issue. If you're getting this output with sata.i then I would say stick with it and things should be fine.

The lilo problem appears to be similar to the one I was mentioning earlier. In this case we don't care what XP sees. It's only going to be important when we want to boot XP with lilo. Have you tried to install lilo on either a boot floppy or on the MBR of sdc instead of sda? I want to know if lilo can start the process at all. If it can, then we know that the only problem is _WHICH_ MBR the BIOS is sending us to first in order to boot the system. See if you can find this out and we'll move from there. We gotta get this working. What if I want to try something like this someday? I MIGHT need to know how this works ;-)

cwwilson721 · 05-24-2006, 01:45 PM

Were you trying to install lilo in 'expert' mode? If not, try that. I have had problems w/lilo if not using expert mode.

One other 'minor' point (I'm sure you looked into it already, but did not mention):

Does your BIOS have an "antivirus" or "protect MBR" setting in it? If so, disable that. It won't allow writing to the MBR if it is set, thus messing up lilo....

And DO set lilo to MBR of first disk, i.e. /dev/sda

vonst · 05-24-2006, 06:44 PM

I should have put my dmesg info on this thread as well as the new thread:

I was finally able to remove LILO from both my /dev/sda and /dev/sdc. I read and researched a lot to get there. In doing that, I learned how to put boot=/dev/sda in lilo.conf and then run lilo -b /dev/sdc from the command line. It worked like a charm, in that it didn't fail; on reboot, the prompt came up; and WinXP loaded in just fine. But, the Linux load failed just like before.

Here's part of the dmesg where SATA comes in. I don't understand the ID values and such...

Quote:

libata version 1.10 loaded.
sata_nv version 0.6
PCI: Setting latency timer of device 00:07.0 to 64
PCI: Setting latency timer of device 00:08.0 to 64
ata1: SATA max UDMA/133 cmd 0x9F0 ctl 0xBF2 bmdma 0xD400 irq 11
ata2: SATA max UDMA/133 cmd 0x970 ctl 0xB72 bmdma 0xD408 irq 11
ata1: dev 0 cfg 49:2f00 82:746b 83:7f01 84:4023 85:7469 86:3c01 87:4023 88:407f
ata1: dev 0 ATA, max UDMA/133, 488397168 sectors: lba48
ata1: dev 0 configured for UDMA/133
ata2: dev 0 cfg 49:2f00 82:746b 83:7f01 84:4023 85:7469 86:3c01 87:4023 88:407f
ata2: dev 0 ATA, max UDMA/133, 488397168 sectors: lba48
ata2: dev 0 configured for UDMA/133
ata3: SATA max UDMA/133 cmd 0x9E0 ctl 0xBE2 bmdma 0xC000 irq 10
ata4: SATA max UDMA/133 cmd 0x960 ctl 0xB62 bmdma 0xC008 irq 10
ata3: dev 0 cfg 49:2f00 82:746b 83:7f61 84:4023 85:7469 86:3c41 87:4023 88:407f
ata3: dev 0 ATA, max UDMA/133, 625142448 sectors: lba48
ata3: dev 0 configured for UDMA/133
ata4: dev 0 cfg 49:2f00 82:746b 83:7f01 84:4023 85:7469 86:3c01 87:4023 88:407f
ata4: dev 0 ATA, max UDMA/133, 488397168 sectors: lba48
ata4: dev 0 configured for UDMA/133

Would this say anything about how the kernel is choosing its boot priority?
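The sector counts are the easiest part of that dmesg to read, and they at least tell the drives apart. A quick sketch, assuming 512-byte sectors (the log does not state the sector size) and using a made-up helper name `sectors_to_gb`:

```shell
# Convert a dmesg sector count to whole decimal gigabytes, assuming
# 512-byte sectors (an assumption; the log does not state the size).
sectors_to_gb() {
    awk -v s="$1" 'BEGIN { printf "%d\n", s * 512 / 1000000000 }'
}

sectors_to_gb 488397168   # ata1, ata2 and ata4
sectors_to_gb 625142448   # ata3
```

This prints 250 and 320, so ata3 is the larger WD3200 and the other three ports hold the WD2500s. It says nothing about boot priority by itself, only which physical drive sits on which ata port.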

Speaking of boot priority, I got into my Phoenix Award WorkstationBIOS and found the part where I can choose my boot priority. This is what it shows:

1. Ch2 M. : 3.2G HD (with its whole serial number)
2. Ch3 M. : 2.5G HD ( " )
3. Ch4 M. : 2.5G HD ( " )
4. Ch5 M. : 2.5G HD ( " )
5. Bootable Add-in Cards

I checked this out before even starting the software installs: Ch2-5 represent the 4 SATA ports on my MB from left to right. WinXP sees Ch2 as C:, but Linux seems to think it should be Ch4. Or maybe I should say, Ch4 is /dev/sda. Ch2 is /dev/sdc.

cwwilson, I have specifically loaded LILO to /dev/sda's MBR. Bootup ignores it and loads WinXP directly. Ya know, I never tried to put LILO on /dev/sdb or /dev/sdd... Also, I double checked: I have no BIOS virus protectors. As you can see, LILO loads fine into the /dev/sdc MBR.

I'm pretty sure it'll hose everything, but I'll try changing around my boot priority too...

--vonSt

vonst · 05-24-2006, 09:12 PM

Oh, this is fabulous... Test case 1.

I switched cables for #1 and #2, then I "switched" them back in the BIOS. (Note the differences with the post above...)

1. Ch3 M. : 3.2G
2. Ch2 M. : 2.5G
3. Ch4 M. : 2.5G
4. Ch5 M. : 2.5G

The following things happened:
1. LILO booted.
2. WinXP loaded just fine. (Linux actually loaded, but kernel panicked like before.)
3. Superblocks were still invalid and I couldn't RAID on login. (when I booted with the install disk).
4. Linux ignored the bios and reassigned as follows:

1. /sdc 2. /sdd 3. /sda 4. /sdb just like before!

However, since I switched cables, /dev/sdd now has 3.2G, and /dev/sdc has 2.5G.

Logic tells me that the BIOS boots in order as given by the BIOS (in this case 2134). Linux, for whatever reason, boots 3412 and doesn't care what the bios says. So, I'm not sure it's going to matter, but tomorrow I'm going to plug my 3.2G into slot 3. Then the 3.2G will be /dev/sda.

There's a big BUT in that, tho. My /boot directory cannot be on the 3.2G. Due to my partitioning scheme, it has to be on one of the 2.5G's. The only thing this exercise is telling me is that the kernel ignores the BIOS and loads 3412.

--vonSt

meetscott · 05-25-2006, 10:09 AM

This is not surprising. This is exactly what I was referring to earlier. Sometimes it just doesn't matter. Linux sees things the way it wants and Windows sees it the way it wants. I was on EIDE and tried cable as well as master/slave settings on the drives. When it was all said and done, the only thing I could do was live with it and switch them around with lilo (lie to Windows in my case) to give it the drive order I intended. Why would it be any different for SATA?

One other thing. Linux doesn't normally care what's in the BIOS. Case in point: I wanted to put an obscenely large hard disk in a Pentium 100 to make a file/print server. My only option was Linux because the BIOS refused to see anything larger than 8 Gigs. Mind you this was years ago, but I had a 20 Gig drive and Linux had no trouble with seeing ALL of it. Reading large disk how-tos explained that Linux doesn't use the BIOS for these things. I used that server for 2 years. My point is that the BIOS is doing one thing for us in this case and that's handing us off to the MBR so the operating system of choice can take over. As far as disks go, this is all it's going to do for us.

Good info on the kernel. I guess that's where we have to go next. I can't see anything wrong with what you gave as output. Doesn't mean there's not, I just don't see it. Since the system is doing a "kernel panic", I'm thinking that it might be time to do a kernel recompile. And I'm inclined to get the latest sources from kernel.org for a 2.6 series kernel. I've been running it for a couple of months and it seems to be good. The performance is slightly less than the 2.4 kernel but it has some features I needed so I finally switched.

You can compile the kernel on any machine but I'm thinking you should do it on your install machine. And maybe even get the system up and running normally with one disk so you have some tools and things working before you start raid striping. Patrick Volkerding has config files available for all his kernels. So even if you're compiling a later 2.6 kernel you should start with his config. Then all you have to do is focus on changes or things you want to configure and leave the rest alone. See my other post here:

http://www.linuxquestions.org/questi...d.php?t=445443

Another option is to try to do it from the install media. You'll have to be careful and chroot but that will get a temporary system up for you while you're doing compilations. Don't forget to configure and re-run lilo when you're done. I'm notorious for getting it all done and forgetting that last little thing ;-) This isn't as bad as it seems. Let me know if you get stuck with anything.

vonst · 05-26-2006, 08:17 PM

Last night, I switched around my cables. Now, lilo.conf says boot=/dev/sdc and I install lilo with lilo -b /dev/sda. For the record, absolutely nothing changed. Well... I needed to completely recreate all my links for the new drive order. Windows runs on /dev/sda1 now.

In case anybody ever wondered about this: when you already have RAID0 and then scramble your disks, and put them out of order without modifying your raidtab file, you WILL NOT be able to make RAID. I wrote it like this to keep it from looking like I trashed my disks. What I ended up doing was rewriting raidtab to reflect the new disk order and then remounting /dev/md0 and reinstalling Slack 10.2. It still has the same problems.

I have only now learned how to use chroot. I am going to try to go into my RAID array and compile a kernel from there. (As I've already noted, I compiled the kernel on my "old" machine and transferred it to my "new" machine, and the exact same problem occurred.)

--vonSt

vonst · 07-04-2006, 03:52 PM

*** SOLUTION *** SOLUTION *** SOLUTION ***

Thanks to everybody that helped me figure out this complicated and logic-defying problem.

I found the solution in the old Software-RAID-HOWTO, while searching for instructions and applications for mdadm.

I installed Slackware on a single partition, ran mdadm, and have had a persistent /dev/md0 since! Here's the code I used:

Code:

mdadm --create --verbose /dev/md0 --level=raid0 --raid-devices=3 /dev/sdb2 /dev/sdc2 /dev/sdd2
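Since the original symptom was "invalid raid superblock", it may be worth confirming that each member now carries one. The sketch below wraps `mdadm --examine`; the helper name `check_superblocks` is hypothetical, and it assumes that `mdadm --examine` exits non-zero when it finds no md superblock on a device.

```shell
# Report whether each listed partition carries an md superblock.
# MDADM can be overridden; check_superblocks is a hypothetical helper name.
# Assumption: `mdadm --examine` exits non-zero when it finds no superblock.
MDADM="${MDADM:-mdadm}"

check_superblocks() {
    rc=0
    for dev in "$@"; do
        if "$MDADM" --examine "$dev" >/dev/null 2>&1; then
            echo "$dev: md superblock found"
        else
            echo "$dev: no md superblock found"
            rc=1
        fi
    done
    return "$rc"
}
```

Run as root against the members from the command above, for example `check_superblocks /dev/sdb2 /dev/sdc2 /dev/sdd2`. A non-zero return means at least one device did not report a superblock.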

--vonSt

meetscott · 07-07-2006, 01:47 PM

You're welcome on what little help I offered. Now everyone can reference your solution. Nice going.