LinuxQuestions.org

mr-roboto 09-17-2012 04:02 PM

Bizarre Server Booting Issues
 
My question pertains to an Ubuntu-based fileserver appliance called TurnkeyLinux. I'm posting here because this is a truly bizarre, generic Linux issue and I'm hoping to reach the widest number of eyeballs for a solution.

Server was working fabulously for months (two years in fact), then by accident, I discovered the boot hard drive had completely failed. Took the PC home over the weekend, replaced the boot hard drive, reinstalled the server software, and all should've been well, but wasn't.

To visualize the next part, one must know about the exact hardware config. Primary master is a simple CD-ROM, primary slave is the 15GB boot drive. The secondary master+slave are two (2) 500GB drives. Since these are all IDE drives, I believe everything is jumpered as cable-select.

After the TKL server appliance was installed, the PC wouldn't reboot into Ubuntu! I began by retracing my steps till (hours later) I discovered the only procedure that would work:
  1. Start up PC. Only displays a cursor. No disc activity.
  2. Enter BIOS Setup and switch local IDE config to primary only (to disable the secondary controller). Reboot.
  3. PC immediately loads GRUB and loads the Ubuntu/TKL fileserver appliance, but somehow this is an unconfigured server. Reboot.
  4. Re-enter BIOS Setup and change local IDE config back to both controllers. Reboot.
  5. PC immediately loads GRUB and the Ubuntu/TKL fileserver appliance, and the customer's fileserver is operational again!

For reasons that I can't explain, the PC will not cold-boot into TKL anymore. I stumbled onto the BIOS Setup workaround when I physically disconnected the storage array (ie. the secondary IDE drives) and the PC booted normally. I probably would've solved this on my own, but I ran out of time that weekend and had to return the fileserver for the start of business on Monday. However, I rebooted the server a dozen times in a row to make sure the workaround was repeatable.

To add a new wrinkle, some time later (today, in fact) there was some sort of power failure at the office. I was able to bring up the server per the procedure enumerated above, but after I was able to restart the fileserver, my PS/2 keyboard became completely unresponsive! I can reboot the PC remotely, but my customer is starting to get impatient about this ongoing series of foul-ups.

I suspect there's some kind of GRUB error, but GRUB is pretty much opaque to me. Also, I can't explain why the system drive (ie. /) is designated /dev/hdc1 and not /dev/hda1. There is an hda and an hdb when I ls /dev. For every other Linux install I've done, I seem to recall the drives are always designated in primary master+slave, secondary master+slave order.
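
For reference, the legacy Linux IDE driver conventionally names drives by controller position rather than by detection order. A minimal sketch of that mapping, applied to the hardware layout described above (the helper name `ide_device_name` and the `layout` dictionary are hypothetical, for illustration only):

```python
# Conventional device names under the legacy Linux IDE (hdX) driver.
# The names depend on the controller position, not on what is attached.
IDE_NAMES = {
    ("primary", "master"): "hda",
    ("primary", "slave"): "hdb",
    ("secondary", "master"): "hdc",
    ("secondary", "slave"): "hdd",
}


def ide_device_name(channel, position):
    """Return the /dev path expected for a given IDE channel and position."""
    return "/dev/" + IDE_NAMES[(channel.lower(), position.lower())]


# Hardware layout as described in this post (labels are illustrative).
layout = {
    ("primary", "master"): "CD-ROM",
    ("primary", "slave"): "15GB boot drive",
    ("secondary", "master"): "500GB data drive",
    ("secondary", "slave"): "500GB data drive",
}

if __name__ == "__main__":
    for (channel, position), label in layout.items():
        print(ide_device_name(channel, position), "<-", label)
```

Under this convention a boot drive on the primary slave position would be expected at /dev/hdb, which is why a root filesystem on /dev/hdc1 looks odd for this layout.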

I just don't understand. Any help is welcome. TIA.....

Kenarkies 09-17-2012 08:02 PM

I can't say I can work out why, but the obvious statement is that it's trying to boot from a secondary drive. I never used cable select but always set the addresses directly, so there may possibly be something to do with the disk ordering. Anyway one workaround might be to install GRUB on the drive it's trying to boot from. Once logged in it's fairly straightforward - there are howto's around that explain the procedure.

Later thoughts - maybe it's not so obvious. This can happen if two drives have the same address, but that doesn't explain why it comes good later. It sounds rather more like a heat issue with the new drive, though unlikely. When booting from cold does the BIOS show all the drives correctly? If it can't see the new drive maybe it's stuck on the optical drive, although the BIOS settings should allow alternative drives to be tried. Depends on how old the motherboard is. It would probably be older than 2 years if it has all four IDE channels.

Ken

mr-roboto 09-17-2012 10:56 PM

Quote:

Originally Posted by Kenarkies (Post 4782800)
I can't say I can work out why, but the obvious statement is that it's trying to boot from a secondary drive. I never used cable select but always set the addresses directly, so there may possibly be something to do with the disk ordering. Anyway one workaround might be to install GRUB on the drive it's trying to boot from. Once logged in it's fairly straightforward - there are howto's around that explain the procedure.

@Ken: Thanx for the feedback. I normally set drives explicitly (ie. master/slave) as well, but that can't be it. All mfrs use CS mode by default, for ease of installation. However, I will check that they're all using the correct mode when I can.

Quote:

Later thoughts - maybe it's not so obvious. This can happen if two drives have the same address, but that doesn't explain why it comes good later. It sounds rather more like a heat issue with the new drive, though unlikely. When booting from cold does the BIOS show all the drives correctly? If it can't see the new drive maybe it's stuck on the optical drive, although the BIOS settings should allow alternative drives to be tried. Depends on how old the motherboard is. It would probably be older than 2 years if it has all four IDE channels.

Ken
The new boot drive ain't so new; it was just something that worked and was hanging around (ie. free), but I will check that it isn't overheating anyway. The PC hasn't actually been off for more than a couple of minutes in months. Actually, it's only been off for an extended period when in transit to/from my house.

Thanx again....

ghstridr 09-18-2012 12:33 AM

I have to agree with Kenarkies about the issue of the drive/partition discovery. Best thing is to boot to a recovery CD and look in dmesg to see if the order of discovery is correct.
Next, look at your fstab to see how partitions are being identified and mounted. I find that using UUIDs is the surest way to tie a particular partition to a specific mount point. Using labels is easier to read, but I have 1000+ physical servers (no, really) and had a recycled drive (ie. free) cause me problems because it contained a file system with labels that interfered with my new grub installation. So I recommend using UUIDs in grub and /etc/fstab.
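
As a rough illustration of the fstab check, here is a minimal Python sketch that flags entries mounted by raw device name (such as /dev/hdc1) instead of by UUID or label. The function name, the sample fstab contents, and the UUID shown are all made up for the example; on a real system you would read /etc/fstab and take the actual UUIDs from blkid.

```python
# Sample fstab text; the UUID and mount points are invented for illustration.
SAMPLE_FSTAB = """\
# <file system>  <mount point>  <type>  <options>  <dump>  <pass>
proc             /proc          proc    defaults   0       0
/dev/hdc1        /              ext3    errors=remount-ro  0  1
UUID=0a1b2c3d-1111-2222-3333-444455556666  /srv  ext3  defaults  0  2
/dev/hdc5        none           swap    sw         0       0
"""


def find_device_name_entries(fstab_text):
    """Return (device, mount_point) pairs that use raw hdX/sdX device names.

    Entries like these depend on the order in which the kernel discovers
    the drives, so they can break when that order changes.
    """
    flagged = []
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 2:
            continue  # malformed line, ignore
        device, mount_point = fields[0], fields[1]
        if device.startswith(("/dev/hd", "/dev/sd")):
            flagged.append((device, mount_point))
    return flagged


if __name__ == "__main__":
    for device, mount_point in find_device_name_entries(SAMPLE_FSTAB):
        print(device, "is mounted on", mount_point, "by device name")
```

Any entry it reports is a candidate for switching to the UUID= form.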

scrooge74 09-18-2012 08:14 AM

Just a weird idea, could it be the clock battery of the BIOS died on you? And that ends up messing your configuration after reboots or taking the power off the equipment?

mr-roboto 09-18-2012 08:42 AM

Quote:

Originally Posted by ghstridr (Post 4782967)
I have to agree with Kenarkies about the issue of the drive/partition discovery. Best thing is to boot to a recovery CD and look in dmesg to see if the order of discovery is correct.
Next, look at your fstab to see how partitions are being identified and mounted. I find that using UUIDs is the surest way to tie a particular partition to a specific mount point. Using labels is easier to read, but I have 1000+ physical servers (no, really) and had a recycled drive (ie. free) cause me problems because it contained a file system with labels that interfered with my new grub installation. So I recommend using UUIDs in grub and /etc/fstab.

UUIDs. That's an interesting idea. Thanx, I'm on that as soon as I fix their phone system....

mr-roboto 09-18-2012 08:46 AM

Quote:

Originally Posted by scrooge74 (Post 4783245)
Just a weird idea, could it be the clock battery of the BIOS died on you? And that ends up messing your configuration after reboots or taking the power off the equipment?

Nah. Keeps time just fine. Drive params haven't changed. Thanx....

mr-roboto 11-19-2012 12:26 PM

Finally discovered the source of the problem: motherboard/firmware issues. It had nothing to do with the Linux software at all. Everything checked out physically, but the PC still wouldn't boot as expected. Since the original post, I've personally encountered other Compaq Presarios that have similar boot problems, all related to BIOS/motherboard initialization. In one case, the PC powers up, but won't load an operating system. Clear the CMOS (by popping the battery) and it boots right into Windows, but won't reboot. It hangs at the cursor until you clear the CMOS again.

Mystery solved....


All times are GMT -5.