LinuxQuestions.org
Linux - General This Linux forum is for general Linux questions and discussion.
If it is Linux Related and doesn't seem to fit in any other forum then this is the place.

Old 11-03-2011, 02:30 PM   #1
hbutz
LQ Newbie
 
Registered: Jun 2011
Posts: 9

Rep: Reputation: Disabled
RedHat runs for 1-2 weeks then dies at the GRUB prompt


I have RedHat 9 running on a single partition with the GRUB bootloader. It's an embedded system that runs off a 2 GB CompactFlash (CF) card. I boot it into a text shell and have disabled most of the unnecessary services. It is a "blind system" with no keyboard, no mouse, no video - although a video card is present and active. The 2.4.20-8 kernel is the highest version I can run because of this particular project.

For 1-2 weeks, the system boots. I have Telnet and FTP access. When I connect the video cable I can see GRUB displaying a single menu item then booting after a few seconds. I have disabled automatic hardware discovery, power management, anything which was not needed.

The disk is formatted ext3 and, because there is no keyboard, the system is always shut down by yanking out the power cord. When the system boots up again, a message is displayed advising me that the system was shut down uncleanly. Most of the time, the system just boots and runs a few background applications, which write to 3 log files in addition to the system logs. I rotate the logs manually in the boot script.

Everything works great for a couple of weeks before it quits working. When I hook up the video and keyboard I see the "GRUB " prompt. *NO* keystrokes are accepted. It's just dead and I cannot enter any commands at the GRUB prompt. The only thing I can do is re-flash the card to the default system using SelfImage from a disk image file (clone it).

I looked at two failed CF cards and did some exhaustive tests. There are no read errors on the disks. They both mount without issue on another Linux computer using a USB card reader. I copied all the files off and did a file-by-file comparison. One disk had a few bad entries in the Korean locale settings, which are not used. The second bad disk had no bad file entries - I assume this would be taken care of by the journal. The /boot/ and grub directories are untouched. Aside from some log files and data files being updated, there does not appear to be anything which would cause GRUB to simply lock up. At the point where it hangs, Linux isn't even running yet.
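
For anyone repeating the file-by-file comparison, here is a minimal sketch in Python. It assumes both cards (or a card and a known-good image) are mounted read-only on another machine; the function names and mount points are hypothetical, not tools used in this thread.

```python
import hashlib
import os


def hash_tree(root):
    """Map each regular file's path (relative to root) to its SHA-256 digest."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks, device nodes, FIFOs and sockets.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            h = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            digests[os.path.relpath(path, root)] = h.hexdigest()
    return digests


def diff_trees(good_root, bad_root):
    """Report files missing from, extra on, or changed on the bad tree."""
    good = hash_tree(good_root)
    bad = hash_tree(bad_root)
    return {
        "missing": sorted(set(good) - set(bad)),
        "extra": sorted(set(bad) - set(good)),
        "changed": sorted(p for p in good.keys() & bad.keys() if good[p] != bad[p]),
    }
```

Something like `diff_trees("/mnt/good", "/mnt/bad")` would then list the differing paths. Log and data files are expected to show up as changed; anything under boot/grub would be the interesting part.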

I did a binary comparison on the MBR and no bits have been flipped. There's a discrepancy between LBA and CHS - but, from what I've read Linux doesn't use the BIOS geometry anyway. I've done an e2fsck on all the disks and nothing is being reported.
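
The MBR comparison can be sketched in Python as follows. It assumes 512-byte sectors and that each card has been dumped to an image file first; the helper names and file names are hypothetical.

```python
SECTOR_SIZE = 512  # assumption: classic 512-byte sectors


def read_mbr(image_path):
    """Return the first sector of a disk image or raw device."""
    with open(image_path, "rb") as f:
        mbr = f.read(SECTOR_SIZE)
    if len(mbr) != SECTOR_SIZE:
        raise ValueError("short read: %d bytes" % len(mbr))
    return mbr


def diff_bytes(a, b):
    """List (offset, byte_in_a, byte_in_b) for every byte that differs."""
    if len(a) != len(b):
        raise ValueError("buffers differ in length")
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


def has_boot_signature(mbr):
    """True if the sector ends with the 0x55 0xAA boot signature."""
    return mbr[510:512] == b"\x55\xaa"
```

An empty result from `diff_bytes(read_mbr("good.img"), read_mbr("bad.img"))` corresponds to "no bits have been flipped" above.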

My mtab had been modified by something for usbdevfs, /dev/pts/ and /dev/shm/, which I'm guessing are scratch drives?

I'd like to explore some options other than re-partitioning and starting from scratch. I am able to chroot and run grub on the second drive. I'm guessing I could re-install grub, but if I did that I would have to repeat this every couple of weeks.

Is there anything else I can look at to diagnose why grub is hanging up before I re-flash the drive? I have full r/w access to the CF card. I just have no clue why grub won't run.
 
Old 11-04-2011, 08:35 AM   #2
sundialsvcs
LQ Guru
 
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 10,659
Blog Entries: 4

So, obviously, the kernel crashed, and it tried to reboot, and there is something amiss with the GRUB configuration that keeps it from doing so.

You don't yet know what the root cause of the problem is. Nothing in this post suggests any root cause to me. Probably something external to the system itself causes it to fall over, and maybe that "something" happens every two weeks or so. (Does a filesystem fill up, say, with logs?)

Don't just start "slapping at it." Diagnose the problem. Look for log entries. Look over your software maintenance diary or notebook to recall what you (or someone else) recently did to the system. Consider when it was known to be stable vs. when it started to become unstable. (And if you don't keep such records ... ... start now.)

Last edited by sundialsvcs; 11-04-2011 at 08:37 AM.
 
Old 11-16-2011, 03:13 PM   #3
hbutz
LQ Newbie
 
Registered: Jun 2011
Posts: 9

Original Poster
Rep: Reputation: Disabled
Thank you for the reply. The system is an embedded system. Some log files are written, three background tasks run, files are FTP'd into a user subdirectory - really low demand on the system at runlevel 3 with a single user plus root. The system runs off a CF memory card. When I install the card into another CPU board it hangs in the same place, at the GRUB message in Stage 1.

This is what I have learned. It's not getting stuck at the GRUB prompt in stage 2. It's stuck after Stage 1. I re-compiled GRUB 0.93 Stage 1 to add a few diagnostic messages via the BIOS text string output function. I needed to shorten a few messages, making them all 2 bytes long to squeeze my new diagnostic messages into the MBR. Then, I hacked in the update with a hex editor on the MBR in-between the BIOS constants and the partition table. After the dust settled, I see "GRUB " then "D1" and "D3" which are my diag strings. GRUB Stage1 is running fine all the way up to the "jmp *(stage2_address)" where it goes onto whatever it read into RAM from the disk, which should be Stage2. The LBA address stored in the MBR @offset=0x44 for Stage2 is 2A0CD, assuming I read that right. Since I didn't see my "D2" message I know it's not falling back to CHS mode.
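
A quick way to double-check the value read from the MBR is sketched below in Python. The offset 0x44 is the one quoted above; treating the field as a 4-byte little-endian integer is an assumption about the GRUB 0.9x Stage 1 layout, and the function names are hypothetical.

```python
import struct

SECTOR_SIZE = 512
STAGE2_SECTOR_OFFSET = 0x44  # offset quoted in the post


def stage2_lba(mbr):
    """Decode the Stage 2 start sector stored in a 512-byte Stage 1 image.

    Assumes the field is a 32-bit little-endian integer.
    """
    (lba,) = struct.unpack_from("<I", mbr, STAGE2_SECTOR_OFFSET)
    return lba


def lba_to_byte_offset(lba):
    """Byte offset of a sector from the start of the whole disk."""
    return lba * SECTOR_SIZE
```

Printing `hex(stage2_lba(mbr))` for the good and the bad card gives a value that can be compared digit by digit against what the hex editor shows.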

I can read the CF card just fine and did a binary compare of the whole /boot directory and don't see anything obvious. The processor is jumping off the cliff between Stage1 and Stage2 and it's unlikely that I can figure out where without an ICE (in-circuit emulator).

For my next trick, I'll mount the drive on another system, chroot and re-install GRUB to see if a) The LBA address of Stage 2 in the MBR moves and b) Will it finally boot? If so, I'll do another compare on the whole drive to see what's different.

Haven't seen anything in the logs to clue me in to a failure which prevents Stage 2 from loading.
Attached Thumbnails: bootsector_debug.jpg (59.4 KB)
 
Old 11-16-2011, 06:06 PM   #4
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,127

I know nothing about embedded systems - never looked at it. Base any value of the following on the previous sentence.

IIRC grub stage1/1.5 embeds the (logical) device as well as LBA offset. I'd be guessing you are "losing" the device - I've seen situations where a USB will take an interrupt (power ?) and udev will re-drive and the USB will come back as another device entirely. Not sure how/if that could happen to a boot device, but if it did things would sure get ugly in a hurry. But for grub to fail (like that) on a re-boot, the change in device id would need to be "(semi-)permanent".
Stranger and stranger said Alice ...
 
Old 11-17-2011, 07:20 AM   #5
hbutz
LQ Newbie
 
Registered: Jun 2011
Posts: 9

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by syg00 View Post
I know nothing about embedded systems - never looked at it. Base any value of the following on the previous sentence.

IIRC grub stage1/1.5 embeds the (logical) device as well as LBA offset. I'd be guessing you are "losing" the device - I've seen situations where a USB will take an interrupt (power ?) and udev will re-drive and the USB will come back as another device entirely. Not sure how/if that could happen to a boot device, but if it did things would sure get ugly in a hurry. But for grub to fail (like that) on a re-boot, the change in device id would need to be "(semi-)permanent".
Stranger and stranger said Alice ...

My thinking was along the same lines. An embedded system just lacks a keyboard, mouse, monitor, and speaker - so, it's a black box which is always powered off without shutting down properly. Re-plugging a USB device sometimes confuses the OS's enumeration - something about VIDs, HIDs and PIDs which is resolved by a reboot.

The problem stays with the flash drive even through power-off. I thought for certain the MBR was being nuked. But, after a binary comparison of the MBR and spelunking GRUB's Stage1 everything looks fine. I thought the BIOS was forgetting what type of device it was, but the path through Stage1 never reaches an error; it just stops. It loads [something] from the disk which it *thinks* is Stage2, but I don't have the tools to see what it's reading from that sector. I can't modify Stage2, since once I save the changes it will be stored in a different physical LBA address than what's stored in the MBR. ugg.

I've considered flash memory wear leveling, but it can't possibly be changing the location of files, or else everything would stop working. Unfortunately the flash device appears as an IDE drive, which is more susceptible to data corruption than SATA. If I knew *why* it wasn't booting I would have a better handle on how to fix it.
 
Old 11-17-2011, 08:31 AM   #6
hbutz
LQ Newbie
 
Registered: Jun 2011
Posts: 9

Original Poster
Rep: Reputation: Disabled
Ok, now I know exactly what the system is doing but not why. The BIOS loads GRUB Stage 1 from the MBR. GRUB Stage 1 loads the sector at LBA address 2a0cd7 into memory and jumps to it, which runs Stage 2. Everything is fat, dumb and happy.

But then, for some unknown reason, the sector returned in response to loading 2a0cd7 changes. When I mount the good drive under Windows and use Hex Workshop I can read Stage 2 at that address just fine. But when I put the "bad" CF card in, the same address returns data from another part of the card entirely.

Using another program, Explore2fs, I can read stage2 from both the "bad" and good drives. Explore2fs reports the same blocks used. And, I can read the "bad" file just fine when going through the file system, but not when I attempt to access it via the LBA address. I'm thinking something has changed the LBA address or the BIOS is returning the wrong data?
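
The same check (raw LBA versus the file as seen through the file system) can be sketched in Python. It assumes the device path is the whole disk rather than a partition, since the LBA stored in the MBR is counted from the start of the disk, and that the sector Stage 1 loads is the first 512 bytes of the stage2 file, as described above. The function names and paths are hypothetical.

```python
SECTOR_SIZE = 512


def read_sector(device_path, lba):
    """Read one sector from a whole-disk device or image by absolute LBA."""
    with open(device_path, "rb") as dev:
        dev.seek(lba * SECTOR_SIZE)
        data = dev.read(SECTOR_SIZE)
    if len(data) != SECTOR_SIZE:
        raise ValueError("short read at LBA %#x" % lba)
    return data


def lba_matches_file(device_path, lba, stage2_path):
    """True if the raw sector at `lba` equals the first sector of the file."""
    with open(stage2_path, "rb") as f:
        expected = f.read(SECTOR_SIZE)
    return read_sector(device_path, lba) == expected
```

On a healthy card a call such as `lba_matches_file("/dev/sdX", 0x2A0CD7, "/mnt/cf/boot/grub/stage2")` should return True; on the "bad" card described here it would presumably return False even though the file itself reads back fine.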

Very weird. I'm just happy that I know what it's doing. Now, moving onto why is it doing it?
Attached Thumbnails: good_stage2.gif (8.9 KB), corrupted_stage2.gif (9.0 KB), before.gif (44.0 KB)
 
Old 11-17-2011, 08:32 AM   #7
hbutz
LQ Newbie
 
Registered: Jun 2011
Posts: 9

Original Poster
Rep: Reputation: Disabled
This is after the corruption (attached)
Attached Thumbnails: after.gif (48.7 KB)
 
Old 11-18-2011, 06:49 AM   #8
hbutz
LQ Newbie
 
Registered: Jun 2011
Posts: 9

Original Poster
Rep: Reputation: Disabled
I see what's going on. The CF card is returning different data for the same LBA address after several power cycles, which all happen without shutting down properly. The only thing which might possibly do this is the flash memory wear-leveling algorithm, which is apparently scrambling the association between the logical and physical sectors ... unless some type of background disk defrag is moving the first block of the file, but I can't see that. CF cards are susceptible to sudden power loss during writes - so, falling back on an old cliche, this is a "hardware problem."
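
To illustrate the hypothesis only, here is a toy Python model of a logical-to-physical sector map that loses an update on power failure. It is a deliberately simplified assumption, not a description of how any real CF controller works; the class, method names and data are all made up.

```python
class ToyFlash:
    """Toy flash translation layer: logical sector (LBA) -> physical page."""

    def __init__(self, page_count):
        self.pages = [None] * page_count
        # Free pages, used last-in first-out; pop() yields page 0 first.
        self.free = list(range(page_count - 1, -1, -1))
        self.mapping = {}        # map held in controller RAM
        self.saved_mapping = {}  # map as last written to flash

    def _commit(self):
        self.saved_mapping = dict(self.mapping)

    def write(self, lba, data, commit=True):
        page = self.free.pop()
        self.pages[page] = data
        old = self.mapping.get(lba)
        if old is not None:
            self.pages[old] = None
            self.free.append(old)
        self.mapping[lba] = page
        if commit:
            self._commit()

    def relocate(self, lba, commit=True):
        """Wear leveling: move a sector's data to a fresh page, free the old one."""
        old = self.mapping[lba]
        new = self.free.pop()
        self.pages[new] = self.pages[old]
        self.pages[old] = None
        self.free.append(old)
        self.mapping[lba] = new
        if commit:
            self._commit()

    def read(self, lba):
        return self.pages[self.mapping[lba]]

    def power_cycle(self):
        """Power loss: the RAM copy of the map is gone, reload the saved one."""
        self.mapping = dict(self.saved_mapping)


def demo():
    flash = ToyFlash(page_count=4)
    flash.write(0x2A0CD7, b"stage2 code")           # map saved
    flash.relocate(0x2A0CD7, commit=False)          # map update never saved
    flash.write(0x100, b"log data", commit=False)   # reuses the freed page
    before = flash.read(0x2A0CD7)
    flash.power_cycle()                             # cord yanked
    after = flash.read(0x2A0CD7)
    return before, after
```

In this model `demo()` reads the Stage 2 data before the power cycle and the log data afterwards, from the same LBA. The file system on the card would not notice, because nothing it wrote has changed from its point of view.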
 
  

