Need boot help... BAD. RH 7.3

Tenover · 11-05-2003, 12:22 PM

Thanks for all the help! I just booted up with the RH CD, and when I check running processes, there are about 20 instances of "kjournald" running, 9 of which are "defunct"..... Is that right? Could this be what's causing my problem?

idaho · 11-05-2003, 12:35 PM

I did some quick googling and found that kjournald processes are the journaling mechanism for your ext3 filesystems.

This is probably a symptom of your problem, rather than a cause. I am not sure how ext3 journaling reacts when your system is trying to deal with flaky HDD media.

Tenover · 11-05-2003, 12:42 PM

I'm at my wit's end now. *Should* I be able to run a fsck.ext3 check on my / partition? I can't seem to. I get a "fsck.ext3: Is a directory while trying to open /" error, followed by the superblock warning/error.....

idaho · 11-05-2003, 12:54 PM

Check your /etc/fstab file to see if / was mounted as an ext3 filesystem. It is possible that it was not.

There is some hoop jumping (use of an initial ramdisk during boot) that needs to be done to use a journaled filesystem on the root partition, and some people prefer to keep things simple.
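A minimal sketch of that fstab check, assuming a standard whitespace-separated /etc/fstab with "#" comments; the helper name `fstab_entry` is hypothetical. It prints the device and filesystem type for one mount point. The device it prints is also what fsck wants: fsck.ext3 expects a device (for example /dev/hda1), not a mount point, which fits the "Is a directory while trying to open /" message. On Red Hat the device field is often `LABEL=/`; the `fsck` front end accepts labels, so it is the simpler command to run on such an entry.

```shell
#!/bin/sh
# fstab_entry MOUNTPOINT [FSTAB]
# Prints "DEVICE FSTYPE" for the first fstab line whose mount point matches.
# Hypothetical helper; assumes whitespace-separated fields and "#" comments.
fstab_entry() {
    mnt=$1
    fstab=${2:-/etc/fstab}
    # field 1 = device (or LABEL=...), field 2 = mount point, field 3 = type
    awk -v m="$mnt" '$1 !~ /^#/ && $2 == m { print $1, $3; exit }' "$fstab"
}

# Example from the rescue environment:
#   fstab_entry / /mnt/sysimage/etc/fstab
# Pass the printed device, not "/", to fsck, and only while that
# filesystem is unmounted or mounted read-only.
```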

Tenover · 11-05-2003, 12:59 PM

I checked, and yes, / is mounted as an ext3 filesystem. I also ran fsck.ext3 on / and it came back clean. If I were to restore from a tape, should I just restore the entire / partition, or can I get away with just restoring /boot or /etc or...?????

idaho · 11-05-2003, 01:03 PM

I would try just restoring the /boot, /etc, /sbin, and /usr/sbin directories. This should give you the files needed to complete the reboot without overwriting any data files. If this fails, then you can try a complete restore of the / partition.
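A minimal sketch of that partial restore as a shell function, so the list of directories is explicit. The name `restore_dirs` is hypothetical, and it assumes GNU tar plus a backup made from / with the "zpcf ... *" command quoted further down, so member names look like etc/... rather than /etc/... Naming a directory extracts everything under it, and existing files in those directories are overwritten.

```shell
#!/bin/sh
# restore_dirs ARCHIVE DEST MEMBER...
# Hypothetical helper: extracts only the named members (and everything
# under them) into DEST. Assumes a gzip ("z") archive whose member names
# have no leading "/", so you ask for "etc", not "/etc".
restore_dirs() {
    archive=$1
    dest=$2
    shift 2
    # -x extract, -z gunzip, -p keep permissions, -C change into DEST first
    tar -xzpf "$archive" -C "$dest" "$@"
}

# Example from rescue mode, if the installed system is mounted at /mnt/sysimage:
#   restore_dirs /dev/st0 /mnt/sysimage boot etc sbin usr/sbin
```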

Good luck.

Tenover · 11-05-2003, 01:47 PM

When I try to check the tape, I get "tar: This does not look like a tar archive"..... But it is. To back up, I simply use:

tar zpcf /dev/st0 * --exclude=... (for a few directories)
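For reference, here is a sketch of the same kind of backup as a reusable function, under a few assumptions: GNU tar, an absolute archive path (the function changes into the source directory first), and excluded names written the way they appear in the archive (proc, not /proc). The name `backup_tree` and the excluded directories in the example are hypothetical. Each exclusion becomes its own --exclude=PATTERN word placed before the file list. One side effect of the * glob worth knowing: it does not match hidden entries in the top directory, such as /.autofsck.

```shell
#!/bin/sh
# backup_tree ARCHIVE SRCDIR [EXCLUDE...]
# Hypothetical helper: gzip-compressed tar of everything in SRCDIR matched
# by *, skipping each EXCLUDE pattern. ARCHIVE must be an absolute path.
backup_tree() {
    archive=$1
    src=$2
    shift 2
    # Rewrite the remaining arguments in place as --exclude=PATTERN words.
    # The for-list is expanded once, so appending and shifting inside is safe.
    for pat in "$@"; do
        set -- "$@" "--exclude=$pat"
        shift
    done
    # Subshell so the caller's working directory is unchanged.
    (cd "$src" && tar -zcpf "$archive" "$@" *)
}

# Example (the excluded directories are placeholders):
#   backup_tree /dev/st0 / proc tmp
```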

idaho · 11-05-2003, 02:26 PM

The "z" option in your tar command compresses the archive. To extract the files you need to use the "z" option as well, e.g.:
tar -xvzpf /dev/st0

Unless forced to by lack of storage capacity, for transparency reasons I prefer not to use compression with tar when creating backups.
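If you are ever unsure whether a given tape was written with "z", one quick check is to look for the gzip magic bytes (1f 8b) at the very start of the archive. A sketch, assuming dd and od are available and that tar used its default 10240-byte records; the function name `is_gzip` is hypothetical. /dev/st0 is the rewinding tape device, so the tape rewinds when dd closes it.

```shell
#!/bin/sh
# is_gzip ARCHIVE -> exit status 0 if ARCHIVE starts with the gzip magic 1f 8b.
# Hypothetical helper. Reads one 10240-byte record (tar's default 20 x 512
# blocking) so a tape drive returns a whole record.
is_gzip() {
    magic=$(dd if="$1" bs=10240 count=1 2>/dev/null | od -An -tx1 -N2 | tr -d ' \n')
    [ "$magic" = "1f8b" ]
}

# Example:
#   if is_gzip /dev/st0; then echo "use tar -tzf"; else echo "use tar -tf"; fi
```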

FYI, here is an article from IBM developerWorks about using a Knoppix bootable CD for disaster recovery:
http://www-106.ibm.com/developerwork...noppixRecovery

Tenover · 11-05-2003, 03:06 PM

So, is there a way to "see" all the backed up files on my DLT tape?

idaho · 11-05-2003, 04:39 PM

Yes, use:
tar -tzf /dev/st0 | less
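Paging through the full listing works; before a partial restore it can also help to confirm that the directories you plan to extract are actually on the tape. A sketch, assuming GNU tar (which lists directory members with a trailing slash, e.g. "etc/") and the same "z" archive; `missing_from_backup` is a hypothetical name. It reads the listing once and prints each requested name it could not find.

```shell
#!/bin/sh
# missing_from_backup ARCHIVE NAME...
# Hypothetical helper: prints every NAME with no matching member in ARCHIVE.
# Reads the gzip-compressed listing a single time, which matters on tape.
missing_from_backup() {
    archive=$1
    shift
    listing=$(tar -tzf "$archive") || return 2
    for want in "$@"; do
        # Directory members are listed with a trailing "/", files without it.
        printf '%s\n' "$listing" | grep -Fqx -e "$want" -e "$want/" || echo "$want"
    done
}

# Example:
#   missing_from_backup /dev/st0 boot etc sbin usr/sbin
```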

Tenover · 11-06-2003, 12:14 PM

Hey idaho.....I'm holding off on restoring because I think I found something that could be causing the problem, but I need your help figuring out what to do....
After the boot fails and I go into repair mode, I can do a mount -a and it mounts all the filesystems in fstab just fine except for /var. I can mount /var manually just fine. If I do a umount -a, all filesystems are unmounted just fine EXCEPT for /var: it tells me that /var is not mounted, and then when I do a df -k it shows me that / and /var are mounted (and that's all), but both filesystems have the EXACT SAME stats (space, usage, etc....). Any ideas??? Like I said, /var is on its own physical disk.

Tenover · 11-06-2003, 12:28 PM

Oh yeah, once I manually mount /var, the stats all change to reflect the proper values.....
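One possible reading of those identical numbers: on systems of this vintage df and mount work from /etc/mtab, which can go stale after a crash or while / is read-only, so /var can be listed as mounted while the /var directory really sits on the root filesystem (and then reports the root filesystem's stats). The kernel's own view is /proc/mounts. An independent test is to compare device numbers: a directory is a mount point if it is on a different device than its parent, or if it is the root. A sketch assuming GNU stat with -c; `is_mountpoint` is a hypothetical name (newer systems ship a `mountpoint` command for this).

```shell
#!/bin/sh
# is_mountpoint DIR -> exit status 0 if DIR is the top of a mounted filesystem.
# Hypothetical helper; assumes GNU stat ("%d" = device number, "%i" = inode).
is_mountpoint() {
    dir=$1
    dev=$(stat -c %d "$dir") || return 2
    parent_dev=$(stat -c %d "$dir/..") || return 2
    # Different device from the parent: something is mounted here.
    [ "$dev" != "$parent_dev" ] && return 0
    # Same device: only a filesystem root is its own parent (same inode).
    [ "$(stat -c %i "$dir")" = "$(stat -c %i "$dir/..")" ]
}

# Example, alongside the kernel's list:
#   is_mountpoint /var && echo "/var is mounted" || echo "/var is just a directory on /"
#   grep ' /var ' /proc/mounts
```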

Tenover · 11-06-2003, 12:52 PM

Well, mount and umount seem to be working OK now. I've put some echo statements into the /etc/rc.d/rc.sysinit file to see how far it gets, and it fails right at the autocheck area..... Do .autofsck and .automount need to be in /??

Tenover · 11-06-2003, 01:00 PM

Here is the exact location in the rc.sysinit file where my boot stops with this error:
execvp: No such file or directory

It is in the rc.sysinit file, and it's one of these two lines, because I have an echo at the start and end of these lines and I only get the first echo, then the error.

if [ -f /forcefsck ]; then
fsckoptions=" $fsckoptions"

Eqwatz · 11-06-2003, 03:08 PM

With the boot disk, mount "/" and check whether /.autofsck has any entries. If it does, mv it to .oldautofsck and create an empty file.

You can change /etc/fstab to mount /var as ext2. I'm not positive, but I think you can change all of the filesystem entries to ext2; this way you bypass the journaling, but you may boot up cleanly that way. (I'm pretty sure you can just change the entries, because of "pivot_root" on initrd.)
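A sketch of that fstab change, assuming a whitespace-separated /etc/fstab; `fstab_set_type` is a hypothetical name. It prints an edited copy to stdout instead of editing in place, so you can compare it with the original and keep a backup. awk re-joins the changed line with single spaces; every other line is printed unchanged.

```shell
#!/bin/sh
# fstab_set_type FSTAB MOUNTPOINT NEWTYPE
# Hypothetical helper: prints FSTAB with the type field of MOUNTPOINT replaced.
fstab_set_type() {
    awk -v m="$2" -v t="$3" '
        $1 !~ /^#/ && $2 == m { $3 = t }   # assigning a field rebuilds the line with single spaces
        { print }
    ' "$1"
}

# Example (review the output before it replaces the real file):
#   cp /etc/fstab /etc/fstab.orig
#   fstab_set_type /etc/fstab.orig /var ext2 > /etc/fstab
```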

Ok, my rc.sysinit on that line says (remember, I am running Red Hat 9):

if [ -f /fsckoptions ]; then
    fsckoptions=`cat /fsckoptions`
fi
if [ -f /forcefsck ]; then
    fsckoptions="-f $fsckoptions"
elif [ -f /.autofsck ]; then
    echo $"Your system appears to have shut down uncleanly"
    AUTOFSCK_TIMEOUT=5
    [ -f /etc/sysconfig/autofsck ] && . /etc/sysconfig/autofsck
    if [ "$AUTOFSCK_DEF_CHECK" = "yes" ]; then
        AUTOFSCK_OPT=-f
    fi
    # ... (rest of this branch not shown)
fi
--------------------------------
More to come; I have to track down where it gets /fsckoptions from. I didn't find it with locate.

Okay, /etc/sysconfig/fsckoptions is an optional file. Reading farther down, there is a shell variable AUTOFSCK_OPT, whose value probably ends up in "$fsckoptions".
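The branch structure of the snippet above can be summarized as a small function. This is a simplified sketch, not Red Hat's code: `fsck_trigger` is a hypothetical name, it takes the root directory as an argument so it can be pointed at an installed system mounted from the rescue CD (for example /mnt/sysimage), and it only reports which branch would run. Both tests in the snippet are `[ -f ... ]`, so they only check that the file exists, not what is in it.

```shell
#!/bin/sh
# fsck_trigger [ROOT] -> prints which branch of the quoted rc.sysinit logic
# would run: "force", "autofsck", or "none". Simplified, hypothetical helper;
# it does not prompt, read /etc/sysconfig/autofsck, or run fsck.
fsck_trigger() {
    root=${1:-/}
    if [ -f "$root/forcefsck" ]; then
        echo force       # rc.sysinit adds -f to fsckoptions
    elif [ -f "$root/.autofsck" ]; then
        echo autofsck    # "shut down uncleanly" branch; only existence is tested
    else
        echo none
    fi
}

# Example from rescue mode:
#   fsck_trigger /mnt/sysimage
```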

Since you are cruising files from the command line anyway, you might check /etc/sysconfig/harddisks to see whether the other guy tweaked the system for better performance. The file applies to all hard drives equally, and if one is "sick" it will throw everything off. The "safe" setting is having everything commented out with "#".
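Checking for that "safe" state amounts to looking for any line that is neither blank nor a comment. A sketch; `active_settings` is a hypothetical name. No output means every setting in the file is commented out.

```shell
#!/bin/sh
# active_settings FILE -> prints lines that are neither blank nor "#" comments.
# Hypothetical helper. No output means every setting in FILE is commented out.
active_settings() {
    grep -Ev '^[[:space:]]*(#|$)' "$1"
}

# Example:
#   active_settings /etc/sysconfig/harddisks
```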

I put individual hdparm scripts, custom for each drive, elsewhere.

You have spent far more time on this than a reinstall would take; you know that, don't you?

It may sound goofy, but I am really lost on a hosed system unless I have it in front of me.

If I don't follow an exact procedure, and mentally check everything off as I go along, invariably I thrash it up even worse.

First, I check the logs to see how long it has been since the hardware was serviced. If it has been more than six months, I open up the case: I pull out and reseat all of the memory, the PCI adapters, and the jumpers on any WD hard drives (I have a WD Caviar drive with a weird corrosion problem on the jumper pins; it has been that way since one year after I got it); I blow all of the dust out; I carefully remove and reinstall the cables; and I say nice things to the server while I do it. (Weird? Yes. Does it seem to help? Yes, I'll swear to it on a stack of Bibles, but I don't know why. Maybe it's just me. Nah, I caught another guy doing it.) Bada-bing! Many times that's all it was.

It doesn't rule out hardware, but at least I know it is not a connection anywhere with corrosion or dirt on it.

Then, I go on from there.

P.S. I did check the pins and the jumper in case one was tin and one gold; maybe it is gold-plated tin, I don't know. And yeah, I have some Caviar drives. Don't laugh too hard, okay?

P.P.S. I had an el-cheapo CMD680-based ATA/133-UDMA IDE adapter that would work OK for a while, then corrupt a filesystem. I got good at image restoration (I use Acronis self-booting media; it is 32-bit and fast as all get out. I cheat when I can.). It would happen anywhere from 2 weeks to 2 months apart, and because it wasn't consistent, it was difficult to track down. The operative word is "had".