How are alternative superblocks fixed?

lotuseclat79 · 09-19-2007, 08:32 AM

The following problem occurs on an FC3 filesystem being used for storage (i.e. it gets mounted with -o sync, data is saved, then it gets unmounted). Note: FC3 OS on the hard drive on which the FC3 filesystem exists is not booted up and running, but power to the computer system puts the hard drive into a ready state. The FC3 filesystem is an ext3 journaled filesystem.

Using dumpe2fs, I have noticed that the primary superblock is the only clean superblock, even after booting up this morning after 32 mounts (2 over the maximum mount count of 30), which causes fsck to run. This means that 12 alternative superblocks record a "not clean" status for the ext3 filesystem even after fsck has been run. This does not seem to be how it should be, but that is how it is.

This has me concerned, naturally, because none of the alternate superblocks are usable, and when the primary goes south they cannot be used to fix the problem.

The Problem

What happens occasionally is that the ext_attr Filesystem (FS) feature goes missing from the primary superblock - I'm not sure just why as yet. Obviously, this would be the perfect opportunity to take advantage of a "clean" alternative superblock that also contains the missing ext_attr Filesystem feature. When this happens, the journal is not applied on the mount command from another Linux system (i.e. that OS is booted up and running) in the computer system. The current way I deal with this situation is to boot up either the FC3 OS or another Linux system which does not appear to manifest the missing ext_attr FS feature. Note: When this happens, the Filesystem state is still marked clean in the primary superblock; however, the missing ext_attr causes the journal to not be applied on a mount. Also, none of the "not clean" alternative superblocks contain the ext_attr Filesystem feature.

Potential fix for the problem

Since I know where the alternate superblocks are located (by running the dumpe2fs command and piping the output into a grep command to look for the string "superblock"), it seems that the dd command can be used (very carefully) to overwrite the "not clean" alternative superblocks (from the primary superblock) when in fact the primary superblock is clean and an fsck has just been run - and, most importantly, the ext_attr has not gone missing as far as the primary superblock is concerned. This allows the journal to be applied first to complete a successful mount by the other Linux system as opposed to when the ext_attr goes missing which results in the journal not being applied.
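The dumpe2fs-and-grep lookup described above can be sketched as a small filter. This is only an illustration: the function name is made up, and it assumes the "Backup superblock at N, ..." line format shown in the dumpe2fs output quoted in this post.

```shell
# Print the block number of every backup superblock found in
# dumpe2fs output read from standard input.
# (list_backup_superblocks is a hypothetical helper name.)
list_backup_superblocks() {
    sed -n 's/^ *Backup superblock at \([0-9][0-9]*\),.*/\1/p'
}

# Typical read-only use; /dev/sdXN is a placeholder for the real partition:
#   dumpe2fs /dev/sdXN 2>/dev/null | list_backup_superblocks
```

The filter only reads text, so it can be tried on saved dumpe2fs output before touching any device.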

What command can be or is typically used to repair the alternative superblocks (e.g. debugfs) i.e. to make them consistent with a clean primary superblock - or, have I stumbled upon how (by use of the dd command) in the process of thinking about this problem and how to resolve it? One would think that when a primary superblock is fixed by fsck, that the alternative superblocks would be updated - but, this does not seem to be the case with FC3.

Here is an example of using the dd commands to accomplish the task of repairing the alternative superblocks (i.e. only the 1st alternative superblock):

For the purpose of this example, here is a truncated list of the primary and 1st alternative superblocks from the output of the dumpe2fs command:
Primary superblock at 0, Group descriptors at 1-5
Backup superblock at 32768, Group descriptors at 32769-32773

Given: FS blocksize=4096; primary superblock at=0; 1st alternative superblock at=32768 and size of superblock=1024 <=== Is this correct??? Hard drive is 80GB SATA

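To copy the 1st backup superblock (assuming it is clean) to fix primary superblock (block 32768 x 4096 bytes = 131072 units of 1024 bytes; the primary superblock starts 1024 bytes into the partition, hence seek=1):
# dd if=/dev/sdbn of=/dev/sdbn bs=1024 skip=131072 seek=1 count=1

To copy the primary superblock (assuming it is clean) to fix the 1st backup superblock (same offsets, reversed):
# dd if=/dev/sdbn of=/dev/sdbn bs=1024 skip=1 seek=131072 count=1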

Note: /dev/sdbn is replaced by the actual device name where 'n' is some number: 0, 1, 2, ...
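One point worth spelling out: dumpe2fs reports superblock locations in filesystem blocks (4096 bytes here), while dd with bs=1024 counts skip and seek in 1024-byte units, so a block number has to be converted before it is used as a dd offset. A minimal sketch of that conversion, assuming the values given above; the function and variable names are made up for illustration:

```shell
# Convert a filesystem block number into a dd offset expressed in
# 1024-byte units (for use with bs=1024).
#   $1 = block number reported by dumpe2fs
#   $2 = filesystem block size in bytes
block_to_kib_offset() {
    echo $(( $1 * $2 / 1024 ))
}

# The primary ext2/ext3 superblock starts 1024 bytes into the
# filesystem regardless of block size, so its offset in
# 1024-byte units is 1.
PRIMARY_SB_KIB_OFFSET=1

block_to_kib_offset 32768 4096   # prints 131072
```

This only does arithmetic and touches no device. It does not by itself show that a raw copy between superblocks is safe.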

A few last questions:

1) Will the dd commands work and not bork up the good primary superblock - have I got the right superblock size=1024?

2) Is there another way to do this, say with the debugfs command?

3) Is it not advisable to use the dd command (as above) to overwrite the alternative superblock(s) from the primary superblock?

4) Is this a bug in FC3 or any subsequent Fedora release?

5) If not, then why does the fsck command not repair or update the alternative superblocks when the primary superblock is fixed?

6) Is that an fsck.ext3 command bug/omission - i.e. that the alternative superblocks are not updated when the primary superblock is fixed by fsck.ext3?

-- Tom

P.S. I have recently been issuing three sync commands after all of the data transfers have been made to the successfully mounted FC3 file system (i.e. journal has been applied) before unmounting it. Hopefully, this may help to ameliorate the problem.

P.P.S. For more information about the symptoms of this problem and the other Linux environment in which I operate look at this thread.

jailbait · 09-19-2007, 11:41 AM

Quote:

Originally Posted by lotuseclat79

P.S. I have recently been issuing three sync commands after all of the data transfers have been made to the successfully mounted FC3 file system (i.e. journal has been applied) before unmounting it. Hopefully, this may help to ameliorate the problem.

P.P.S. For more information about the symptoms of this problem and the other Linux environment in which I operate look at this thread.

If you have fixed the original problem of sync not working with ext3 then I suggest that you try to correct the bad superblocks by copying the entire file system to a new partition, formatting the partition with the bad superblocks, and copying your data back to the reformatted partition.

-----------------
Steve Stites

lotuseclat79 · 09-19-2007, 02:17 PM

Quote:

Originally Posted by jailbait

If you have fixed the original problem of sync not working with ext3 then I suggest that you try to correct the bad superblocks by copying the entire file system to a new partition, formatting the partition with the bad superblocks, and copying your data back to the reformatted partition.

-----------------
Steve Stites

Hi Steve,

I am not altogether certain that the "original problem" as you put it is due to sync not working with ext3. I adopted the sync option on the mount command only some time after the first occurrence of the problem, probably not immediately after but a couple of events later, once I started to look into the details more closely. Note: I am aware that there was a sync option problem with the mount command at some time in the past (not sure in what version it was fixed).

Admittedly, I did not issue sync commands after writing data to the mounted filesystem when I first experienced this problem (of losing the ext_attr Filesystem feature which I only discovered recently as a major symptom of the problem), but now I do which flushes the file system buffers prior to unmounting the filesystem. I am hoping that keeps the primary superblock intact until I can figure out the best way to correct the alternate superblocks.

The situation as it now exists poses another serious question since the alternate superblocks are now borked and the primary is the only good superblock. This situation represents a deviation from the assumptions in the kernel regarding the following comment in unix.c of e2fsck source code:

/*
 * We only update the master superblock because (a) paranoia;
 * we don't want to corrupt the backup superblocks, and (b) we
 * don't need to update the mount count and last checked
 * fields in the backup superblock (the kernel doesn't update
 * the backup superblocks anyway). ...
 */

Not wanting to corrupt the backup superblocks (for reasons of paranoia) holds as a viable strategy only if the situation is reversed from what it now is in this situation. I assume there are motivations related to performance of the kernel in not even checking the state of the alternate superblocks. Since the kernel never even looks to see if the backup superblocks are borked - then the assumption that they are always good, is a mistaken assumption over time, and therefore potentially a serious bug in the kernel (as if anyone would ever admit that, eh?). What good are alternate superblocks if they all become borked? Clearly, the rationale with regard to these assumptions needs to be revisited in the kernel so that an alternate strategy can be applied to maintain stability in the alternate superblocks and consistency with the primary, otherwise, the reliability of the filesystem and the kernel is weakened.

I assume there is a way to fix the backup superblocks with the debugfs command, but do not really know if it is up to the task. It looks like I'll have to look at the source code to make that determination unless you know and can tell me how.

I understand your recommendation as:
1) create a new partition and then do a mkfs to build an ext3 filesystem in that new partition with the same Filesystem features: has_journal ext_attr filetype sparse_super (which should, in the process, also create new clean alternate superblocks; is that correct?),
2) copy the entire file system to the new partition,
3) reformat the old partition with the bad superblocks, and
4) copy the data back to the reformatted partition.

What tool(s) do you recommend for the filesystem copy? Since the disk is a bootable disk with an installed OS, then it would be essential to preserve the MBR, etc. In that case tar and cpio would probably not be sufficient. I assume dd would just preserve the corrupted alternate superblocks and write over the newly created clean alternate superblocks if used (is that correct?). Please correct me if I am wrong in these assumptions.

The hard drive on which the ext3 filesystem resides is an 80GB SATA, with geometry (9729 cylinders/255 heads/63 sectors/track) where the starting cylinder of the filesystem is 14, the ending cylinder is 9538 and the size in MB of the filesystem is 74716. Unfortunately, I do not at this time have the resources (disk) available to accomplish this task.

Thanks for your reply though, I do appreciate your suggestion even though I may be a bit foggy on the details of exactly how to make that happen error free.

-- Tom

P.S. I am not sure this is a kernel bug, but if there is a kernel filesystem wizard that can take a look at this problem and decide one way or the other, that would be a good thing either way.

jailbait · 09-20-2007, 12:38 PM

"Since the disk is a bootable disk with an installed OS, then it would be essential to preserve the MBR, etc."

No, the MBR is not part of any file system. You do not have to mess with the MBR to solve this problem.

"What tool(s) do you recommend for the filesystem copy?"

I recommend tar. My second choice would be cp.
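A minimal sketch of such a tar copy, assuming both filesystems are mounted; the function name and the mount points in the usage line are hypothetical, and ownership is only preserved when run as root:

```shell
# Copy everything under directory $1 into directory $2 through a
# tar pipe, keeping permissions (-p) and directory structure.
# (copy_tree is a hypothetical helper name.)
copy_tree() {
    tar -C "$1" -cpf - . | tar -C "$2" -xpf -
}

# Example with made-up mount points:
#   copy_tree /mnt/old /mnt/new
```

Because tar copies files rather than raw blocks, the superblocks of the source filesystem are not carried over to the destination.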

"Since the kernel never even looks to see if the backup superblocks are borked - then the assumption that they are always good, is a mistaken assumption over time, and therefore potentially a serious bug in the kernel (as if anyone would ever admit that, eh?). What good are alternate superblocks if they all become borked? Clearly, the rationale with regard to these assumptions needs to be revisited in the kernel so that an alternate strategy can be applied to maintain stability in the alternate superblocks and consistency with the primary, otherwise, the reliability of the filesystem and the kernel is weakened."

I think that fsck is the software that needs to be looked at, not the kernel. If you substituted fsck for kernel in your paragraph that I quoted then I would agree with the paragraph.

"P.S. I am not sure this is a kernel bug, but if there is a kernel filesystem wizard that can take a look at this problem and decide one way or the other, that would be a good thing either way."

I think that the bad superblocks are probably a result of a bug in the kernel where the kernel mishandles the situation where an ext3 file system is mounted with the sync option. If fsck failed to detect the invalid superblocks then that might be a bug in fsck.

What did you do to get around sync not working correctly with ext3? If you haven't got that fixed then the bad superblock problem will probably recur.

Once you have the original problem fixed then you can straighten out the bad superblocks once and for all. You might be able to fix the bad superblocks by copying superblocks around but reformatting the partition is a much more reliable way to do it.

"The hard drive on which the ext3 filesystem resides is an 80GB SATA, with geometry (9729 cylinders/255 heads/63 sectors/track) where the starting cylinder of the filesystem is 14, the ending cylinder is 9538 and the size in MB of the filesystem is 74716. Unfortunately, I do not at this time have the resources (disk) available to accomplish this task."

How do you back up this partition?

---------------------
Steve Stites

lotuseclat79 · 09-21-2007, 11:00 AM

Hi Steve,

I did not assume that the MBR was a part of the filesystem, only that the MBR would need to be preserved for any new partition that would replace the existing one since it is a bootable hard drive.

The code comment out of e2fsck (included in my previous comment) seems very telling about the assumptions made by both the kernel and fsck.

I am not quite certain of the genesis (origin) of the ext_attr going missing in the first place. I am certain that I did not use the sync option on the mount command until after the first or maybe third occurrence of the problem. The initial incident may have been a power outage earlier this summer.

I have no reason to believe that fsck's failure to fix the alternate superblocks, when they clearly are not clean, is a problem with the kernel; it is more likely a problem with fsck. From the code comment it is clear that fsck needs the functionality to repair the alternate superblocks, or it may be that e2fsck already does so when invoked with the force and preen options: # e2fsck -fp /dev/sdb2.

Do you know if that will work? Anyone?
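One way to check the result after any such run is to compare the state and feature lines of the primary superblock with those of a backup. A minimal sketch, assuming the device name, backup block number, and block size quoted earlier in this thread; the function name is made up:

```shell
# Keep only the state and feature lines from `dumpe2fs -h` output
# read on standard input. (sb_summary is a hypothetical helper name.)
sb_summary() {
    grep -E '^Filesystem (state|features):'
}

# Read-only usage, primary superblock first and then the 1st backup:
#   dumpe2fs -h /dev/sdb2 | sb_summary
#   dumpe2fs -h -o superblock=32768 -o blocksize=4096 /dev/sdb2 | sb_summary
```

If the two summaries still differ after e2fsck -fp, the backups were not rewritten by that run.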

-- Tom