[SOLVED] Constant .tar.bz2 data corruption

Sum1 · 11-21-2009, 12:13 PM

Hi Group,

About a month ago I noticed several problems with my backups on my Samba server. There's 2 HD's on the server: /dev/sda holds the live data accessed by the users, and /dev/sdb holds all the foo.tar.bz2 daily backups. /dev/sdb is mounted as "/archive" in /etc/fstab.

Both HD's use the ext3 filesystem and specify filesystem parameters: defaults,noatime,data=writeback

1.I've fixed the root filesystem on /dev/sda and it's running fine.

2.I've written zeros to /dev/sdb using dd and freshly created the ext3 filesystem.

3.I have a bash scipt that runs daily as a cron job and it creates a .tar.bz2 of all the user data on the samba server. Seems to execute fine and squashes 20 gigs. into a 12 gig. file.

4.PROBLEM: Every test of every bzipped archive fails. When I try to decompress and unarchive the data, there's error messages about corrupted data, and maybe it can be recovered using bzip2recover. Relying on bzip2recover for all my user data is not a good long-term plan.

So for now, I'm simply making nightly copies of the live user data to /dev/sdb , but I definitely miss the space savings afforded by bzip2 compression.

Is there anything I can try to further test why the compressed data gets corrupted?

Thanks for your help.

MS3FGX · 11-21-2009, 12:46 PM

Nothing in your post is enough to identify the problem. Though you might want to look at a more robust filesystem then EXT3.

Do the kernel or system logs mention anything while the script runs? Can you post the script here in CODE tags so we can look it over for a possible glitch? Matter of fact, has the script ever worked as intended, or have you been able to run it on a different system?

Sum1 · 11-21-2009, 04:46 PM

Quote:

Originally Posted by MS3FGX

Though you might want to look at a more robust filesystem then EXT3.

What would you suggest?

Quote:

Do the kernel or system logs mention anything while the script runs?

No.

Quote:

Can you post the script here in CODE tags so we can look it over for a possible glitch?

Here it is, very simple stuff:

#!/bin/bash
echo "1. Make a date-stamped storage directory."
cd /archive
dirname=`date +"%Y%b%d"`
mkdir $dirname
cd $dirname
#
echo "2. Build archive and place into storage directory."
tar -cjf foo.tar.bz2 /foo
#
echo "3. Completed."

Quote:

Matter of fact, has the script ever worked as intended, or have you been able to run it on a different system?

The script starts, runs, and completes with no errors every night and produces the desired .tar.bz2 file. The only problem is once I unpack the .tar.bz2 file, it fails half way through with errors.

Thanks for your help, MS3.
I'm ready to follow up on any suggestions you may have.

chrism01 · 11-22-2009, 11:44 PM

You could try gzip, it has an option to specify how much compression you want, on a scale of 1-9. Obviously deeper compression is slower. bzip2 has the same option.
http://linux.die.net/man/1/gzip
http://linux.die.net/man/1/bzip2

Sum1 · 01-09-2010, 04:21 AM

I need to follow up on this thread and seek help from the group.
I continue to have problems with tar'ed backups.

Since last post, I have stopped using any form of file compression and create .tar files every night.

My bash backup script remains mostly the same, except now I simply do:

tar -c --atime-preserve -p -f foo.tar /foo

The result is a .tar file of about 22 gigs.

But when I test extraction of data (tar -xf foo.tar), I receive the following error:

tar: Skipping to next header
tar: Exiting with failure status due to previous errors

and then when I look in the extracted /foo directory I only see between 5 - 10 gigs. of data.

I don't know where to look to uncover the problem.
Nothing in /var/log indicates a problem with creating the .tar file, as far as I can tell.

unSpawn · 01-09-2010, 05:49 AM

Quote:

Originally Posted by MS3FGX

you might want to look at a more robust filesystem then EXT3.

If you search LQ you'll find ten times more problems with for example Reiser than with Ext. Leaves me wondering what your definition of "robust" here would be?..

Quote:

Originally Posted by Sum1

About a month ago

What changed in the system at that time software and configuration-wise?

Quote:

Originally Posted by Sum1

I've fixed the root filesystem on /dev/sda and it's running fine.

What exactly happened to require fixing the filesystem?

Quote:

Originally Posted by Sum1

I have a bash scipt that runs daily as a cron job and it creates a .tar.bz2

Which user does it run as?

Quote:

Originally Posted by Sum1

But when I test extraction of data (tar -xf foo.tar), I receive the following error

Can you at least list contents with 'tar -vtf foo.tar'? And of a compressed tarball? If it fails to complete listing contents then at what point (file or dir) does it fail and could you verbosely list the contents of the directory here? (Please use BB code tags.) Does it happen with small tarballs as well? And does, like chrism01 suggested, gzip work for you? Did you ever find traces of memory corruption in other processes? Does the machine have enough RAM and swap? You list mount flags. What happens if you remount with only "defaults" and test again? Else how about meanwhile using 'rsync' between the two disks? Please note there's thirteen questions here. You may or may not be able to answer them all but being as verbose as possible is good: the more information the better.

Sum1 · 01-09-2010, 10:58 AM

UnSpawn,
First and foremost, thank you so much for your excellent response -- it's like a guide path.

Quote:

Originally Posted by unSpawn

What changed in the system at that time software and configuration-wise?
What exactly happened to require fixing the filesystem?

Exactly this: http://www.linuxquestions.org/questi...gnose.-763829/

Quote:

Which user does it run as?

root

Quote:

Can you at least list contents with 'tar -vtf foo.tar'? And of a compressed tarball?

I'm logged in via ssh to the server as I write.
Creating a new foo.tar, and will try 'tar -vtf'.
Results: Two attempts and two failures.
"tar: Exiting with failure status due to previous errors"
Interesting note here, the failure occurred in the exact same place in the directory system both times. I can look at that further -- move this sub-directory to a different partition and try process again, etc.

Quote:

Does it happen with small tarballs as well?

No, I've recently tried a few .tar files about 1 Gig. in size (using data from within /foo) and those were successfully tar'ed and extracted without problems.

Quote:

And does, like chrism01 suggested, gzip work for you?

Will test .tar first, and then move on to gzip and bzip.

Quote:

Did you ever find traces of memory corruption in other processes?

Definitely not sure how to look for, or determine this.

Quote:

Does the machine have enough RAM and swap?

I believe so -- 4 gigs. of DDR2 800 RAM but only 3 gigs. of it is recognized due to using 32-bit Slackware 12.2 on this box. Have 2 Gigs. of Swap and it seemingly never gets used -- no matter what process is running, 'top' always reports 0k used for Swap. The server has no more than 30 users at any given time.

Quote:

You list mount flags. What happens if you remount with only "defaults" and test again?

I will plan an evening to give this a try. Server access is relied upon 7 days a week from 7 am - 7 pm.

Quote:

Else how about meanwhile using 'rsync' between the two disks?

I've heard about this and maybe it's time to try it. If I can make separate daily "syncs" equivalent to these .tar files, then I'll gladly opt for it. I need to RTFM along these lines.

GooseYArd · 01-09-2010, 11:17 AM

Hey Sum1,

There's definitely a problem with either the filesystem or the drive. If you're getting that kind of error with bzip2, you'll get it with any other compression program.

Is the kernel logging any filesystem or scsi/ide errors? I would expect an IO error to derail the tar as it was writing, but it's worth checking.

Next, I would rule out a bug in ext3. Use ext2fs on the drive receiving the tarball and see if the problem persists. If it does, I'd junk the drive.

This is probably a dumb question, but I assume you're running a 2.6 kernel with large file support? If not, or if you have an old glibc that doesnt support LFS, then bzip2 will receive a SIGFX once it writes 2gb of output, which would cause it to fail similarly to what you described. LFS has been around for a long time, so I doubt thats the problem, but it can happen.

Sum1 · 01-09-2010, 12:33 PM

Quote:

Is the kernel logging any filesystem or scsi/ide errors? I would expect an IO error to derail the tar as it was writing, but it's worth checking.

Mr. Goose :-)
Thanks too for your help.

I'm not sure if I'm looking in the right logs, or whether I have activated logging the right stuff...but I've checked through /var/log/messages and dmesg and syslog, and I can't find any error messages relating to IO activity.

Quote:

Next, I would rule out a bug in ext3. Use ext2fs on the drive receiving the tarball and see if the problem persists. If it does, I'd junk the drive.

I like the thinking and I'm beginning to suspect the drive itself since I just wiped it with zeros, installed the partition and ext3 filesystem only a month ago.

I'll blend your suggestion with UnSpawn's:
remount with ext3 defaults and test;
rebuild with ext2 and test;
install a different hard drive altogether and test.

Quote:

I assume you're running a 2.6 kernel with large file support? If not, or if you have an old glibc that doesnt support LFS, then bzip2 will receive a SIGFX once it writes 2gb of output, which would cause it to fail similarly to what you described.

Currently, using kernel 2.6.30.4
I checked my kernel config and it does show "Support for large block devices and files" built into the kernel.

Thanks again for your help.
I've got quite a bit of testing to do.

unSpawn · 01-11-2010, 11:49 AM

Quote:

Originally Posted by Sum1

Exactly this: http://www.linuxquestions.org/questi...gnose.-763829/

Auch. Unfortunately the thread doesn't show you determining and fixing what was wrong.

Quote:

Originally Posted by Sum1

"tar: Exiting with failure status due to previous errors"

Sometimes noting the error value ('tar --do-Something; echo $?') might help.

Quote:

Originally Posted by Sum1

Interesting note here, the failure occurred in the exact same place in the directory system both times. I can look at that further -- move this sub-directory to a different partition and try process again, etc.

Let us know.

Quote:

Originally Posted by Sum1

No, I've recently tried a few .tar files about 1 Gig. in size (using data from within /foo) and those were successfully tar'ed and extracted without problems.

Could try running tar through 'split' to come up with chunked archives?

Quote:

Originally Posted by Sum1

Definitely not sure how to look for, or determine this.

Unexplainable crashes, applications failing?

Quote:

Originally Posted by Sum1

I've heard about this and maybe it's time to try it. If I can make separate daily "syncs" equivalent to these .tar files, then I'll gladly opt for it. I need to RTFM along these lines.

...and search LQ. We've definitely got some threads on rsync. It isn't hard to use.

Quote:

Originally Posted by GooseYArd

This is probably a dumb question, but I assume you're running a 2.6 kernel with large file support? If not, or if you have an old glibc that doesnt support LFS, then bzip2 will receive a SIGFX once it writes 2gb of output, which would cause it to fail similarly to what you described. LFS has been around for a long time, so I doubt thats the problem, but it can happen.

I always thought LFS was a kernel 2.4 thing?.. BTW there is a 16 GB file-size limit if ext3 uses a 1 KB block-size, but the default is 4 KB anyway...

Quote:

Originally Posted by Sum1

I've got quite a bit of testing to do.

Let us know how it's going, OK?

Sum1 · 01-16-2010, 09:18 AM

Quote:

Originally Posted by unSpawn

Auch. Unfortunately the thread doesn't show you determining and fixing what was wrong.

Quote:

Let us know how it's going, OK?

I believe I can mark this thread "Solved."

1. Test Results

After multiple, many, repeated tests, using both server HD's in the box, I can report the following: regardless of tar, tar + gzip compression, or tar + bzip compression, there's always 2 or 3 corrupted areas of data that produce fatal errors when trying to recover/unpack the contents of the .tar. Depending on which of the 3 archiving methods employed, the errors are reproduced in different places in the data set.

2. Conclusion - (best efforts of deduction)

I must have committed an error while using tune2fs back in July 2009.
In another thread, I reported:

Quote:

I started with this ext3 setup in /etc/fstab:
/dev/sda2 / ext3 defaults 1 1
I changed /etc/fstab to:
/dev/sda2 / ext3 defaults,noatime,data=writeback 1 1
And then executed command on root partition:
tune2fs -o journal_data_writeback /dev/sda2

Works.
No data loss.
No problems.

In doing so, I may have made an error in stating a tune2fs command. Or possibly, I may not have properly unmounted the partition/filesystem prior to executing tune2fs commands. I may have remounted the partition in another terminal and forgot about it while executing commands in a different terminal. I'll never know for sure, but it seems like the only logical answer.

It would seem highly unlikely that both my server hard drives are failing. I have 30 users reading/writing to them no less than 12 hours a day, and I have not received any comments/complaints about lost files or inability to access files, corrupted files.....nothing at all. Believe me, they are not shy, and would gladly let me know of such occurrences. <grin>

I feel fortunate it's not a whole lot worse - 99.9% of the data is not corrupt. I've been backing up the data nightly by way of 'cp -p -r /foo /archive/date-stamped-directory/foo'. And then I run a bash script I made to diff and compare the copied files and directories in multiple ways.

Eventually, I'll have to delete all partitions and create new ones with cleanly configured ext3 or ext4 file systems.

UnSpawn, I truly appreciate the solid guidance and prompts to help me work through it in a logical way.