LinuxQuestions.org - Compression help


dimpu

09-26-2010 01:51 AM

Compression help

Friends,

I was wondering if someone can help me out with my problem.

I have a C program which outputs a text file, test.out, containing hexadecimal addresses. The file size is a little more than 100GB. The problem is that I don't have enough space to store this file on my hard drive: I have only 72GB free. Can anyone tell me how to compress this file on the fly so that I can store it on my hard disk?

Also, once I have test.out in a compressed format such as .gz or .bz2, I want to use it as input to a shell script, for example ./simulator < test.out.gz. I would be really grateful if someone could help me with this as well.

Thanks

neonsignal

09-26-2010 02:28 AM

You could just pipe it through gzip:

Code:

./test | gzip -c >test.out.gz

And then recover using

Code:

gunzip -c <test.out.gz | ./simulator

(the '-c' flag is just so that gzip will use stdin/stdout).
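
As a small self-contained sketch of the same idea (not the poster's actual programs): the shell function `generate` below is a hypothetical stand-in for ./test, and `wc -l` stands in for ./simulator. Everything is written to a temporary directory.

```shell
#!/bin/sh
# Hypothetical stand-in for ./test: prints hexadecimal addresses to stdout.
generate() {
    i=0
    while [ "$i" -lt 1000 ]; do
        printf '0x%08x\n' "$i"
        i=$((i + 1))
    done
}

tmp=$(mktemp -d)

# Compress on the fly: the uncompressed text never touches the disk.
generate | gzip -c > "$tmp/test.out.gz"

# Decompress on the fly and feed a consumer (wc -l stands in for ./simulator).
lines=$(gunzip -c < "$tmp/test.out.gz" | wc -l)
echo "lines recovered: $lines"

rm -rf "$tmp"
```

Only the compressed file is ever stored, so the disk needs room for the .gz and nothing more.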

chrism01

09-26-2010 07:05 PM

An alternative command to read the file is zcat.
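
A minimal sketch of that, assuming the GNU gzip tools found on a typical Linux system, where zcat behaves like gunzip -c. The two sample addresses are made up for illustration, and `head -n 1` stands in for ./simulator.

```shell
#!/bin/sh
tmp=$(mktemp -d)

# Build a tiny compressed file to read back (sample data is made up).
printf '0x1000\n0x1004\n' | gzip -c > "$tmp/test.out.gz"

# zcat writes the decompressed text to stdout, ready to be piped onwards.
first=$(zcat "$tmp/test.out.gz" | head -n 1)
echo "first line: $first"

rm -rf "$tmp"
```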

dimpu

09-26-2010 11:47 PM


Quote:

Originally Posted by neonsignal (Post 4109222)

You could just pipe it through gzip:

Code:

./test | gzip -c >test.out.gz

And then recover using

Code:

gunzip -c <test.out.gz | ./simulator

(the '-c' flag is just so that gzip will use stdin/stdout).

Thanks Neo and Chris for your promptness, but the problem is that my C program writes its output to test.out (the file name is set in the C code), and when I pipe the way you suggested I get both files, test.out and test.out.gz. The test.out file shows all the required data, but test.out.gz just says "done..." and contains none of the text (the hexadecimal addresses I am looking for).

May I know why this is happening?

neonsignal

09-27-2010 12:05 AM

Quote:

Originally Posted by dimpu (Post 4109895)

The test.out file shows all the required data, but test.out.gz just says "done..." and contains none of the text (the hexadecimal addresses I am looking for). May I know why this is happening?

Piping (using the '|') only affects the standard output of the program. Because your C program writes its data directly to a file, only what it prints to standard output (apparently just the "done..." message) goes into test.out.gz.
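
The effect can be reproduced with a small sketch. The shell function `writer` is a hypothetical stand-in for the C program: it sends its data to a hard-coded file and prints only a status message to stdout.

```shell
#!/bin/sh
tmp=$(mktemp -d)

# Hypothetical stand-in for the C program: data goes to a fixed file name,
# only the status message goes to standard output.
writer() {
    printf '0x1000\n0x1004\n' > "$tmp/test.out"
    echo "done..."
}

# The pipe only ever sees what writer prints to stdout.
writer | gzip -c > "$tmp/test.out.gz"

piped=$(gunzip -c < "$tmp/test.out.gz")   # what went through the pipe
on_disk=$(wc -l < "$tmp/test.out")        # what went to the plain file
echo "through the pipe: $piped"
echo "lines in test.out: $on_disk"

rm -rf "$tmp"
```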

Ideally you should change the C program to send its output to stdout. If you cannot do that, you can work around it in the following way:

1. Set up a named pipe to replace the normal output file:

Code:

rm test.out

mkfifo test.out

2. Have the compression program ready at the end of the pipe (the ampersand places it into the background so that you can keep using the shell):

Code:

gzip -c <test.out >test.out.gz &

3. Run the C program into the pipe:

Code:

./test
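
Put together as one script, the three steps might look like the sketch below. The function `writer` is a hypothetical stand-in for ./test that writes to the hard-coded path, and the temporary directory is only there to keep the example self-contained. The `wait` at the end is an addition for scripted use: it makes sure the background gzip has finished before the .gz file is read.

```shell
#!/bin/sh
tmp=$(mktemp -d)
fifo="$tmp/test.out"

# Hypothetical stand-in for ./test: opens the fixed path and writes to it.
writer() {
    i=0
    while [ "$i" -lt 1000 ]; do
        printf '0x%08x\n' "$i"
        i=$((i + 1))
    done > "$fifo"
}

mkfifo "$fifo"                       # step 1: named pipe in place of the file
gzip -c < "$fifo" > "$fifo.gz" &     # step 2: reader waiting in the background
writer                               # step 3: the program writes into the pipe
wait                                 # let gzip finish before using the .gz

lines=$(gunzip -c < "$fifo.gz" | wc -l)
echo "lines in compressed file: $lines"

rm -rf "$tmp"
```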

dimpu

09-27-2010 01:10 AM


Neo,

Thank you very much. You are just great and a wonderful person. I thought I would end up with no solution for my question but you did it all! Kudos to you and you deserve it.

If you have time, please let me know what this mkfifo does. I ran a test program and everything works just fine. I am ready to run a few large programs and hopefully they'll give me similar results.

I saw something weird happening, though. After I run mkfifo test.out, I don't see any disk space being consumed, but test.out (the output of my C program) is shown greyed out in the directory listing. When I use the head command to look at the contents of the greyed-out test.out it doesn't show anything, but the zipped file does have everything.

I use df -h to check how much disk space is being used.

neonsignal

09-27-2010 01:34 AM

Quote:

Originally Posted by dimpu (Post 4109940)

let me know what this mkfifo does

The mkfifo command creates a named pipe (FIFO): an object that can be used as a data pipe from a source to a sink. Although it looks like a file, it exists on the file system as a name only, and does not use up any other space. Any data put into it by one process goes into a memory buffer in the kernel, and is consumed by the process that you set up at the other end.
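
A quick way to check what kind of object test.out is, as a minimal sketch using a temporary directory: the `-p` test is true only for named pipes, and `ls -l` marks them with a leading `p` in the mode column.

```shell
#!/bin/sh
tmp=$(mktemp -d)
mkfifo "$tmp/test.out"

# -p succeeds only if the path is a named pipe.
if [ -p "$tmp/test.out" ]; then
    kind="named pipe"
else
    kind="something else"
fi

# The first character of the ls -l mode column is the file type.
ls -l "$tmp/test.out"
mode=$(ls -l "$tmp/test.out" | cut -c1)
echo "$kind (type character: $mode)"

rm -rf "$tmp"
```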

Quote:

Originally Posted by dimpu (Post 4109940)

when I use the head command to look at the content of test.out it doesn't show anything but the zipped file does have everything.

In the example, the gzip has already grabbed the data out of the pipe, so head does not see anything inside test.out (remember that test.out is now a pipe object, not a file).

dimpu

09-28-2010 01:15 AM


Neo, I want to know how many minutes or hours a benchmark (you know, they are huge programs) should take to run using your technique. I remember it used to take a few hours to produce the text file, but now it takes just a few seconds. Do you think it is working fine? Also, what command should I use to find the size of a single file?

Thanks

neonsignal

09-28-2010 02:00 AM

Quote:

Originally Posted by dimpu (Post 4111035)

I want to know how many minutes or hours a benchmark (you know, they are huge programs) should take to run using your technique.

Depends what is in the output. If it is highly repetitive, it might compress down a lot. What is the benchmark program you are using?

Quote:

Also I want to know what command should I use to find the size of a single file?

I'm not clear what you mean. Can't you just use 'ls -l'? If you mean the uncompressed length, then something like:

Code:

gunzip -c <test.out.gz | wc -c
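
For sizes in KB/MB/GB, a sketch along these lines may help. It assumes the `-h` option of ls and du as found on typical Linux systems, and the sample data is generated just for the example.

```shell
#!/bin/sh
tmp=$(mktemp -d)
f="$tmp/test.out.gz"

# Generate some repetitive sample data and compress it (made-up content).
awk 'BEGIN { for (i = 0; i < 100000; i++) print "0xdeadbeef" }' | gzip -c > "$f"

ls -lh "$f"     # file size in human-readable units (K, M, G)
du -h "$f"      # disk space the file actually occupies
gzip -l "$f"    # compressed size, uncompressed size and ratio

# Exact byte counts, without storing the uncompressed data anywhere.
compressed=$(wc -c < "$f")
uncompressed=$(gunzip -c < "$f" | wc -c)
echo "compressed: $compressed bytes, uncompressed: $uncompressed bytes"

rm -rf "$tmp"
```

The gzip format records the uncompressed length modulo 2^32, so for files above 4GB the figure from gzip -l may be wrong on some gzip versions; counting with wc -c as above avoids that.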

dimpu

09-28-2010 07:50 AM


Neo,

I am running several benchmarks, but I started with "Cactus". I wrote one program myself, and that took almost 5 hours to run.

The other program, which is the original one, took just a few seconds with zipped output, but if I run the same program without gzip it takes an hour or two, or a little more, so I am a little skeptical. This output is also the input to my memory simulator.

With the zipped output, the memory simulator reports a million instructions fetched, but with test.out written directly as a plain file it reports 100 million instruction fetches in just a few minutes.

It's quite possible that you are right. There is a lot of repetition, I believe.

In my second question I meant the size of a single file in GB or MB: suppose I have several files in my directory and I want to find the size of the output file test.out. What command should I use?

Thanks

chrism01

09-29-2010 01:30 AM

There's no way to predict the size of the compressed file; it depends on the content, the compression algorithm and the compression level. Both gzip and bzip2 take a compression-level flag with values from -1 to -9. Better compression = slower to create. Start with some small files and see how the trend goes. See the man pages.
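
A rough way to see the trend on a small sample, as a sketch: the address pattern generated by awk is made up and only stands in for real benchmark output, so the numbers will differ on actual data.

```shell
#!/bin/sh
tmp=$(mktemp -d)
sample="$tmp/sample.txt"

# Made-up sample of hexadecimal addresses standing in for real output.
awk 'BEGIN { for (i = 0; i < 50000; i++) printf "0x%08x\n", (i * 64) % 1048576 }' > "$sample"

orig=$(wc -c < "$sample")
fast=$(gzip -1 -c < "$sample" | wc -c)   # fastest, least compression
best=$(gzip -9 -c < "$sample" | wc -c)   # slowest, most compression

echo "original: $orig bytes"
echo "gzip -1:  $fast bytes"
echo "gzip -9:  $best bytes"

rm -rf "$tmp"
```

Prefixing each gzip command with `time` shows the speed side of the trade-off.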

All times are GMT -5.