[SOLVED] How to read content of .gz file without extracting it anywhere, not even STDOUT!
Hello there!
I work with plain-text log files that are roughly 4-5 GB in size, and I need to pull out a few lines (say 30000) from these files, as per my requirement. Now, to save space, I compress these files to .gz, which takes hardly 300 MB for the same 4-5 GB file and saves a great amount of space.
So it would be good if I could read such .gz files directly, without extracting them anywhere.
I have tried zcat, zless, and zmore, but these tools uncompress the file to /dev/null. I don't want even that extraction.
So can anyone tell me, is it possible?
Any help is highly appreciated.
(..)I need to chop few lines (..) if I could read such .gz files directly, without extracting it anywhere. I have tried zcat, zless, zmore: but these shells, uncompresses the file to /dev/null. I don't want even this extraction.
More things than you think are "compressed" (the kernel, music, ebooks, OOo documents, etc.), and all of those need to be decompressed, either to work at all or to work on the stream, yet you wouldn't mind using them, right? Decompressing to /dev/null doesn't create temporary files the way "in place" editing would, so I don't understand why it bothers you. Do explain.
Well, thanks for quick reply.
What you said is indeed true.
Here, running zless on some .gz file ends up doing exec gzip -d -c "$1" 2>/dev/null, which means output is written to /dev/null.
What I want is to use the compressed file directly.
Say, if I write head -n 40000 file.gz | tail -n 20000, then it shows some weird output, which one cannot read.
Well, maybe I'm asking for something that is practically not possible.
Thanks.
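For what it's worth, the standard streaming approach never lands the decompressed data on disk: zcat writes to stdout, and head/tail select the line window from the pipe. A minimal sketch (the file names here are made up for illustration):

```shell
#!/bin/sh
# Build a small sample compressed log (stand-in for the 4-5 GB file).
seq 1 100000 > sample.log
gzip -f sample.log                  # leaves only sample.log.gz

# Lines 20001-40000, streamed: the decompressed data lives only in the
# pipe buffers, never as a file on disk.
zcat sample.log.gz | head -n 40000 | tail -n 20000 > window.txt

wc -l < window.txt                  # 20000 lines
rm -f sample.log.gz window.txt
```

The crucial difference from running head -n 40000 file.gz directly is that here head sees the decompressed stream, not the raw gzip bytes.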
You could try splitting large .gz files into volumes of, say, 100 MB and uncompressing only the one you need... I'm not sure this will work, but if each volume has its own dictionary you may get readable text.
@sunnydrake
Hi there.
Well, to find the small chunk I'm looking for, I'd first have to search through the entire archive before I could split it wherever required.
So, in the end, the same question.
I don't want to uncompress my archives by any means.
Thanks.
Smaller archives take less memory and disk space to process, so you'd get a smaller system load than searching one large file.
Your search procedure would be something like: for each file in the set, zcat it, and exit once the match is found.
The basic idea of an archive is to minimize file size by finding repeating patterns and reusing them; say the character '7' stands for '123456', which occurs in the file (while '7' itself does not), so without uncompressing it, the data is effectively scrambled.
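That per-volume search loop might look like this in shell (the volume names are whatever split generates; the search pattern is a made-up example):

```shell
#!/bin/sh
# Split a log into volumes, compress each on its own, then search
# volume by volume and stop at the first match.
seq 1 10000 > big.log
split -l 2500 big.log vol.          # vol.aa vol.ab vol.ac vol.ad
gzip -f vol.*                       # each volume is its own .gz

for f in vol.*.gz; do
    if zcat "$f" | grep -q '^7777$'; then
        echo "found in $f"          # 7777 lies in lines 7501-10000
        break                       # later volumes are never decompressed
    fi
done
rm -f big.log vol.*.gz
```

Since each volume is a complete gzip stream, any one of them can be decompressed without touching the others, which is the whole point of the suggestion.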
Last edited by sunnydrake; 09-08-2011 at 07:50 PM.
After reading over what you are really trying to do, the answer is no, you cannot do it. The reason is that you don't have a utility that can decompress only part of the original compressed file(s). The file or files are either compressed, or not. If they are, then you can't work with them directly unless they're fully decompressed first. Hope that sums it up well enough for you, man.
According to the explanation under "CGrep Library", it can search inside compressed files with no need for decompression.
Now, all I want to know is how to get this utility.
I'm using Ubuntu 10.04 and tried to run the cgrep command, but it says "command not found", which means I have to install some package first.
Please tell me how to get those packages and how to install them.
I tried apt-get install cgrep, but no such package exists.
Desperately waiting for help.
Thanks a ton.
Yeah, I wrote it in my post itself that zcat works.
But it extracts the file first.
What do you mean by “it extracts the file first” in detail? You need a temporary space in /tmp or so to hold the 4-5GB of data? This is not the way it should be.
The problem with the compressed file is that it doesn't compress each line as a single record. For your case it would be better to have a compression scheme like the one used for music streams: you can start at any point, and at the next frame boundary at the latest you can interpret the compressed data.
Thank you very much corp769 and everyone.
@Reuti: I meant that I want to operate directly on compressed files without extracting them.
Anyways, things are working just fine.
This was my first post here on LinuxQuestions.org and got very quick reply.
Thanks again.
What is working fine, in detail? The link about searching in compressed files you pointed to is interesting, but it requires the file to be compressed by their huffm in order to be searched by their cgrep (both available via the links on the page). To me it looks like the cgrep corp769 pointed to is a different one. But according to the paper, you could investigate zgrep, though it won't speed things up much if it's really just a combination of grep and gunzip.
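On the zgrep point: it is indeed just a wrapper that pipes gunzip -c into grep, so it searches the compressed file without any temporary decompressed copy. A quick check (the sample data and pattern are invented):

```shell
#!/bin/sh
# zgrep searches inside the .gz without extracting it to disk.
seq 1 1000 | gzip -c > app.log.gz   # sample compressed log
zgrep -c '^99[0-9]$' app.log.gz     # counts lines 990-999, prints 10
rm -f app.log.gz
```

As noted, this saves disk but not CPU: every search still decompresses the whole stream internally.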
Just thinking, I'm not sure if it would work, but how about a named pipe?
As the decompression operation is feeding the fifo on one end, grep or something could be reading it on the other for the lines you want, and discarding the rest.
Only a small part of the file should then be in memory at any given time.
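A quick sketch of that named-pipe idea (the file name and pattern are hypothetical): the decompressor writes into the FIFO in the background while grep consumes it on the other end, so only the pipe buffer is ever held in memory.

```shell
#!/bin/sh
# Feed a FIFO from gzip and read it with grep; nothing decompressed
# touches the disk.
seq 1 1000 | gzip -c > logs.gz      # sample data standing in for the real log

mkfifo logpipe
gzip -dc logs.gz > logpipe &        # writer: decompress into the FIFO
grep '^42$' logpipe                 # reader: consumes the stream, prints 42

wait
rm -f logpipe logs.gz
```

Functionally this is equivalent to the zcat | grep pipeline, just with the pipe given a name in the filesystem so the two ends can be separate commands.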