[SOLVED] How to remove first 20 lines from a big file? Suggest fastest way to do it.
A 400 GB file is very big, and it will take time to process. You might even run into hardware limitations (memory and/or disk space) that make this very hard to do.
Perl's Tie::File module overcomes the memory problem and might be worth a try. Have a look at this:
Code:
#!/usr/bin/perl
use strict;
use warnings;
use Tie::File;

my @array;
tie @array, 'Tie::File', '/path/to/big_file' or die "tie failed: $!";
splice(@array, 0, 20);
untie @array;
exit 0;
Replace the /path/to/big_file part with the file in question.
EDIT: Just to make sure:
Warning: Changes are made in place!! Make sure you have a backup!!
Tail still needs to read the entire file - I would be surprised if a properly constructed sed invocation were any slower.
One would hope an "update-in-place" solution would be significantly quicker - especially as file size increases. Although streams tend not to lend themselves to this sort of processing. Perhaps "tie" needs a short-circuit.
@mitter1989: If that works for you then I'm glad, although I seriously doubt it works.
I've worked with very big files myself, and the solution you provided has the drawbacks I mentioned earlier. One obvious one is the need for a second file (which requires roughly 400 GB of free disk space).
I've also run into this error message when using tail on very big files: tail: There is not enough memory available now.
I'm starting to doubt whether this file is actually 400 GB.
One would hope an "update-in-place" solution would be significantly quicker
No. sed's "in-place" option creates a new file and then renames it back to the original name. There is just no way to chop stuff from the beginning of a file without copying the rest, and copying back to the same file you are reading would be far too risky.
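To make that concrete, here is a minimal Python sketch of the same copy-and-rename pattern that sed -i uses internally (the function name trim_first_lines_copy is made up for this example; it is an illustration of the idea, not a tuned tool):

```python
import os
import tempfile

def trim_first_lines_copy(path, n):
    # Stream the original file, skipping the first n lines, into a temp
    # file in the same directory, then rename it over the original --
    # the same pattern sed's "in-place" option uses internally.
    dirpath = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirpath)
    try:
        with os.fdopen(fd, "wb") as out, open(path, "rb") as src:
            for i, line in enumerate(src):
                if i >= n:
                    out.write(line)
        os.replace(tmp, path)  # atomic rename back to the original name
    except BaseException:
        os.unlink(tmp)  # don't leave a half-written temp file behind
        raise
```

Note that it needs enough free disk space for a full second copy of the file, which is exactly the drawback mentioned above.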
Well... it can be done in place - though the data still has to be copied and written, and it will not even be as fast as copying the file. It will also require a custom application, as this doesn't come up very often; most 400 GB files are binary, not text.
What you have to do is keep a buffer big enough to hold the entire first 20+ lines. Trim the buffer and write it back to the file (you DID open it read-write). Now you know where the end of what you wrote is, AND you know where the new input is (after all, it is at the end of the first buffer you read).
You end up seeking to the new data to read it in, then seeking to the end of the updated area to write your buffer. When you reach the end of input, you truncate the file to the new end location.
This is NOT a fast process, as it causes all kinds of headache for read-ahead (seeking may flush some of the input data).
Most importantly, if the program aborts before finishing then the data file is corrupted.
There is just no way to chop stuff from the beginning of a file without copying the rest, and copying back to the same file you are reading would be far too risky.
Perl's Tie::File module does just that, and it also survives a Ctrl-C in the middle of the operation (checked and tried myself). I'm guessing it uses something like what jpollard described.
As long as the machine you are working on can handle the tail or sed solution (memory and disk usage), those should be used; they are much faster than the Perl solution.
Using a not-so-big 16 GB ASCII file, tests show this:
Code:
sync ; time tail -n +21 infile > outfile  ->  2m44.529s
sync ; time sed -i '1,20d' infile         ->  3m28.205s
sync ; time rem20.pl                      ->  6m0.878s