[SOLVED] How to remove first 20 lines from a big file? Suggest fastest way to do it.
A 400 GB file is very big, and it will take time to process. You might even run into hardware limitations (memory and/or disk space) that make this very hard to do.
Perl's Tie::File module overcomes the memory problem and might be worth a try. Have a look at this:
Code:
#!/usr/bin/perl
use strict;
use warnings;
use Tie::File;

my @array;
tie @array, 'Tie::File', '/path/to/big_file' or die "tie failed: $!";
splice(@array, 0, 20);
untie @array;
exit 0;
Replace the /path/to/big_file part with the file in question.
EDIT: Just to make sure:
Warning: Changes are made in place!! Make sure you have a backup!!
Tail still needs to read the entire file - I would be surprised if a properly constructed sed invocation were any slower.
One would hope an "update-in-place" solution would be significantly quicker - especially as file size increases. Although streams tend not to lend themselves to this sort of processing. Perhaps "tie" needs a short-circuit.
@mitter1989: If that works for you then I'm glad, although I seriously doubt it works.
I've worked with very big files myself, and the solution you provided has the drawbacks I mentioned earlier. One obvious one is the need for a second file (which requires roughly 400 GB of free disk space).
I've also run into this error message when using tail on very big files: tail: There is not enough memory available now.
I'm starting to doubt whether this file is actually 400 GB.
One would hope an "update-in-place" solution would be significantly quicker
No. sed's "in-place" option creates a new file and then renames it back to the original name. There is just no way to chop stuff from the beginning of a file without copying the rest, and copying back to the same file you are reading would be far too risky.
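To make that concrete, here is a minimal Python sketch of the same copy-and-rename pattern that sed -i uses internally (the function name trim_first_lines_copy is made up for this example; it is an illustration of the idea, not a tuned tool):

```python
import os
import tempfile

def trim_first_lines_copy(path, n):
    # Stream the original file, skipping the first n lines, into a temp
    # file in the same directory, then rename it over the original --
    # the same pattern sed's "in-place" option uses internally.
    dirpath = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirpath)
    try:
        with os.fdopen(fd, "wb") as out, open(path, "rb") as src:
            for i, line in enumerate(src):
                if i >= n:
                    out.write(line)
        os.replace(tmp, path)  # atomic rename back to the original name
    except BaseException:
        os.unlink(tmp)  # don't leave a half-written temp file behind
        raise
```

Note that it needs enough free disk space for a full second copy of the file, which is exactly the drawback mentioned above.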
Well... it can be done in place - though the data still has to be copied and written, and it will not even be as fast as copying the file. It will also require a custom application, as this doesn't come up very often; most 400 GB files are binary, not text.
What you have to do is keep a buffer big enough to hold the entire first 20+ lines. Trim the buffer and write it back to the file (you DID open it read-write). Now you know where the end of what you wrote is, AND you know where the new input is (after all, it is at the end of the first buffer you read).
You end up seeking to the new data to read it in, then seeking to the end of the updated area to write your buffer. When you reach the end of input, you truncate the file to the new end location.
This is NOT a fast process, as it causes all kinds of headache for read-ahead (seeking may flush some of the input data).
Most importantly, if the program aborts before finishing then the data file is corrupted.
There is just no way to chop stuff from the beginning of a file without copying the rest, and copying back to the same file you are reading would be far too risky.
Perl's Tie::File module does just that, and it also survives a Ctrl-C in the middle of the operation (checked and tried myself). I'm guessing it uses something like what jpollard described.
As long as the machine you are working on can handle the tail or sed solution (memory and disk usage), those should be used; they are much faster than the Perl solution.
Using a not-so-big 16 GB ASCII file, tests show this:
Code:
sync ; time tail -n +21 infile > outfile  ->  2m44.529s
sync ; time sed -i '1,20d' infile         ->  3m28.205s
sync ; time rem20.pl                      ->  6m0.878s