Old 10-19-2013, 12:17 AM   #1
mitter1989
Member
 
Registered: Sep 2013
Posts: 47

Rep: Reputation: Disabled
How to remove the first 20 lines from a big file? What's the fastest way to do it?


Hi Folks,

I am using CentOS and I have a file of about 400 GB. I want to remove the first 20 lines from it.

I tried to do this with the sed command, but it takes hours to complete.

Please suggest the fastest command/way to achieve this.
 
Old 10-19-2013, 02:35 AM   #2
druuna
LQ Veteran
 
Registered: Sep 2003
Posts: 10,532
Blog Entries: 7

Rep: Reputation: 2405
A 400 GB file is very big and it will take time to process. You might even run into hardware limitations (memory and/or disk space) that make this very hard to do.

Perl's Tie::File module overcomes the memory problem and might be worth a try. Have a look at this:
Code:
#!/usr/bin/perl
use strict;
use warnings;
use Tie::File;

# Map the file to an array, one line per element, without
# loading the whole file into memory.
my @array;
tie @array, 'Tie::File', "/path/to/big_file" or die "tie failed: $!";
splice( @array, 0, 20 );    # remove the first 20 lines, in place
untie @array;

exit 0;
Replace /path/to/big_file with the file in question.

EDIT: Just to make sure:
Warning: Changes are made in place!! Make sure you have a backup!!
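For a file this big it may also be worth enlarging Tie::File's read cache via its documented memory option (in bytes). A minimal variation of the script above, assuming a bigger cache helps at this scale (untested):
Code:
#!/usr/bin/perl
# Same splice as above, but with a 256 MB Tie::File cache instead of
# the ~2 MB default -- an untested tuning guess, not a measured fix.
use strict;
use warnings;
use Tie::File;

tie my @lines, 'Tie::File', "/path/to/big_file", memory => 256 * 1024 * 1024
    or die "tie failed: $!";
splice @lines, 0, 20;    # drop the first 20 lines, in place
untie @lines;

exit 0;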

Last edited by druuna; 10-19-2013 at 04:16 AM.
 
Old 10-19-2013, 03:52 AM   #3
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,126

Rep: Reputation: 4120
Interesting ...
 
Old 10-19-2013, 05:48 AM   #4
mitter1989
Member
 
Registered: Sep 2013
Posts: 47

Original Poster
Rep: Reputation: Disabled

Hello Druuna,

Thanks for your response.

The script you shared is not that efficient; it is taking too much time.

I Googled and found this command, which is useful and quite efficient (tail -n +N prints from line N onward, so +21 drops the first 20 lines):
Code:
tail -n +21 file.txt > file2.txt
 
Old 10-19-2013, 06:11 AM   #5
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,126

Rep: Reputation: 4120
Tail still needs to read the entire file; I would be surprised if a properly constructed sed invocation were any slower.
One would hope an "update-in-place" solution would be significantly quicker, especially as file size increases, although streams tend not to lend themselves to this sort of processing. Perhaps "tie" needs a short-circuit.

Added to my "to-test" list.
 
Old 10-19-2013, 06:32 AM   #6
druuna
LQ Veteran
 
Registered: Sep 2003
Posts: 10,532
Blog Entries: 7

Rep: Reputation: 2405
@mitter1989: If that works for you then I'm glad, although I seriously doubt it does.

I've worked with very big files myself, and the solution you provided has the drawbacks I mentioned earlier. One obvious one is the need for a second file (which requires roughly 400 GB of free disk space).

I've also run into this error message when using tail on very big files: tail: There is not enough memory available now.

I'm starting to doubt that this file is actually 400 GB.

Maybe this is a theoretical question?
 
Old 10-19-2013, 09:27 AM   #7
rknichols
Senior Member
 
Registered: Aug 2009
Distribution: Rocky Linux
Posts: 4,779

Rep: Reputation: 2212
Quote:
Originally Posted by syg00
One would hope an "update-in-place" solution would be significantly quicker
No. sed's "in-place" option creates a new file and then renames it back to the original name. There is just no way to chop stuff from the beginning of a file without copying the rest, and copying back to the same file you are reading would be far too risky.
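For illustration, a minimal Perl sketch of that same stream-to-temp-then-rename pattern (the path and temp-file name are placeholders):
Code:
#!/usr/bin/perl
# Copy the file to a temp name, skipping the first 20 lines, then
# rename the copy over the original -- effectively what sed -i does.
use strict;
use warnings;

my $file = "/path/to/big_file";    # placeholder path
open my $in,  '<', $file        or die "open $file: $!";
open my $out, '>', "$file.tmp"  or die "open $file.tmp: $!";

while (my $line = <$in>) {
    print {$out} $line if $. > 20;    # $. is the input line number
}
close $in;
close $out or die "close: $!";

# The rename is atomic within one filesystem, but the copy still
# needs roughly the original file's size in free space.
rename "$file.tmp", $file or die "rename: $!";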
 
Old 10-19-2013, 09:57 AM   #8
jpollard
Senior Member
 
Registered: Dec 2012
Location: Washington DC area
Distribution: Fedora, CentOS, Slackware
Posts: 4,912

Rep: Reputation: 1513
Well... It can be done in place, though the data still has to be copied and written, and it will NOT even be as fast as copying the file. It will also be a custom application, as this doesn't happen very often; most 400 GB files are binary, not text.

What you have to do is read a buffer big enough to hold at least the first 20 lines. Trim those lines from the buffer and write the remainder back to the start of the file (you DID open it read-write). Now you know where the end of what you wrote is AND where the next input starts (right after what the first read consumed).

From there you alternate: seek to the read position and read the next chunk, then seek to the end of the updated area and write your buffer. When you reach the end of input, truncate the file to the new end position (a sketch follows at the end of this post).

This is NOT a fast process, as all the seeking plays havoc with read-ahead (a seek may flush some of the already-read input data).

Most importantly, if the program aborts before finishing, the data file is corrupted.
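A minimal Perl sketch of that in-place shuffle, under the same caveats (the chunk size is an arbitrary choice, and an abort midway corrupts the file):
Code:
#!/usr/bin/perl
# Remove the first 20 lines in place: slide the rest of the file
# toward the start in large chunks, then truncate the leftover tail.
# WARNING: if this aborts midway, the file is corrupted.
use strict;
use warnings;
use Fcntl qw(SEEK_SET);

my $file    = "/path/to/big_file";    # placeholder path
my $bufsize = 64 * 1024 * 1024;       # 64 MB chunks (arbitrary)

open my $fh, '+<', $file or die "open $file: $!";

# Consume the first 20 lines to find the byte offset where the
# surviving data begins.
<$fh> for 1 .. 20;
my $rpos = tell $fh;    # read position: just past line 20
my $wpos = 0;           # write position: start of file

while (1) {
    sysseek $fh, $rpos, SEEK_SET or die "seek: $!";
    my $n = sysread $fh, my $buf, $bufsize;
    die "read: $!" unless defined $n;
    last if $n == 0;    # end of input
    $rpos += $n;

    sysseek $fh, $wpos, SEEK_SET or die "seek: $!";
    # A robust version would loop here to handle partial writes.
    syswrite $fh, $buf or die "write: $!";
    $wpos += $n;
}

truncate $fh, $wpos or die "truncate: $!";    # drop the duplicate tail
close $fh or die "close: $!";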
 
Old 10-19-2013, 12:00 PM   #9
druuna
LQ Veteran
 
Registered: Sep 2003
Posts: 10,532
Blog Entries: 7

Rep: Reputation: 2405
Quote:
Originally Posted by rknichols
There is just no way to chop stuff from the beginning of a file without copying the rest, and copying back to the same file you are reading would be far too risky.
Perl's Tie::File module does just that; it also survives a Ctrl-C in the middle of the operation (checked and tried myself). I'm guessing it uses something like what jpollard describes.

As long as the machine you are working on can handle the tail or sed solution (memory and disk usage), those should be used; they are much faster than the Perl solution.

Using a not-so-big 16 GB ASCII file, tests show this:

Code:
sync ; time tail -n +21 infile > outfile   ->  2m44.529s
sync ; time sed -i '1,20d' infile          ->  3m28.205s
sync ; time rem20.pl                       ->  6m0.878s
 