LinuxQuestions.org
Old 12-17-2007, 06:35 AM   #1
horacioemilio
Member
 
Registered: Dec 2007
Posts: 61

Rep: Reputation: 15
Deleting lines from a file


Hi,

I need to write a program that reads an external text file. Each time it reads the file, it needs to delete some lines, for instance from the second line to the 55th. The file is really big, so what do you think is the fastest way to delete specific lines from a text file?

Thanks
 
Old 12-17-2007, 06:39 AM   #2
horacioemilio
Member
 
Registered: Dec 2007
Posts: 61

Original Poster
Rep: Reputation: 15
Sorry, I forgot: I meant in Python!


 
Old 12-17-2007, 06:44 AM   #3
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,695
Blog Entries: 5

Rep: Reputation: 240
Code:
# enumerate() counts from 0, so num 1..54 are lines 2..55 of the file
for num,line in enumerate(open("file")):
    if num > 0 and num <= 54: continue
    else: print line.strip()

Last edited by ghostdog74; 12-17-2007 at 08:11 AM.
 
Old 12-17-2007, 06:54 AM   #4
pixellany
LQ Veteran
 
Registered: Nov 2005
Location: Annapolis, MD
Distribution: Arch/XFCE
Posts: 17,802

Rep: Reputation: 728
Why Python??

Quote:
The file is really big, so what do you think is the fastest method to delete specific lines in a text file ?
I think SED might be the fastest....

sed '2,55 d' file >newfile

If the rest of the program is in Python, you can still call SED from within the program
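
A minimal sketch of what calling sed from Python could look like, assuming Python 3 and a sed binary on the PATH. The function name and its parameters are made up for illustration; it just runs the same '2,55d' command and sends sed's output to a new file.

```python
import shutil
import subprocess


def delete_lines_with_sed(src, dst, first=2, last=55):
    """Write src to dst without lines first..last (1-based, inclusive).

    Hypothetical helper: it only shells out to `sed 'FIRST,LASTd' src`.
    """
    if shutil.which("sed") is None:
        raise RuntimeError("sed not found on PATH")
    with open(dst, "wb") as out:
        # Argument list, no shell involved: sed's stdout goes to dst.
        subprocess.run(
            ["sed", "%d,%dd" % (first, last), src],
            stdout=out,
            check=True,
        )
```

Something like delete_lines_with_sed("file", "newfile") would then do the same as the command line above.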
 
Old 12-17-2007, 08:26 AM   #5
cconstantine
Member
 
Registered: Dec 2005
Distribution: RedHat, Ubuntu
Posts: 101

Rep: Reputation: 15
perhaps just "mark" the deleted lines?

This will depend on what you use the file for (that is, what program reads it), but you could just write an obvious marker (like "--DELETED--") at the front of the lines you want to delete. If the file is seriously huge, that will be much quicker, since whatever you use doesn't have to copy the rest of the file to "shift up" the lines you're keeping. Then just teach whatever reads the file to ignore the marked lines. Bonus points for cleaning the file up once in a while by actually making a pass to delete the marked lines...
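
A rough sketch of the marking idea in Python 3, under some loud assumptions: the marker is a single NUL byte rather than "--DELETED--" (so it fits in any non-empty line), live lines never start with that byte, and the names mark_deleted and live_lines are made up for illustration. The marker overwrites the first byte of each line, so nothing after it has to move.

```python
MARK = b"\x00"  # assumption: live lines never start with a NUL byte


def mark_deleted(path, first, last):
    """Overwrite the first byte of lines first..last (1-based) with MARK.

    Only the marked lines are written; the rest of the file is not copied.
    Empty lines have no byte to overwrite and are left as they are.
    """
    with open(path, "r+b") as f:
        for lineno in range(1, last + 1):
            start = f.tell()
            line = f.readline()
            if not line:
                break  # file is shorter than `last` lines
            if lineno >= first and line != b"\n":
                after = f.tell()
                f.seek(start)
                f.write(MARK)
                f.seek(after)


def live_lines(path):
    """Yield the lines a reader should still see, skipping marked ones."""
    with open(path, "rb") as f:
        for line in f:
            if not line.startswith(MARK):
                yield line
```

The occasional clean-up pass would then just be writing live_lines(path) out to a fresh file.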
 
Old 12-17-2007, 08:27 AM   #6
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,695
Blog Entries: 5

Rep: Reputation: 240
Quote:
Originally Posted by pixellany
Why Python??


I think SED might be the fastest....

sed '2,55 d' file >newfile
Code:
# wc -l file
15986709 file
# head -10 file
     1  this is a line
     2  this is a line
     3  this is a line
     4  this is a line
     5  this is a line
     6  this is a line
     7  this is a line
     8  this is a line
     9  this is a line
    10  this is a line
# tail -10 file
15986701        this is a line
15986702        this is a line
15986703        this is a line
15986704        this is a line
15986705        this is a line
15986706        this is a line
15986707        this is a line
15986708        this is a line
15986709        this is a line
# time sed '2,55d' file  > sedtest

real    1m22.018s
user    1m2.200s
sys     0m2.344s

# time ./test.py > pytest

real    1m9.778s
user    0m48.331s
sys     0m2.568s
# time sed '2,55d' file  > sedtest

real    1m19.134s
user    1m1.144s
sys     0m2.092s
# time ./test.py > pytest

real    1m9.406s
user    0m47.819s
sys     0m2.416s
# head -10 sedtest
     1  this is a line
    56  this is a line
    57  this is a line
    58  this is a line
    59  this is a line
    60  this is a line
    61  this is a line
    62  this is a line
    63  this is a line
    64  this is a line
# head -10 pytest
1       this is a line
56      this is a line
57      this is a line
58      this is a line
59      this is a line
60      this is a line
61      this is a line
62      this is a line
63      this is a line
64      this is a line
Quote:
If the rest of the program is in Python, you can still call SED from within the program
That would make the code not very "portable", IMO. Also, Python is a feature-rich language: whatever sed can do, Python can do too, and much more. Therefore, there's really no need to call sed from Python.

Last edited by ghostdog74; 12-17-2007 at 09:39 AM.
 
Old 12-17-2007, 10:31 AM   #7
matthewg42
Senior Member
 
Registered: Oct 2003
Location: UK
Distribution: Kubuntu 12.10 (using awesome wm though)
Posts: 3,530

Rep: Reputation: 62
The example python program doesn't actually create the same output as sed, and what if the input to the python program doesn't have the line number on each line?

Besides, as usual Perl is faster than both of them, averaging about 44 seconds on my system for similar input data, compared to about 62 seconds for Python, and 74 seconds for sed.

Here's a Perl solution:
Code:
perl -ne 'print if ($. > 55 || $. < 2);' file > output_file
 
Old 12-17-2007, 08:16 PM   #8
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian
Posts: 2,396

Rep: Reputation: 814
First about correctness:
The Python program doesn't rely on the line number being in the file: num plays the same role as $. in the Perl program, except that enumerate() counts from 0 while $. counts from 1. However, it uses strip(), which changes the lines. It should be written like this:

Code:
#!/usr/bin/env python
# enumerate() counts from 0, so num 1..54 covers file lines 2..55
for num,line in enumerate(open("file.in")):
    if num > 0 and num <= 54: continue
    else: print line,
Code:
$ head file.in
this the first line
this the second line
this the third line
this the fourth line
this the fifth line
this the sixth line
this the seventh line
this the eighth line
this the ninth line
this the tenth line
$ tail file.in
this the fifty-first line
this the fifty-second line
this the fifty-third line
this the fifty-fourth line
this the fifty-fifth line
this the fifty-sixth line
this the fifty-seventh line
this the fifty-eighth line
this the fifty-ninth line
this the sixtieth line
$ ./enum-lines.py 
this the first line
this the fifty-sixth line
this the fifty-seventh line
this the fifty-eighth line
this the fifty-ninth line
this the sixtieth line
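
For anyone on Python 3, the same filter might look roughly like the sketch below. The names drop_lines and filter_file are made up for illustration; the files are opened in binary mode on the assumption that the goal is to pass the kept lines through byte for byte, with no strip() and no newline translation.

```python
def drop_lines(lines, first=2, last=55):
    """Yield every line except lines first..last (1-based, inclusive)."""
    for lineno, line in enumerate(lines, start=1):
        if not first <= lineno <= last:
            yield line


def filter_file(src_path, dst_path, first=2, last=55):
    """Copy src_path to dst_path, leaving out lines first..last."""
    # Binary mode: the kept lines are written exactly as they were read.
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        dst.writelines(drop_lines(src, first, last))
```

Here enumerate(..., start=1) makes the counter line up with sed's and Perl's 1-based line numbers, which avoids the off-by-one juggling.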
Now about performance.

Oddly enough, I found sed to be faster (note that these timings are from before I noticed the strip() thing in the Python program; I'm too lazy to redo them now). I'm running a PIII 700 MHz, and I had top running. I noticed that sed never reached more than 80% CPU, whereas the Python program was above 90% CPU most of the time, and Perl did a bit better than the Python program here.
Code:
$ time sed '2,55 d' file.in > sedtest

real    0m42.410s
user    0m19.963s
sys     0m4.980s
$ time ./enum-lines.py  >enum-py-test

real    1m29.955s
user    1m16.050s
sys     0m6.247s
$ time perl -ne 'print if ($. > 55 || $. < 2);' file.in > perl-test

real    1m8.195s
user    0m49.502s
sys     0m4.981s
$ time sed '2,55 d' file.in > sedtest

real    0m43.264s
user    0m19.930s
sys     0m5.111s
$ time ./enum-lines.py  >enum-py-test

real    1m32.172s
user    1m17.388s
sys     0m5.975s

$ time perl -ne 'print if ($. > 55 || $. < 2);' file.in > perl-test

real    1m5.771s
user    0m49.297s
sys     0m5.169s
However it seems that a faster way is to use mmap:
Code:
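#!/usr/bin/env python
import os, mmap

startcut = 2   # first line to delete (1-based)
endcut = 55    # last line to delete (inclusive)

fd = os.open("file.in", os.O_RDWR)
filesize = os.fstat(fd).st_size
m = mmap.mmap(fd, filesize)

# skip the lines before the cut; pos ends up at the start of line startcut
lineno = 1
pos = 0
while lineno < startcut:
    pos = m.find(b'\n', pos) + 1
    lineno += 1

startpos = pos

# skip the lines being cut; pos ends up just past the end of line endcut
while lineno <= endcut:
    pos = m.find(b'\n', pos) + 1
    lineno += 1

# slide the tail of the file over the cut, then shrink the file
m.move(startpos, pos, filesize - pos)
m.resize(filesize - (pos - startpos))

m.close()
os.close(fd)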
Code:
$ time ./use-mmap.py 

real    0m25.349s
user    0m2.466s
sys     0m0.597s

$ time ./use-mmap.py 

real    0m24.235s
user    0m2.442s
sys     0m0.673s
I think this method won't be as great if you have to delete lines in the middle, and if you delete from the end, it should be rewritten to start searching from the end. If there is a find-equivalent function that searches for a byte instead of a string, it might be faster as well. Also, this code modifies the file in place; I'm not sure how much of a difference that makes...
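
For the delete-from-the-end case, a sketch along these lines might do, assuming Python 3, where mmap objects have an rfind() method. The function name drop_last_lines is made up for illustration. Since nothing after the cut needs to be kept, no move() is required; the file is simply truncated.

```python
import mmap
import os


def drop_last_lines(path, count):
    """Remove the last `count` lines of `path` in place."""
    with open(path, "r+b") as f:
        end = os.fstat(f.fileno()).st_size  # offset where the file will be cut
        if end == 0 or count <= 0:
            return  # nothing to do (and an empty file cannot be mapped)
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            for _ in range(count):
                # Search backwards for the newline that ends the previous
                # line; rfind() returns -1 when there is none, giving end == 0.
                end = m.rfind(b"\n", 0, end - 1) + 1
                if end == 0:
                    break
        f.truncate(end)
```

Only the tail of the file is ever touched, so the cost should depend on the size of the deleted part rather than on the size of the whole file.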

And of course cconstantine's suggestion would be fastest if you can use it.
 
Old 12-17-2007, 10:02 PM   #9
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,695
Blog Entries: 5

Rep: Reputation: 240
Quote:
Originally Posted by matthewg42
The example python program doesn't actually create the same output as sed,
Probably because of the strip(), which removes leading spaces as well.

Quote:
and what if the input to the python program doesn't have the line number on each line?
I created the input file with line numbers, using cat -n, so both programs use the same input file. I don't understand what you mean here.


Quote:
Besides, as usual Perl is faster than both of them, averaging about 44 seconds on my system for similar input data, compared to about 62 seconds for Python, and 74 seconds for sed.
Well, I probably should have given a better way to do it in Python. Here's one way it can be rewritten.
Code:
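infile = open("file1")
out = open('pytest', 'w')
# read exactly the first 55 lines and keep only line 1
# (readlines(100) is only a size hint, so it may return fewer than 55 lines)
head = [infile.readline() for _ in range(55)]
out.write(head[0])
# copy the rest in big batches of whole lines
while 1:
    lines = infile.readlines(100000)
    if not lines:
        break
    out.write(''.join(lines))
out.close()
infile.close()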
output:
Code:
# wc -l file1 
1000000 file1
# head -10 file1
     1  this is line
     2  this is line
     3  this is line
     4  this is line
     5  this is line
     6  this is line
     7  this is line
     8  this is line
     9  this is line
    10  this is line
# tail -10 file1
999991  this is line
999992  this is line
999993  this is line
999994  this is line
999995  this is line
999996  this is line
999997  this is line
999998  this is line
999999  this is line
1000000 this is line
# time perl -ne 'print if ($. > 55 || $. < 2);' file1 > perltest

real    0m1.479s
user    0m1.228s
sys     0m0.128s
# time sed '2,55 d' file1 > sedtest

real    0m3.692s
user    0m3.540s
sys     0m0.108s
# time ./test.py > pytest

real    0m0.503s
user    0m0.384s
sys     0m0.104s

# time awk 'NR<2 || NR>55{print}' file1 > awktest

real    0m1.377s
user    0m1.224s
sys     0m0.148s
# diff perltest sedtest
# diff perltest pytest
# diff perltest awktest
# time perl -ne 'print if ($. > 55 || $. < 2);' file1 > perltest

real    0m1.351s
user    0m1.248s
sys     0m0.096s
# time sed '2,55 d' file1 > sedtest

real    0m3.562s
user    0m3.440s
sys     0m0.116s
# time ./test.py > pytest

real    0m0.512s
user    0m0.396s
sys     0m0.116s
# time awk 'NR<2 || NR>55{print}' file1 > awktest

real    0m1.355s
user    0m1.264s
sys     0m0.088s
# diff perltest awktest
# diff perltest sedtest
# diff perltest pytest
# ls -l perltest awktest sedtest pytest |awk '{print $5,$9}'
19998921 awktest
19998921 perltest
19998921 pytest
19998921 sedtest
#
You may well come back with a better Perl version; however, the point I am illustrating is that comparing languages is often a "sensitive" topic, so I would rather stick to what the OP actually wants. He wanted a Python solution, so we give him one (not for homework of course, but even if it is, ...). I am not against people suggesting other languages, because ultimately the decision about which one to use lies with the OP himself, and because it's a forum. However, when people want to make claims like "this is better than that", the claims should be substantiated.
 
Old 12-17-2007, 10:06 PM   #10
makyo
Member
 
Registered: Aug 2006
Location: Saint Paul, MN, USA
Distribution: {Free,Open}BSD, CentOS, Debian, Fedora, Solaris, SuSE
Posts: 718

Rep: Reputation: 72
Hi.
Quote:
Originally Posted by cconstantine
this will depend on what you use the file for (that is, what program reads it). But you could just write some obviousness (like "--DELETED--") at the front of the lines you wanted to delete. If the file is seriously huge, that will be much quicker since whatever-you-use doesn't have to copy the rest of the file to "shift up" the lines you're keeping. Then just educate/extend whatever uses the file to understand it should ignore the marked lines. Bonus points for cleaning the file up once in a while by actually making a pass to delete the marked lines...
This is an interesting idea. However, it cannot be done with commands like GNU sed. The "in-place" option of GNU sed is a misnomer; it does not write in place:
Quote:
`--in-place[=SUFFIX]'
This option specifies that files are to be edited in-place. GNU
`sed' does this by creating a temporary file and sending output to
this file rather than to the standard output.

This option implies `-s'.

When the end of the file is reached, the temporary file is renamed
to the output file's original name. The extension, if supplied,
is used to modify the name of the old file before renaming the
temporary file, thereby making a backup copy.

-- excerpt from info sed
In other words, sed is doing a little of the work to save you the trouble of writing those actions in your shell script. The sed command is still writing the entire file (less any deletes and including any changes or additions) to the temporary.

I have not tried a real write in-place. I can think of it working like this: one would need to have the byte address of the line of interest -- by reading to that line, and remembering the byte position. You edit the line in memory, seek to the byte address, and write. I know there are some systems where that is possible, but I have not done it in *nix. You and the filesystem would need to agree to re-write the block in which the line is contained.

At any rate, it's not doable with sed as far as I know.

If the files are large enough and there are a lot of them, then this might justify creating a custom program to perform the action -- if the re-writing behavior is allowable by the filesystem. I quickly looked through Stevens' Advanced Programming in the UNIX Environment, and it looked like unbuffered IO might work, but someone with more experience would need to advise at this point ... cheers, makyo
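
As a sketch of what such a custom program might look like, here is one way to do it in Python 3 on a Unix-like system, using the unbuffered os.pread/os.pwrite calls. This is an illustration under stated assumptions, not a tested tool: line_span and delete_range_in_place are made-up names, and there is no protection against a crash halfway through. It remembers the byte addresses of the lines, then shifts the tail of the file up in chunks and truncates, without a temporary file.

```python
import os


def line_span(path, first, last):
    """Return byte offsets (start, end) covering lines first..last (1-based)."""
    start = end = 0
    with open(path, "rb") as f:
        for lineno, line in enumerate(f, start=1):
            if lineno > last:
                break
            if lineno < first:
                start += len(line)
            end += len(line)
    return start, end


def delete_range_in_place(path, start, end, chunk=1 << 20):
    """Delete bytes [start, end) from path by shifting the tail up."""
    fd = os.open(path, os.O_RDWR)
    try:
        size = os.fstat(fd).st_size
        read_pos, write_pos = end, start
        while read_pos < size:
            data = os.pread(fd, chunk, read_pos)
            if not data:
                break
            read_pos += len(data)
            view = memoryview(data)
            while view:  # pwrite may write fewer bytes than asked
                n = os.pwrite(fd, view, write_pos)
                view = view[n:]
                write_pos += n
        os.ftruncate(fd, write_pos)
    finally:
        os.close(fd)
```

The write position always trails the read position, so a chunk is never overwritten before it has been read. Everything after the deleted lines still gets rewritten, though, so this saves the temporary file rather than the copying.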
 
Old 12-17-2007, 10:30 PM   #11
pixellany
LQ Veteran
 
Registered: Nov 2005
Location: Annapolis, MD
Distribution: Arch/XFCE
Posts: 17,802

Rep: Reputation: 728
So, what do we have here???

OP has been left behind (let's hope that his/her question got answered....)

My limited knowledge has been exposed. (I already knew about that, so no upset there....)

I can be confident of one thing:
Writing "sed '2,55 d' file>newfile" took far less time than any other solution offered.
 
Old 12-17-2007, 10:53 PM   #12
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,695
Blog Entries: 5

Rep: Reputation: 240

Quote:
Originally Posted by pixellany
So, what do we have here???

OP has been left behind (let's hope that his/her question got answered....)
yes it has, in another news group

Quote:
I can be confident of one thing:
Writing "sed '2,55 d' file>newfile" took far less time than any other solution offered.
I was hoping you could show it, but well, it's your choice.
If a shell solution is desired for speed, maybe use the tail/head combination?
Code:
# more test.sh
#!/bin/sh
head -n 1 file1 > tailheadtest
tail -n +56 file1 >> tailheadtest
# time ./test.sh

real    0m0.150s
user    0m0.020s
sys     0m0.132s
# diff perltest tailheadtest
# diff perltest awktest
# diff awktest tailheadtest
# diff sedtest tailheadtest
# ls -l *test | awk '{print $5,$9}'
19998921 awktest
19998921 perltest
19998921 pytest
19998921 sedtest
19998921 tailheadtest

Last edited by ghostdog74; 12-17-2007 at 10:56 PM.
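
The head/tail trick carries over to Python fairly directly. Below is a sketch, assuming Python 3; copy_without_lines is a made-up name. Only the first 55 lines are handled one by one, and everything after them is copied as raw blocks with shutil.copyfileobj, so there is no per-line work on the bulk of the file.

```python
import shutil


def copy_without_lines(src_path, dst_path, first=2, last=55):
    """Copy src_path to dst_path, leaving out lines first..last (1-based)."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # Like `head`: walk the first `last` lines, keep those before `first`.
        for lineno in range(1, last + 1):
            line = src.readline()
            if not line:
                break  # file is shorter than `last` lines
            if lineno < first:
                dst.write(line)
        # Like `tail -n +56`: copy the remainder in 1 MiB blocks.
        shutil.copyfileobj(src, dst, 1024 * 1024)
```

No timing is claimed here; whether it keeps up with the shell version would have to be measured on the same file.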
 
Old 12-17-2007, 11:03 PM   #13
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 12,102

Rep: Reputation: 982
Maybe a cache/buffer flush might be in order between runs.
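
One way to approximate that from inside a Python 3 timing script on Linux is os.posix_fadvise with POSIX_FADV_DONTNEED, sketched below. The helper names are made up for illustration, and the call is only advice to the kernel, so it is an assumption that the pages really get dropped; dropping all caches system-wide needs root and is not shown here.

```python
import os
import time


def evict_from_page_cache(path):
    """Ask the kernel to drop cached pages of `path` (advisory only)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)  # dirty pages cannot be dropped until written back
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)


def timed_cold(func, path):
    """Return the seconds func(path) takes after trying to evict `path`."""
    evict_from_page_cache(path)
    t0 = time.perf_counter()
    func(path)
    return time.perf_counter() - t0
```

Each tool being compared would then be wrapped in a function and passed to timed_cold with the same input file, so that no run benefits from data cached by the previous one.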
 
Old 12-18-2007, 10:30 AM   #14
PAix
Member
 
Registered: Jul 2007
Location: United Kingdom, W Mids
Distribution: SUSE 11.0 as of Nov 2008
Posts: 195

Rep: Reputation: 40
Quote:
yes it has, in another news group
A rather impolite approach to asking people to spend time considering a problem.
 
Old 12-18-2007, 03:36 PM   #15
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 12,102

Rep: Reputation: 982
I see the same requests here and on the Gentoo fora all the time; I generally just pass on by.

But the other side of it is that some threads wander off and become interesting regardless of the OP. This one f'instance.
 
  


All times are GMT -5. The time now is 01:41 AM.
