I need to write a program that reads an external text file. After each read, it needs to delete some lines, for instance the 2nd through the 55th. The file is really big, so what do you think is the fastest way to delete specific lines from a text file?
This will depend on what you use the file for (that is, what program reads it). But you could just write an obvious marker (like "--DELETED--") at the front of the lines you want to delete. If the file is seriously huge, that will be much quicker, since whatever you use doesn't have to copy the rest of the file to "shift up" the lines you're keeping. Then just extend whatever uses the file so that it ignores the marked lines. Bonus points for cleaning the file up once in a while by actually making a pass to delete the marked lines...
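For what it's worth, the reading side of this idea could look something like the sketch below. It assumes the marker is the literal "--DELETED--" from the example and that it sits at the very start of a line; live_lines is a made-up helper name, not part of any library.

```python
MARKER = "--DELETED--"  # assumed marker text, taken from the example in the post


def live_lines(path):
    """Yield only the lines that have not been marked as deleted."""
    with open(path) as f:
        for line in f:
            if not line.startswith(MARKER):
                yield line


if __name__ == "__main__":
    import sys

    # Usage: python live_lines.py somefile
    # Prints the file as a reader that understands the marker would see it.
    for line in live_lines(sys.argv[1]):
        sys.stdout.write(line)
```

Whatever program consumes the file would iterate over live_lines(path) instead of the raw file, so marked lines simply never show up.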
# wc -l file
15986709 file
# head -10 file
1 this is a line
2 this is a line
3 this is a line
4 this is a line
5 this is a line
6 this is a line
7 this is a line
8 this is a line
9 this is a line
10 this is a line
# tail -10 file
15986701 this is a line
15986702 this is a line
15986703 this is a line
15986704 this is a line
15986705 this is a line
15986706 this is a line
15986707 this is a line
15986708 this is a line
15986709 this is a line
# time sed '2,55d' file > sedtest
real 1m22.018s
user 1m2.200s
sys 0m2.344s
# time ./test.py > pytest
real 1m9.778s
user 0m48.331s
sys 0m2.568s
# time sed '2,55d' file > sedtest
real 1m19.134s
user 1m1.144s
sys 0m2.092s
# time ./test.py > pytest
real 1m9.406s
user 0m47.819s
sys 0m2.416s
# head -10 sedtest
1 this is a line
56 this is a line
57 this is a line
58 this is a line
59 this is a line
60 this is a line
61 this is a line
62 this is a line
63 this is a line
64 this is a line
# head -10 pytest
1 this is a line
56 this is a line
57 this is a line
58 this is a line
59 this is a line
60 this is a line
61 this is a line
62 this is a line
63 this is a line
64 this is a line
Quote:
If the rest of the program is in Python, you can still call SED from within the program
That would make the code not very portable, IMO. Also, Python is a feature-rich language: whatever sed can do, Python can do too, and much more. Therefore, there's really no need to call sed from Python.
Last edited by ghostdog74; 12-17-2007 at 09:39 AM.
The example python program doesn't actually create the same output as sed, and what if the input to the python program doesn't have the line number on each line?
Besides, as usual Perl is faster than both of them, averaging about 44 seconds on my system for similar input data, compared to about 62 seconds for Python, and 74 seconds for sed.
First about correctness:
The Python program doesn't rely on the line number being in the file: num plays the role of $. in the Perl program (except that enumerate() counts from 0, while $. counts from 1). However, it uses strip(), which changes the lines. It should be written like this:
Code:
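#!/usr/bin/env python
import sys
# enumerate() counts from 0: num 0 is line 1, so lines 2..55 are num 1..54.
# (The condition as first posted, num <= 55, also dropped line 56; see the run below.)
for num, line in enumerate(open("file.in")):
    if 1 <= num <= 54:
        continue  # skip lines 2 through 55
    sys.stdout.write(line)  # emit the line unchanged, newline included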
Code:
$ head file.in
this the first line
this the second line
this the third line
this the fourth line
this the fifth line
this the sixth line
this the seventh line
this the eighth line
this the ninth line
this the tenth line
$ tail file.in
this the fifty-first line
this the fifty-second line
this the fifty-third line
this the fifty-fourth line
this the fifty-fifth line
this the fifty-sixth line
this the fifty-seventh line
this the fifty-eighth line
this the fifty-ninth line
this the sixtieth line
$ ./enum-lines.py
this the first line
this the fifty-seventh line
this the fifty-eighth line
this the fifty-ninth line
this the sixtieth line
Now about performance.
Oddly enough, I found sed to be faster (note that these timings are from before I noticed the strip() issue in the Python program; I'm too lazy to redo them now). I'm running a PIII 700 MHz, and I had top running. I noticed that sed never reached more than 80% CPU, whereas the Python program was above 90% CPU most of the time, and Perl did a bit better than the Python program here.
Code:
$ time sed '2,55 d' file.in > sedtest
real 0m42.410s
user 0m19.963s
sys 0m4.980s
$ time ./enum-lines.py >enum-py-test
real 1m29.955s
user 1m16.050s
sys 0m6.247s
$ time perl -ne 'print if ($. > 55 || $. < 2);' file.in > perl-test
real 1m8.195s
user 0m49.502s
sys 0m4.981s
$ time sed '2,55 d' file.in > sedtest
real 0m43.264s
user 0m19.930s
sys 0m5.111s
$ time ./enum-lines.py >enum-py-test
real 1m32.172s
user 1m17.388s
sys 0m5.975s
$ time perl -ne 'print if ($. > 55 || $. < 2);' file.in > perl-test
real 1m5.771s
user 0m49.297s
sys 0m5.169s
However it seems that a faster way is to use mmap:
$ time ./use-mmap.py
real 0m25.349s
user 0m2.466s
sys 0m0.597s
$ time ./use-mmap.py
real 0m24.235s
user 0m2.442s
sys 0m0.673s
I think this method won't be as great if you have to delete lines in the middle, and if you delete from the end, it should be rewritten to start searching from the end. If there is a find-like function that searches for a single byte instead of a string, it might be faster as well. Also, this code modifies the file in place; I'm not sure how much of a difference that makes...
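The use-mmap.py script itself is not shown in the thread, so the following is only a guess at the general approach, not the code that was timed: map the file, use find() to locate where the unwanted range starts and ends, slide the rest of the file down with move(), then truncate. The names line_start and delete_lines_mmap are invented for this sketch.

```python
import mmap
import os


def line_start(m, line_no):
    """Return the byte offset at which 1-based line `line_no` begins.

    If the file has fewer lines than that, return len(m) (end of file).
    """
    pos = 0
    for _ in range(line_no - 1):
        newline = m.find(b"\n", pos)
        if newline == -1:
            return len(m)
        pos = newline + 1
    return pos


def delete_lines_mmap(path, first, last):
    """Delete lines first..last (1-based, inclusive) from `path` in place."""
    with open(path, "r+b") as f:
        size = os.fstat(f.fileno()).st_size
        if size == 0:
            return  # an empty file cannot be mapped and has nothing to delete
        with mmap.mmap(f.fileno(), 0) as m:
            start = line_start(m, first)
            end = line_start(m, last + 1)
            if end > start and size > end:
                # Slide everything after the deleted range down over it.
                m.move(start, end, size - end)
            m.flush()
        # Cut off the leftover bytes at the end of the file.
        f.truncate(size - (end - start))
```

Something like delete_lines_mmap("file.in", 2, 55) would correspond to sed '2,55d'. The whole tail of the file still has to be moved, which fits the remark above that deleting near the end (with a search from the end) would be the cheaper case.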
And of course cconstantine's suggestion would be fastest if you can use it.
Quote:
The example python program doesn't actually create the same output as sed,
Probably because of the strip(), which removes leading spaces as well.
Quote:
and what if the input to the python program doesn't have the line number on each line?
I created the input file with line numbers, using cat -n, so both programs use the same input file. I don't understand what you mean here.
Quote:
Besides, as usual Perl is faster than both of them, averaging about 44 seconds on my system for similar input data, compared to about 62 seconds for Python, and 74 seconds for sed.
Well, I probably should have given a better way to do it in Python. Here's one way it can be rewritten.
Code:
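#!/usr/bin/env python
with open("file1") as src, open("pytest", "w") as out:
    # Read exactly the first 55 lines. readlines(100) only takes a size hint,
    # so it may return fewer lines than the slice below needs.
    head = [src.readline() for _ in range(55)]
    del head[1:55]  # drop lines 2..55, keep line 1
    out.write("".join(head))
    while True:
        # Copy the rest in large chunks rather than one line at a time.
        lines = src.readlines(100000)
        if not lines:
            break
        out.write("".join(lines))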
output:
Code:
# wc -l file1
1000000 file1
# head -10 file1
1 this is line
2 this is line
3 this is line
4 this is line
5 this is line
6 this is line
7 this is line
8 this is line
9 this is line
10 this is line
# tail -10 file1
999991 this is line
999992 this is line
999993 this is line
999994 this is line
999995 this is line
999996 this is line
999997 this is line
999998 this is line
999999 this is line
1000000 this is line
# time perl -ne 'print if ($. > 55 || $. < 2);' file1 > perltest
real 0m1.479s
user 0m1.228s
sys 0m0.128s
# time sed '2,55 d' file1 > sedtest
real 0m3.692s
user 0m3.540s
sys 0m0.108s
# time ./test.py > pytest
real 0m0.503s
user 0m0.384s
sys 0m0.104s
# time awk 'NR<2 || NR>55{print}' file1 > awktest
real 0m1.377s
user 0m1.224s
sys 0m0.148s
# diff perltest sedtest
# diff perltest pytest
# diff perltest awktest
# time perl -ne 'print if ($. > 55 || $. < 2);' file1 > perltest
real 0m1.351s
user 0m1.248s
sys 0m0.096s
# time sed '2,55 d' file1 > sedtest
real 0m3.562s
user 0m3.440s
sys 0m0.116s
# time ./test.py > pytest
real 0m0.512s
user 0m0.396s
sys 0m0.116s
# time awk 'NR<2 || NR>55{print}' file1 > awktest
real 0m1.355s
user 0m1.264s
sys 0m0.088s
# diff perltest awktest
# diff perltest sedtest
# diff perltest pytest
# ls -l perltest awktest sedtest pytest |awk '{print $5,$9}'
19998921 awktest
19998921 perltest
19998921 pytest
19998921 sedtest
#
You might well come back with a better Perl version; however, the point I am illustrating is that comparing languages is often a "sensitive" topic. So I would rather stick to what the OP actually wants. He wanted a Python solution, so we give him one (not for homework, of course, but even if it is, ...). I am not against people suggesting other languages, because ultimately the decision about which one to use lies with the OP himself, and because it's a forum. However, when people want to make claims like "this is better than that", they should substantiate them.
Quote:
This will depend on what you use the file for (that is, what program reads it). But you could just write an obvious marker (like "--DELETED--") at the front of the lines you want to delete. If the file is seriously huge, that will be much quicker, since whatever you use doesn't have to copy the rest of the file to "shift up" the lines you're keeping. Then just extend whatever uses the file so that it ignores the marked lines. Bonus points for cleaning the file up once in a while by actually making a pass to delete the marked lines...
This is an interesting idea. However, it cannot be done with commands like GNU sed. The "in-place" option of GNU sed is a misnomer; it does not write in place:
Quote:
`--in-place[=SUFFIX]'
This option specifies that files are to be edited in-place. GNU
`sed' does this by creating a temporary file and sending output to
this file rather than to the standard output.(1).
This option implies `-s'.
When the end of the file is reached, the temporary file is renamed
to the output file's original name. The extension, if supplied,
is used to modify the name of the old file before renaming the
temporary file, thereby making a backup copy(2)).
-- excerpt from info sed
In other words, sed is doing a little of the work to save you the trouble of writing those actions in your shell script. The sed command is still writing the entire file (less any deletes and including any changes or additions) to the temporary.
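To make that concrete, here is a rough Python imitation of the temporary-file-and-rename pattern the info page describes. It is only a sketch of the pattern, not of sed's actual implementation; delete_lines_via_temp is a made-up name, and details such as preserving the original file's permissions or making a backup copy are left out.

```python
import os
import tempfile


def delete_lines_via_temp(path, first, last):
    """Drop lines first..last (1-based, inclusive) by rewriting via a temp file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary lives in the same directory so the final rename stays
    # on one filesystem.
    with open(path, "rb") as src, tempfile.NamedTemporaryFile(
            "wb", dir=directory, delete=False) as tmp:
        for line_no, line in enumerate(src, start=1):
            if not first <= line_no <= last:
                tmp.write(line)  # every kept line is written out again
    # Rename the temporary over the original name.
    os.replace(tmp.name, path)
```

Note that every line that is kept gets written again, so the cost grows with the size of the whole file, not with the number of deleted lines.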
I have not tried a real write in-place. I can think of it working like this: one would need to have the byte address of the line of interest -- by reading to that line, and remembering the byte position. You edit the line in memory, seek to the byte address, and write. I know there are some systems where that is possible, but I have not done it in *nix. You and the filesystem would need to agree to re-write the block in which the line is contained.
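As a rough illustration of that read, remember, seek and write sequence, here is a minimal Python sketch. It assumes an ordinary regular file opened for update and a replacement that is no longer than the line it lands on, so the file length never changes; mark_line_in_place and its default marker are made up for the example.

```python
def mark_line_in_place(path, line_no, marker=b"--DELETED--"):
    """Overwrite the start of 1-based line `line_no` with `marker`.

    Returns True on success, or False if the line does not exist or is
    shorter than the marker (the overwrite must not change the file length).
    """
    with open(path, "r+b") as f:
        offset = 0  # byte address of the line currently being read
        for current, line in enumerate(f, start=1):
            if current == line_no:
                if len(line.rstrip(b"\r\n")) < len(marker):
                    return False
                f.seek(offset)   # jump back to the remembered byte address
                f.write(marker)  # overwrite in place; nothing after it moves
                return True
            offset += len(line)
    return False
```

Only the bytes of the marker are written; the rest of the file is read up to the target line but never rewritten.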
At any rate, it's not doable with sed as far as I know.
If the files are large enough and there are a lot of them, then this might justify creating a custom program to perform the action -- if the re-writing behavior is allowable by the filesystem. I quickly looked through Stevens' Advanced Programming in the UNIX Environment, and it looked like unbuffered IO might work, but someone with more experience would need to advise at this point ... cheers, makyo