I have a huge file with random numbers, such as:
122133411
213332213
and so on...
I used sed '/something/d' to delete some lines with specific occurrences, but it turned out I was looking at 90x7 different commands to do what I want, which is to delete every line in which any digit occurs a 4th time. Can you help me? Thanks in advance...
A lot of languages can do this.
The basic steps would be to take each line, count the number of occurrences of a specific character in that line, and check if it's too many; if it is, don't print the line.
Two examples with Python: if '3' occurs fewer than 2 times, print it out.
Code:
with open('file.txt') as file:
    for line in file:
        if not line.count('3') >= 2:
            print(line, end='')

with open('file.txt') as file:
    print(*(line for line in file if not line.count('3') >= 2), sep='', end='')
I am sorry, I just started learning Python yesterday, actually... I have no idea how to implement this code or work with it. Can you help me out? I mean, if a string has 123421131 I want it deleted for having four "1"s, and the same with any other digit...
Ah, I thought you meant the 4th repetition of the whole line.
Detecting the 4th repetition within a line can be done with a regular expression that uses a backreference (repeated 3 times after the capture), in sed or grep.
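A sketch of that approach (assuming grep with ERE support via -E; file names here are stand-ins for the OP's file): capture any digit, require three further occurrences of that same digit later in the line, and invert the match with -v so those lines are dropped.

```shell
# Sample input: the 4th occurrence of '1' (line 1) and of '3' (line 2)
# should get those lines deleted; 123456789 has no repeats and survives.
printf '%s\n' 122133411 213332213 123456789 > file.txt

# ([0-9]) captures any digit; (.*\1){3} demands three further
# occurrences of that same digit, i.e. a 4th occurrence in total.
# -v inverts the match, so matching lines are deleted.
grep -vE '([0-9])(.*\1){3}' file.txt
# prints: 123456789
```

The same pattern works as a delete command in sed: sed -E '/([0-9])(.*\1){3}/d' file.txt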
I think the OP had better sit down and clearly define the requirements.
Maybe, just maybe, it isn't just the first character. Or maybe it is. Who knows.
Code:
echo 987654321 122133411 213332213 123456789 | gawk 'BEGIN { FS = "" ; RS = " "}
{drop = 0
for (i = 1; i <= NF; i++) {a[$i]++}
for (i in a) if (a[i] > 3) drop = 1
if (drop != 1) print $0
delete a
}'
If you have RS=" " then the ending \n from the echo will be processed as a character (in effect it only outputs an extra line feed).
So either use echo -n, or allow \n in RS; that also makes it more versatile because it allows the multi-line input as in post #1.
Last but not least, the awk code can be simplified: increment and test at the same time.
Code:
echo 987654321 122133411 213332213 123456789 | gawk 'BEGIN { FS = "" ; RS = "[ \n]"}
{drop = 0
for (i = 1; i <= NF; i++) if (++a[$i] >= 4) drop = 1
if (!drop) print
delete a
}'
If you want to remove "occurrences of a whole line," in a very large file, consider sorting the file. Now, all occurrences of the same value will be consecutive, and identifying/removing duplicates is trivial: you need only compare a record to its immediate predecessor. It is equally trivial to recognize gaps, to "merge" identically-sorted files, and so on.
Sorting is a very heavily-studied group of algorithms (hence Dr. Knuth's Sorting and Searching), and can be performed very rapidly.
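A sketch of the sort-then-compare idea with standard tools (numbers.txt is a stand-in filename): sorting makes identical lines consecutive, so uniq only has to compare each line with its immediate predecessor, exactly as described above.

```shell
# Sample file with duplicate whole lines.
printf '%s\n' 213332213 122133411 213332213 122133411 987654321 > numbers.txt

# sort makes identical lines consecutive; uniq then drops any line
# that equals its immediate predecessor.
sort numbers.txt | uniq

# sort -u does the same in one step.
sort -u numbers.txt
```

For files too large for memory, sort(1) already does an external merge sort on temporary files, so this scales well.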
- - - - -
When you saw "all those spinning tape-drives" in campy old sci-fi movies, that's what they were supposedly doing. They used tape-sort algorithms to sort the contents of an input tape, then merged them with already-sorted master tapes. Very large amounts of data can be efficiently processed in this way, with little memory.
A generation before that, punched cards were used to do the same thing, and in the same way.
Last edited by sundialsvcs; 09-29-2017 at 07:12 AM.