I have a huge file with random numbers, such as:
122133411
213332213
and so on...
I used sed '/something/d' to delete some lines with specific occurrences, but it turned out I was looking at 90x7 different commands to do what I want, which is to delete every line in which any digit occurs a 4th time. Can you help me? Thanks in advance...
A lot of languages can do this.
The basic steps would be to take each line, count the number of occurrences of a specific character in that line, and check if it's too many; if it is, don't print the line.
Two examples with Python: if '3' occurs fewer than 2 times, print it out.
Code:
with open('file.txt') as file:
    for line in file:
        if not line.count('3') >= 2:
            print(line, end='')

with open('file.txt') as file:
    print(*(line for line in file if not line.count('3') >= 2), sep='', end='')
I am sorry, I just started learning Python yesterday, actually... I have no idea how to implement this code or work with it. Can you help me out? I mean, if a string has 123421131 I want it deleted for having four "1"s, and the same with any other digit...
Ah, I thought you meant the 4th repetition of the whole line.
Detecting the 4th repetition within a line can be done with a regular expression that uses a backreference (repeated 3 times after the capture), in sed or grep.
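A sketch of that approach (assuming grep with ERE support via -E; file names here are stand-ins for the OP's file): capture any digit, require three further occurrences of that same digit later in the line, and invert the match with -v so those lines are dropped.

```shell
# Sample input: the 4th occurrence of '1' (line 1) and of '3' (line 2)
# should get those lines deleted; 123456789 has no repeats and survives.
printf '%s\n' 122133411 213332213 123456789 > file.txt

# ([0-9]) captures any digit; (.*\1){3} demands three further
# occurrences of that same digit, i.e. a 4th occurrence in total.
# -v inverts the match, so matching lines are deleted.
grep -vE '([0-9])(.*\1){3}' file.txt
# prints: 123456789
```

The same pattern works as a delete command in sed: sed -E '/([0-9])(.*\1){3}/d' file.txt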
I think the OP had better sit down and clearly define the requirements.
Maybe, just maybe, it isn't just the first character. Or maybe it is. Who knows.
Code:
echo 987654321 122133411 213332213 123456789 | gawk 'BEGIN { FS = "" ; RS = " "}
{drop = 0
for (i = 1; i <= NF; i++) {a[$i]++}
for (i in a) if (a[i] > 3) drop = 1
if (drop != 1) print $0
delete a
}'
If you have RS=" " then the ending \n from the echo will be processed as a character (in effect it only outputs an extra line feed).
So either use echo -n, or allow \n in RS; that also makes it more versatile because it allows the multi-line input as in post #1.
Last but not least, the awk code can be simplified: increment and test at the same time.
Code:
echo 987654321 122133411 213332213 123456789 | gawk 'BEGIN { FS = "" ; RS = "[ \n]"}
{drop = 0
for (i = 1; i <= NF; i++) if (++a[$i] >= 4) drop = 1
if (!drop) print
delete a
}'
If you want to remove "occurrences of a whole line," in a very large file, consider sorting the file. Now, all occurrences of the same value will be consecutive, and identifying/removing duplicates is trivial: you need only compare a record to its immediate predecessor. It is equally trivial to recognize gaps, to "merge" identically-sorted files, and so on.
Sorting is a very heavily-studied group of algorithms (hence Dr. Knuth's Sorting and Searching), and can be performed very rapidly.
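A sketch of the sort-then-compare idea with standard tools (numbers.txt is a stand-in filename): sorting makes identical lines consecutive, so uniq only has to compare each line with its immediate predecessor, exactly as described above.

```shell
# Sample file with duplicate whole lines.
printf '%s\n' 213332213 122133411 213332213 122133411 987654321 > numbers.txt

# sort makes identical lines consecutive; uniq then drops any line
# that equals its immediate predecessor.
sort numbers.txt | uniq

# sort -u does the same in one step.
sort -u numbers.txt
```

For files too large for memory, sort(1) already does an external merge sort on temporary files, so this scales well.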
- - - - -
When you saw "all those spinning tape-drives" in campy old sci-fi movies, that's what they were supposedly doing. They used tape-sort algorithms to sort the contents of an input tape, then merged them with already-sorted master tapes. Very large amounts of data can be efficiently processed in this way, with little memory.
A generation before that, punched cards were used to do the same thing, and in the same way.
Last edited by sundialsvcs; 09-29-2017 at 07:12 AM.