Deleting a line with the Nth occurrence of anything.
I have a huge file with random numbers, such as:
122133411 213332213 and so on... I used sed '/something/d' to delete some lines with specific occurrences, but it turned out I was looking at 90x7 different commands to do what I want, which is to delete every line in which any digit occurs a 4th time. Can you help me? Thanks in advance... |
A lot of languages can do this.
The basic steps would be to take each line, count the number of occurrences of a specific character in that line, and check whether that count is too high; if it is, don't print the line. An example with Python: if '3' occurs fewer than 2 times, print the line out. Code:
with open('file.txt') as file:
    for line in file:
        # count '3's in the line; print only if there are fewer than 2
        if line.count('3') < 2:
            print(line, end='')
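Much the same filter can be written as a shell one-liner; a sketch using awk's gsub(), whose return value is the number of replacements it made (counting the digit 3 here, as in the Python example): Code:
awk 'gsub(/3/, "3") < 2' file.txt |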
Delete the 4th occurrence of each number
Code:
awk '++s[$1]!=4' file
Code:
awk '++s[$1]<4' file
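The first deletes only the 4th occurrence of each value (a 5th or later occurrence still prints); the second keeps just the first three occurrences of each value. A quick demonstration of the first form: Code:
printf '%s\n' 11 22 11 11 22 11 | awk '++s[$1]!=4'
Only the fourth 11 line is dropped; everything else prints. |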
Ah, I thought you meant the 4th repetition of the whole line.
Detecting the 4th repetition within a line can be done with a regular expression that captures a character and requires it (via backreferences) to appear 4 times in total, in sed or grep. Code:
grep -v '\(.\).*\1.*\1.*\1' file
Code:
sed '/\(.\).*\1.*\1.*\1/d' file
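A quick check with the numbers from post #1 plus one line with no repeats: Code:
printf '%s\n' 122133411 213332213 123456789 | grep -v '\(.\).*\1.*\1.*\1'
Only 123456789 survives, because a digit occurs four times in each of the other two lines. |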
I think the OP had better sit down and clearly define the requirements.
Maybe, just maybe, it isn't just the first character. Or maybe it is. Who knows. |
gawk has a feature that may be useful. https://www.gnu.org/software/gawk/ma...aracter-Fields
Code:
echo 987654321 122133411 213332213 123456789 | gawk 'BEGIN { FS = "" ; RS = " " }
{ delete cnt                              # with FS = "" every character is its own field
  for (i = 1; i <= NF; i++) cnt[$i]++     # count each character in the record
  for (c in cnt) if (cnt[c] >= 4) next    # skip the record if any character hit 4
  print }'
|
deleted as requested
PS Thanks to MadeInGermany for the improvement below. |
allend, please delete your duplicate post!
If you have RS=" " then the ending \n from the echo will be processed as a character (in effect it only outputs an extra line feed). So either use echo -n, or allow \n in RS; that also makes it more versatile because it then handles multi-line input as in post #1. Last but not least, the awk code can be simplified: increment and test at the same time. Code:
echo 987654321 122133411 213332213 123456789 | gawk 'BEGIN { FS = "" ; RS = "[ \n]" }
{ delete cnt
  for (i = 1; i <= NF; i++) if (++cnt[$i] == 4) next   # increment and test at once
  print }'
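With this input it should print only 987654321 and 123456789, since a digit occurs four times in each of the other two numbers. |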
If you want to remove "occurrences of a whole line," in a very large file, consider sorting the file. Now, all occurrences of the same value will be consecutive, and identifying/removing duplicates is trivial: you need only compare a record to its immediate predecessor. It is equally trivial to recognize gaps, to "merge" identically-sorted files, and so on.
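The compare-to-predecessor step is itself a one-liner; a sketch in the same awk style as above, assuming no empty lines (uniq does the same job): Code:
sort file | awk '$0 != prev { print; prev = $0 }'
After sorting, duplicates are adjacent, so printing a line only when it differs from its predecessor removes them.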
Sorting is a very heavily studied group of algorithms (hence Dr. Knuth's Sorting and Searching), and can be performed very rapidly.

When you saw "all those spinning tape-drives" in campy old sci-fi movies, that's what they were supposedly doing: they used tape-sort algorithms to sort the contents of an input tape, then merged it with already-sorted master tapes. Very large amounts of data can be efficiently processed in this way, with little memory. A generation before that, punched cards were used to do the same thing, and in the same way. |
Thanks, everyone!
|