LinuxQuestions.org
Linux - Newbie This Linux forum is for members that are new to Linux.
Just starting out and have a question? If it is not in the man pages or the how-to's this is the place!

Old 09-27-2017, 04:49 PM   #1
Eros He
LQ Newbie
 
Registered: Sep 2017
Posts: 4

Rep: Reputation: Disabled
Question: Deleting a line with the Nth occurrence of anything.


I have a huge file with random numbers, such as:
122133411
213332213
and so on...
I used sed '/something/d' to delete some lines with specific occurrences, but it turned out I was looking at 90x7 different commands to do what I want, which is to delete every line in which there is a 4th occurrence of any number. Can you help me? Thanks in advance...

Last edited by Eros He; 09-27-2017 at 04:55 PM.
 
Old 09-27-2017, 06:25 PM   #2
Sefyir
Member
 
Registered: Mar 2015
Distribution: Linux Mint
Posts: 607

Rep: Reputation: 301Reputation: 301Reputation: 301Reputation: 301
A lot of languages can do this.
The basic steps would be to take each line, count the number of occurrences of a specific character in that line, and skip printing the line if the count is too high.

Two examples with Python: print the line only if 3 occurs fewer than 2 times.

Code:
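# Version 1: explicit loop; keep lines where '3' occurs fewer than 2 times
with open('file.txt') as file:
    for line in file:
        if line.count('3') < 2:
            print(line, end='')  # line already ends with a newline

# Version 2: the same filter as a generator expression;
# sep='' and end='' avoid stray spaces and an extra trailing newline
with open('file.txt') as file:
    print(*(line for line in file if line.count('3') < 2), sep='', end='')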
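
The same steps can be extended from one fixed character to any character, which is what post #1 seems to ask for. This is only a rough sketch: the names has_char_repeated and keep_lines and the default limit of 4 are my own choices for illustration, and the sample lines are the ones from post #1 plus one made-up line.

```python
from collections import Counter

def has_char_repeated(line, limit=4):
    """Return True if any single character occurs `limit` or more times."""
    # Strip the newline so it is never counted as a character.
    counts = Counter(line.rstrip('\n'))
    return any(n >= limit for n in counts.values())

def keep_lines(lines, limit=4):
    """Yield only the lines in which no character reaches `limit` occurrences."""
    for line in lines:
        if not has_char_repeated(line, limit):
            yield line

if __name__ == '__main__':
    # Hypothetical sample data; replace with open('file.txt') for a real file.
    sample = ['122133411\n', '213332213\n', '987654321\n']
    print(*keep_lines(sample), sep='', end='')
```

With a real file, passing the open file object to keep_lines should work the same way, since iterating a file yields its lines one at a time.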
 
1 members found this post helpful.
Old 09-27-2017, 06:29 PM   #3
MadeInGermany
Senior Member
 
Registered: Dec 2011
Location: Simplicity
Posts: 1,293

Rep: Reputation: 595Reputation: 595Reputation: 595Reputation: 595Reputation: 595Reputation: 595
Delete the 4th occurrence of each number
Code:
awk '++s[$1]!=4' file
Delete the 4th, 5th, ... occurrence of each number
Code:
awk '++s[$1]<4' file
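
For comparison, roughly the same idea as the second awk one-liner in Python. This is a sketch under assumptions: it counts the whole line rather than the first field ($1), and drop_from_nth is just a name I made up.

```python
from collections import defaultdict

def drop_from_nth(lines, n=4):
    """Yield a line while it has been seen fewer than n times; drop later copies."""
    seen = defaultdict(int)
    for line in lines:
        key = line.rstrip('\n')   # compare lines without the trailing newline
        seen[key] += 1            # increment first, like ++s[$1] in awk
        if seen[key] < n:
            yield line

if __name__ == '__main__':
    sample = ['7\n', '7\n', '7\n', '7\n', '7\n', '8\n']
    print(*drop_from_nth(sample), sep='', end='')
```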
 
1 members found this post helpful.
Old 09-28-2017, 02:42 PM   #4
Eros He
LQ Newbie
 
Registered: Sep 2017
Posts: 4

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by Sefyir View Post
A lot of languages can do this.
The basic steps would be to take each line, count the number of occurrences of a specific character in that line, and skip printing the line if the count is too high.

Two examples with Python: print the line only if 3 occurs fewer than 2 times.

Code:
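# Version 1: explicit loop; keep lines where '3' occurs fewer than 2 times
with open('file.txt') as file:
    for line in file:
        if line.count('3') < 2:
            print(line, end='')  # line already ends with a newline

# Version 2: the same filter as a generator expression;
# sep='' and end='' avoid stray spaces and an extra trailing newline
with open('file.txt') as file:
    print(*(line for line in file if line.count('3') < 2), sep='', end='')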
I am sorry, I just started learning Python yesterday, actually. I have no idea how to implement this code or work with it. Can you help me out? I mean, if a string is 123421131, I want it deleted for having four "1"s, and the same goes for any other number...
 
Old 09-28-2017, 04:08 PM   #5
MadeInGermany
Senior Member
 
Registered: Dec 2011
Location: Simplicity
Posts: 1,293

Rep: Reputation: 595Reputation: 595Reputation: 595Reputation: 595Reputation: 595Reputation: 595
Ah, I thought you meant the 4th repetition of the whole line.
Detecting a 4th occurrence of a character within a line can be done with a regular expression: one capture group followed by three backreferences to it, in sed or grep.
Code:
grep -v '\(.\).*\1.*\1.*\1' file
Code:
sed '/\(.\).*\1.*\1.*\1/d' file
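
The same pattern should carry over to Python's re module, which uses unescaped parentheses for the capture group. A minimal sketch; the names FOURTH_OCCURRENCE and filter_lines are only for illustration.

```python
import re

# One capture group followed by three backreferences: the captured
# character has to appear at least four times somewhere in the line.
FOURTH_OCCURRENCE = re.compile(r'(.).*\1.*\1.*\1')

def filter_lines(lines):
    """Return the lines that do not contain any character four or more times."""
    return [line for line in lines if not FOURTH_OCCURRENCE.search(line)]

if __name__ == '__main__':
    sample = ['122133411', '213332213', '987654321', '123421131']
    print(filter_lines(sample))
```

By default "." does not match a newline in Python, so a trailing newline on each line is never taken as the repeated character.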
 
Old 09-28-2017, 06:57 PM   #6
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 18,493

Rep: Reputation: 3099Reputation: 3099Reputation: 3099Reputation: 3099Reputation: 3099Reputation: 3099Reputation: 3099Reputation: 3099Reputation: 3099Reputation: 3099Reputation: 3099
I think the OP had better sit down and clearly define the requirements.
Maybe, just maybe, it isn't just the first character. Or maybe it is. Who knows.
 
Old 09-29-2017, 06:58 AM   #7
allend
LQ 5k Club
 
Registered: Oct 2003
Location: Melbourne
Distribution: Slackware-current
Posts: 5,363

Rep: Reputation: 1980Reputation: 1980Reputation: 1980Reputation: 1980Reputation: 1980Reputation: 1980Reputation: 1980Reputation: 1980Reputation: 1980Reputation: 1980Reputation: 1980
gawk has a feature that may be useful. https://www.gnu.org/software/gawk/ma...aracter-Fields
Code:
echo 987654321 122133411 213332213 123456789 | gawk 'BEGIN { FS = "" ; RS = " "}
{drop = 0
for (i = 1; i <= NF; i++) {a[$i]++}
for (i in a) if (a[i] > 3) drop = 1
if (drop != 1) print $0
delete a
}'
produces
Quote:
987654321
123456789
 
Old 09-29-2017, 06:58 AM   #8
allend
LQ 5k Club
 
Registered: Oct 2003
Location: Melbourne
Distribution: Slackware-current
Posts: 5,363

Rep: Reputation: 1980Reputation: 1980Reputation: 1980Reputation: 1980Reputation: 1980Reputation: 1980Reputation: 1980Reputation: 1980Reputation: 1980Reputation: 1980Reputation: 1980
deleted as requested
PS Thanks to MadeInGermany for the improvement below.

Last edited by allend; 09-29-2017 at 09:00 AM.
 
Old 09-29-2017, 07:22 AM   #9
MadeInGermany
Senior Member
 
Registered: Dec 2011
Location: Simplicity
Posts: 1,293

Rep: Reputation: 595Reputation: 595Reputation: 595Reputation: 595Reputation: 595Reputation: 595
allend, please delete your duplicate post!

If you have RS=" " then the ending \n from the echo will be processed as a character (in effect it only outputs an extra line feed).
So either use echo -n, or allow \n in RS, which also makes it more versatile because it allows the multi-line input as in post #1.

Last but not least, the awk code can be simplified: increment and test at the same time.
Code:
echo 987654321 122133411 213332213 123456789 | gawk 'BEGIN { FS = "" ; RS = "[ \n]"}
{drop = 0
for (i = 1; i <= NF; i++) if (++a[$i] >= 4) drop = 1
if (!drop) print
delete a
}'
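
For anyone following along in Python, the "increment and test at the same time" idea might look like the sketch below. Assumptions: has_fourth_occurrence and keep_tokens are names I invented, and str.split() splits on any whitespace, which is slightly broader than RS = "[ \n]".

```python
def has_fourth_occurrence(token, limit=4):
    """Count characters one by one and stop as soon as one reaches `limit`."""
    counts = {}
    for ch in token:
        counts[ch] = counts.get(ch, 0) + 1   # increment ...
        if counts[ch] >= limit:              # ... and test in the same pass
            return True
    return False

def keep_tokens(text, limit=4):
    """Split on whitespace and keep tokens where no character reaches `limit`."""
    return [tok for tok in text.split() if not has_fourth_occurrence(tok, limit)]

if __name__ == '__main__':
    data = '987654321 122133411 213332213 123456789\n'
    print('\n'.join(keep_tokens(data)))
```

Returning early means a long token is not scanned to the end once a 4th occurrence has been found.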
 
2 members found this post helpful.
Old 09-29-2017, 08:10 AM   #10
sundialsvcs
LQ Guru
 
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 9,078
Blog Entries: 4

Rep: Reputation: 3179Reputation: 3179Reputation: 3179Reputation: 3179Reputation: 3179Reputation: 3179Reputation: 3179Reputation: 3179Reputation: 3179Reputation: 3179Reputation: 3179
If you want to remove "occurrences of a whole line," in a very large file, consider sorting the file. Now, all occurrences of the same value will be consecutive, and identifying/removing duplicates is trivial: you need only compare a record to its immediate predecessor. It is equally trivial to recognize gaps, to "merge" identically-sorted files, and so on.

Sorting is a very heavily-studied group of algorithms (hence Dr. Knuth's Sorting and Searching), and can be performed very rapidly.
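
The compare-with-predecessor step described above can be sketched in a few lines of Python. This is only an illustration: dedupe_sorted is a made-up name, and sorted() holds everything in memory, so a truly huge file would need an external sort first.

```python
def dedupe_sorted(lines):
    """Sort the lines, then keep a line only if it differs from its predecessor."""
    result = []
    previous = None
    for line in sorted(lines):
        if line != previous:      # a duplicate is always adjacent after sorting
            result.append(line)
        previous = line
    return result

if __name__ == '__main__':
    sample = ['213332213', '122133411', '213332213', '987654321', '122133411']
    print(dedupe_sorted(sample))
```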

- - - - -

When you saw "all those spinning tape-drives" in campy old sci-fi movies, that's what they were supposedly doing. They used tape-sort algorithms to sort the contents of an input tape, then merged them with already-sorted master tapes. Very large amounts of data can be efficiently processed in this way, with little memory.

A generation before that, punched cards were used to do the same thing, and in the same way.

Last edited by sundialsvcs; 09-29-2017 at 08:12 AM.
 
1 members found this post helpful.
Old 09-30-2017, 12:10 PM   #11
Eros He
LQ Newbie
 
Registered: Sep 2017
Posts: 4

Original Poster
Rep: Reputation: Disabled
Thanks, everyone!
 
  

