LinuxQuestions.org


pissed_budgie 10-04-2010 10:55 PM

Script to remove lines in a file with more than "x" instances of any character ?
 
Hi,

I'm looking for a script (bash, python, perl etc) or even a one liner (sed, awk etc) that can take a set of files and remove any line that has more than "x" instances of any character (case sensitive). I have been doing a lot of searching and can only come up with examples of how to remove blank lines, lines that start with a certain character or lines that contain a certain string.
This will be used on a system running a Kubuntu derivative.

As a very poor and basic example, I would like to take files that contain lines like:

Code:

ABC123#()
AbAcA123#
#AB32(1)C
AAABC123#
AAaBC123#
Aabcbcb##
#ab##c231

and end up with the files only containing the lines:

Code:

ABC123#()
#AB32(1)C
AAaBC123#

if I tell the script that 2 is the maximum number of times any character can appear in any line.

I hope that makes sense.

I know this must be possible, but for the life of me I cannot find even an example that will lead me in the right direction or better yet a piece of code I can use.

Thank you for taking a look at my post and I hope it's not me missing an obvious way of doing this.

grail 10-04-2010 11:22 PM

So are you telling it which characters to look for or only that it cannot contain more than 2 (for example) of any character?

pissed_budgie 10-04-2010 11:41 PM

it cannot contain more than 2 of any character.

AAab = keep (because case sensitive)
AAAb = delete (because of 3 A's)
AbCb = keep (only 2 chars the same)
#b## = delete (because of 3 #'s)

Thanks for the interest.

Kenhelm 10-05-2010 12:05 AM

Try
Code:

n=2
echo 'ABC123#()
AbAcA123#
#AB32(1)C
AAABC123#
AAaBC123#
Aabcbcb##
#ab##c231' | grep -Ev "(.)(.*\1){$n}"

ABC123#()
#AB32(1)C
AAaBC123#
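
For anyone wondering why that pattern works, here is a small annotated sketch. `(.)` captures any one character and `(.*\1){$n}` requires the same character to show up `$n` more times later in the line, so the pattern matches lines where some character occurs at least n+1 times; `-v` then drops those lines. The function name `keep_lines` is made up for illustration, and back-references in `grep -E` are assumed to be available (they are a GNU extension, not part of POSIX ERE).

```shell
#!/bin/bash
# keep_lines N: read lines on stdin and print those in which no character
# occurs more than N times (hypothetical helper name, for illustration).
keep_lines() {
    local n=$1
    # (.)        capture any single character
    # (.*\1){n}  followed by n more occurrences of that same character
    # -v         print only the lines that do NOT match
    grep -Ev "(.)(.*\1){$n}"
}

# The four lines from the clarification earlier in the thread.
printf '%s\n' 'AAab' 'AAAb' 'AbCb' '#b##' | keep_lines 2
```

With n=2 this should keep `AAab` and `AbCb` and drop the two lines that contain a character three times.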


ghostdog74 10-05-2010 12:10 AM

Code:

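# FS is empty so each character becomes its own field
# (works in gawk; POSIX leaves an empty FS unspecified)
awk -vFS= '{
    for(i=1;i<=NF;i++){
      a[$i]++;                      # count occurrences of this character
      if(a[$i]>2){ f=1; break }     # more than 2 of one character: flag the line
    }
    delete a                        # reset the counts for the next line
    if(f){f=0;next}                 # skip flagged lines
}1' file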


grail 10-05-2010 12:42 AM

Nice one Ken :)

pissed_budgie 10-05-2010 01:30 AM

Quote:

Originally Posted by Kenhelm (Post 4117911)
Try
Code:

n=2
echo 'ABC123#()
AbAcA123#
#AB32(1)C
AAABC123#
AAaBC123#
Aabcbcb##
#ab##c231' | grep -Ev "(.)(.*\1){$n}"

ABC123#()
#AB32(1)C
AAaBC123#


This works perfectly for the small example I gave, but the files are too large and too numerous to do by hand like this.
Would it be possible to make it so I can:

script.sh -n 2 -f *.txt

and have it process all the files -f *.txt
have the n input as the script is run as this number can change depending on the files processed
modify the existing files or create new ones with the same name and delete the old ones ?

I know I have a real cheek and am probably pushing my luck asking for that, but it is obvious that you could do this far easier than I could.

Really nice simple solution, thank you so much for what you have given me.

pissed_budgie 10-05-2010 01:33 AM

Quote:

Originally Posted by ghostdog74 (Post 4117916)
Code:

awk -vFS= '{
    for(i=1;i<=NF;i++){
      a[$i]++;
      if(a[$i]>2){ f=1; break }
    }
    delete a
    if(f){f=0;next}
}1' file


Thank you.

I tried this, but although I could see it running through the file line by line, it neither changed the file nor created a new file with only the required lines.
Sorry

grail 10-05-2010 02:15 AM

So you asked for a solution and a few were provided and then when needing to have it run on a large amount of data you want someone to do the next step too?

Remember this is supposed to be a learning experience. What have you tried in the way of implementing the grep solution on multiple files?
Not that I would recommend it as it may never finish but grep itself has a -r option for recursive looking.

As for the awk:
Quote:

I could see it running through the file line by line
Yes, it prints the lines to be kept, so redirect the output to a new file and you will have your data.
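
A rough sketch of that redirect, under some assumptions: awk writes the kept lines to standard output, so send them to a temporary file and then move it over the original. The file name `data.txt` and the demo input are placeholders, and the awk body is a variant of the one posted above, rewritten with `substr()` so it does not depend on an empty `FS`; it is not ghostdog74's exact code.

```shell
#!/bin/bash
# Sketch: filter one file "in place" by redirecting awk's output to a
# temporary file and then replacing the original with it.
workdir=$(mktemp -d) || exit 1
file="$workdir/data.txt"                  # placeholder file name
printf '%s\n' 'ABC123#()' 'AbAcA123#' 'AAaBC123#' > "$file"   # demo input

tmp=$(mktemp) || exit 1
# n is the maximum number of times any one character may appear in a line.
awk -v n=2 '{
    split("", count)                      # clear the per-line counters
    for (i = 1; i <= length($0); i++)
        if (++count[substr($0, i, 1)] > n) next   # too many: drop the line
    print
}' "$file" > "$tmp" && mv "$tmp" "$file"

cat "$file"
```

Note that the replaced file takes on the temporary file's permissions, which `mktemp` normally makes private to the owner, so adjust with `chmod` if that matters.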

pissed_budgie 10-05-2010 02:37 AM

Quote:

Originally Posted by grail (Post 4118025)
So you asked for a solution and a few were provided and then when needing to have it run on a large amount of data you want someone to do the next step too?

Remember this is supposed to be a learning experience. What have you tried in the way of implementing the grep solution on multiple files?
Not that I would recommend it as it may never finish but grep itself has a -r option for recursive looking.

As for the awk:

Yes, it prints the lines to be kept, so redirect the output to a new file and you will have your data.

Sorry about that, no harm meant by it.
I see what you are saying and will work out the finer refinements myself.
Thanks for pointing out the obvious, which I totally missed.

And a big thank you to the people that provided me with a code snippet to build from.

grail 10-05-2010 04:09 AM

No probs ... just post when you get stuck :)

I would also suggest looking at something like:
Code:

while IFS= read -r line
do
    <your stuff here>
done < <(find <where you're looking> -type f -name "what you're looking for")
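
Putting the pieces from this thread together, one possible shape for the requested `script.sh -n 2 -f *.txt` is sketched below. It is an illustration built on assumptions, not something any poster supplied: the function name `filter_files` is hypothetical, the `-f` flag is dropped because the shell expands `*.txt` into several arguments that a single option cannot hold, and overwriting the originals is just one of the two behaviours asked for.

```shell
#!/bin/bash
# filter_files N FILE...: rewrite each FILE, keeping only the lines in
# which no character occurs more than N times (hypothetical helper).
filter_files() {
    local n=$1 file tmp
    shift
    for file in "$@"; do
        [ -f "$file" ] || { echo "skipping $file: not a regular file" >&2; continue; }
        tmp=$(mktemp) || return 1
        # grep exits with 1 when no line survives, which is fine here;
        # any other non-zero status is treated as a real failure.
        grep -Ev "(.)(.*\1){$n}" "$file" > "$tmp" || [ $? -eq 1 ] || { rm -f "$tmp"; return 1; }
        mv "$tmp" "$file"
    done
}

# Parse "-n N" (default 2) and treat every remaining argument as a file.
n=2
while getopts 'n:' opt; do
    case $opt in
        n) n=$OPTARG ;;
        *) echo "usage: $0 [-n max] file..." >&2; exit 2 ;;
    esac
done
shift $((OPTIND - 1))
filter_files "$n" "$@"
```

Saved as `script.sh`, it could be run as `./script.sh -n 2 *.txt`. Try it on copies first, since the originals are replaced.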


vinaytp 10-05-2010 10:46 AM

Hi pissed_budgie,

In perl

Code:

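#!/usr/bin/perl
use strict;
use warnings;

# Usage: perl test.pl file n
my ($file, $n) = @ARGV;
open(my $fh, '<', $file) or die "Cannot open $file: $!";
while (<$fh>)
{
        chomp;
        # Print the line unless some character occurs more than $n times
        if (!/(.)(.*\1){$n}/)
        {
        print "$_\n";
        }
}
close($fh);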

Execute test.pl by passing arguments
Code:

perl test.pl file 2

Warm Regards,

pissed_budgie 10-08-2010 08:16 PM

Thanks for all the replies and code snippets, I can't believe how simple this turned out to be.


All times are GMT -5.