LinuxQuestions.org - [SOLVED] Need help writing a script to remove lines in which >X% of the characters are dashes

- Linux - Newbie (https://www.linuxquestions.org/questions/linux-newbie-8/)

- - Need help writing a script to remove lines in which >X% of the characters are dashes (https://www.linuxquestions.org/questions/linux-newbie-8/need-help-writing-a-script-to-remove-lines-in-which-x-of-the-characters-are-dashes-772780/)

Need help writing a script to remove lines in which >X% of the characters are dashes

Hi all,

I have another request for scripting help :D

I have a folder with many input files formatted like this:

Code:

>14432|LGIG|186221

MVMISVLAMALNRITAAEKR

>14432|CAP1|21057

MVRVNVLADALKSI-TAEKR

>14432|HROB|156827

--RMNVLADALKSICNAEKR

>14432|NVEC|159589

MVRVNVLNDALNSICNAEKR

>14432|NEOM|short

----------------AEKR

Each pair of lines represents one amino acid sequence where the first line is the header and the second line is the sequence data. Dashes represent missing data (or gaps). I would like to write a script that searches through each file and removes sequences that contain >X% missing data.

My first thought was to 1) count the number of characters per line, 2) count the number of characters per line that aren't dashes, 3) combine those outputs into one file with two columns, 4) divide $2 by $1, and 5) remove lines in the original file in which that output value falls below the threshold. I can't figure out how to do step 5, though. Am I making this harder than it is? Any suggestions would be greatly appreciated.

Here's the code for where I was heading.

Code:

for file in *.fa

do

while read line; do echo "$line" | tr -cd  [:alnum:][:graph:] | wc -c; done < 14432.fa > output1.txt

#prints the number of characters on each line



while read line; do echo "$line" | tr -cd  [:alnum:][:graph:] | sed 's/-//g' | wc -c; done < 14432.fa > output2.txt

#prints the number of characters on each line that aren't -



paste output1.txt output2.txt > output3.txt



done

awk '{ score = $2 / $1; print score }' output3.txt



#5?

Thanks a ton!
Kevin

Hmm.. Will the data sequences you wish to remove, always have the same number of ----'s in them?

And, do you want to remove the header for the removed data sequence as well? (probably yes)

And that's it? If so, I think you're over-complicating the task, but I commend your progress so far.

I'll fiddle with it in a few minutes, if I'm understanding the problem correctly.

Sasha

We just MUST feed this into the SED vs. AWK battle......;)

Soln1 (remove bad seq recs only)

Code:

#!/usr/bin/perl -w

use strict;

my (

    $limit, $len,  $rec, $cnt

  );



# Remove lines with eg 50 % '---'

$limit = .5;

open( FILE, "<", "t.t" ) or

      die "Can't open file: t.t: $!\n";



while ( defined ( $rec = <FILE> ) )

{

  # Remove unwanted chars

  chomp $rec;                # newline

    # if rec starts with '-'

    if( $rec =~ /^-/ )

    {

        $len = length($rec);

        $cnt = ($rec =~ tr/-//);  # count num of '-'s

        next if( $cnt/$len > $limit );  # pcent > limit; skip rec

    }

    print "$rec\n";

}



close(FILE) or die "Can't close file: t.t: $!\n";

Soln 2 (remove bad seq rec and associated hdr rec as per GrapefruiTgirl)

Code:

#!/usr/bin/perl -w

use strict;

my (

    $limit, $len,  $rec, $cnt, $hdr_rec

  );



# Remove lines with eg 50 % '---'

$limit = .5;

open( FILE, "<", "t.t" ) or

      die "Can't open file: t.t: $!\n";

while ( defined ( $rec = <FILE> ) )

{

    # Remove unwanted chars

    chomp $rec;                # newline



    # check for header rec

    if( $rec =~ /^>/ )

    {

        $hdr_rec = $rec;

        next;

    }



    # Check for sequence rec

    if( $rec =~ /^-/ )

    {

        $len = length($rec);

        $cnt = ($rec =~ tr/-//);  # count num of '-'s

        next if( $cnt/$len > $limit );  # pcent > limit; skip rec

    }

    print "$hdr_rec\n";

    print "$rec\n";

}



close(FILE) or die "Can't close file: t.t: $!\n";

based on copy of OP's data

Quote:

Originally Posted by pixellany (Post 3776273)

We just MUST feed this into the SED vs. AWK battle......;)

Stirring up the natives again huh pixel...

strip dna sequences.

LOL.. Battle indeed!

How about something somewhat simple; Here's some tac and sed:

I made the input file called "dna". Into it, I put a bunch of your sample file from post 1, just copied and pasted it in there. The script is called "dnaclean"

dna before operation:

Quote:

dna after operation:

Quote:

dnaclean:

Code:

#!/bin/bash

# script is called "dnaclean"

# strips headers and sequences with four dashes.



input_filename="dna"



count=0

for i in $(tac $input_filename); do

element[$count]=$i

count=$((count+1))

done



for ((a=0;a<=$count;a++)); do

if [ "$(echo ${element[a]} | grep -- ----)" ]; then

sed -i "/${element[a]}/d;/${element[((a+1))]}/d" "$input_filename";

a=$((a+1))

fi;

done

It should be rather trivial to edit the script to operate on every file in a given directory.

PS - It just occurred to me that the dashes may not be necessarily grouped together, in which case this won't work. Maybe that's why you wanted to count the dashes? Ok, see below:

Code:

#!/bin/bash

# script is called "dnaclean"

# strips headers and sequences with three dashes.



input_filename="dna"

newfile="$input_filename-new"

rm $newfile



dashes=3



fields="$((dashes+1))"

count=0



for i in $(tac $input_filename); do

element[$count]=$i

count=$((count+1))

done



for ((a=0;a<=$count;a++)); do

b=$((a+1))

if [ ! "$(echo "${element[a]}" | gawk -v FS=- '{print NF}')" = "$fields" ]; then

echo ${element[a]} >> $newfile

echo ${element[b]} >> $newfile

fi;

a=$((a+1))

done



tac $newfile > $newfile-final

Sasha

PS - Ghostdog's script below is wonderfully compact -- so much so that I don't know how it works :scratch: -- there's no mention of the 'dash' in there. Hmmm.

gawk

Code:

gawk -vRS="\n>" '!/--+/&&NR>1{print RT,$0}NR==1&&!/--+/' file

This uses the GNU sed 'e' command to execute some bash commands in the sed pattern space.

Code:

X=50      # Allowable missing data %

bashcode='sequ=\1;dashes=${sequ//[^-]/};echo $(( ${#dashes}*100/${#sequ} > '$X' ))'

sed -rn "/>/{N;h; s:.*\n(.*):$bashcode:e; /0/{x;p}}" infile > outfile

EDIT
This should run faster as it restricts the bash code just to those sequences which have a '-' in them instead of doing the calculation for every sequence.

Code:

X=50      # Allowable missing data %

bashcode='sequ=\1;dashes=${sequ//[^-]/};echo $(( ${#dashes}*100/${#sequ} > '$X' ))'

sed -r "/>/{N; /-/{h; s:.*\n(.*):$bashcode:e; /1/d; /0/x}}" infile > outfile

I love this site.

Chris: Thank you for taking a whack at it. Solution 2 is close but there are some problems. Try it with the input below. It removes FAKE_100% and FAKE_LAST but that is all. Also, I fully admit I know nothing about perl so don't laugh... is there a way to do something like "for file in *.txt do..." with perl so I can automate the task?

Code:

>FAKE_0%

MVMISVLAMALNRITAAERR

>FAKE_100%

--------------------

>FAKE_60%

MVRMNVLA------------

>FAKE_55%

MVRMNVLAD-----------

>FAKE_50%

MVRMNVLADA----------

>FAKE_45%

MVRMNVLADAL---------

>FAKE_40%

MVMISVLAMALN--------

>FAKE_35%

MVMISVLAMALNR-------

>FAKE_30%

MVMISVLAMALNRI------

>FAKE_first

M-------------------

>FAKE_first_last

M------------------R

>FAKE_last

-------------------R

Sasha: The number of characters on the even numbered lines (lines NOT containing greater-than signs) varies from file to file but within a file, the number of characters on those lines should be the same in each sequence. When I delete a sequence I want to remove both the header (line containing the greater-than sign) and the sequence itself (line containing dashes and A-Z). You are right that the dashes might not always be found together and the number and pattern they make up is variable. The second script is good but it would be nice if I could set it at a percentage cutoff so I could run it on all files at once.

Ghostdog: When I ran your awk script on the example input file above it returned

Code:

>FAKE_0%

MVMISVLAMALNRITAAERR

I will try to wrap my head around this but for now is there an easy way to adjust the % dash cutoff?

Ken: I ran your sed script on the example input file above but it returned this:

Code:

sh: Bad substitution

sh: Bad substitution

sh: Bad substitution

sh: Bad substitution

sh: Bad substitution

sh: Bad substitution

sh: Bad substitution

sh: Bad substitution

sh: Bad substitution

sh: Bad substitution

sh: Bad substitution

I'll work on this one as well but if you know what the problem might be, please let me know.

Thanks!
Kevin

OK, here's the version which deals with the dashes as a % of the string, and removes string pairs with a larger % of dashes than the $percent variable.

Code:

#!/bin/bash

# script is called "dnaclean"

# strips headers and sequences with >X% of dashes.

percent=10



input_filename="dna"

# backup inputfile to .OLD

cp $input_filename "$input_filename.OLD"

# temporary filename:

newfile="$input_filename-new"

# make sure it don't exist yet:

rm $newfile 2>>/dev/null



# stick file data into array:

count=-1

for i in $(tac $input_filename); do

 count=$((count+1))

 element[$count]=$i

done



# run thru array checking percents of dashes:

a=-2

while [ $a -le $count ]; do

 a=$((a+2)); b=$((a+1))

 fields="$(echo "${element[a]}" | gawk -v FS=- '{print NF}')"

 [ "$fields" = "0" ] && fields="1"

 dashes=$((fields - 1)); elemnta=${element[a]}; strlen=${#elemnta}

 [ "$strlen" = "0" ] && break



 percent_of_string_are_dashes="$(echo "$dashes / $strlen * 100" | bc -l)"

 calculate_diff="$(echo "$percent_of_string_are_dashes > $percent" | bc -l)"

 if [ "$calculate_diff" = "0" ]; then

  echo ${element[a]} >> $newfile

  echo ${element[b]} >> $newfile

 fi;

done



# put data back in right order

# and put edited file in place of original

# and remove temp file:

if [ -f $newfile ]; then

 tac $newfile > $newfile-final

 mv -f $newfile-final $input_filename

else

mv -f "$input_filename.OLD" "$input_filename"

fi;

Cheers,
Sasha

Ah well, that data is significantly different ;)

Try this

Code:

#!/usr/bin/perl -w

use strict;

my (

    $infile, $limit, $len,  $rec, $cnt, $hdr_rec

  );



# Remove lines with eg 50 % '---'

$limit = .5;



for $infile ( @ARGV )

{

    open( FILE, "<", "$infile" ) or

          die "Can't open file: $infile: $!\n";

    while ( defined ( $rec = <FILE> ) )

    {

        # Remove unwanted chars

        chomp $rec;                # newline



        # check for header rec

        if( $rec =~ /^>/ )

        {

            $hdr_rec = $rec;

            next;

        }



        # Check sequence rec for '-'

        $len = length($rec);

        $cnt = ($rec =~ tr/-//);  # count num of '-'s

        next if( $cnt/$len > $limit );  # pcent > limit; skip rec



        print "$hdr_rec\n";

        print "$rec\n";

    }



    close(FILE) or die "Can't close file: $infile: $!\n";

}

called as

./lq2.pl t.t t1.t
or you can use wildcards, remembering not to include the program in the list eh?
:)

Quote:

Originally Posted by kmkocot;3777621

Ghostdog: When I ran your awk script on the example input file above it returned
[CODE

>FAKE_0%
MVMISVLAMALNRITAAERR
[/CODE]

I may have interpreted your output requirement wrongly, but isn't that what you want ? remove those sequences with missing data (dashes)? from your sample, the above doesn't have missing data and the rest is removed.

@Ghostdog; actually he wants to be able to specify a fraction/percentage of dashes that determines whether a pair of lines is removed. See the $limit var in my code; I've set it to 0.5 (ie 50%).

The code I posted runs without giving error messages when I try it on both these systems:
GNU bash, version 3.00.16(2)-release (i586-mandriva-linux-gnu)
GNU sed version 4.1.4
and
GNU bash, version 4.00.33(2)-release (i586-mandriva-linux-gnu)
GNU sed version 4.2
Which versions are you using?

The 'sh: Bad substitution' message might be from
${sequ//[^-]/}
or
${#dash}*100/${#sequ}

Try replacing
dashes=${sequ//[^-]/}
with
dashes=$(echo "$sequ" | tr -cd "-")

Code:

#!/bin/bash

X=50      # Allowable missing data %

bashcode='sequ=\1;dashes=$(echo "$sequ" | tr -cd "-");echo $(( ${#dashes}*100/${#sequ} > '$X' ))'

sed -r "/>/{N; /-/{h; s:.*\n(.*):$bashcode:e; /1/d; /0/x}}" infile > outfile

Proof that there is more than one way to skin a cat. All three of those work wonderfully. Once again, thank you very much for your help.