LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Linux - Newbie (https://www.linuxquestions.org/questions/linux-newbie-8/)
-   -   Need help writing a script to remove lines in which >X% of the characters are dashes (https://www.linuxquestions.org/questions/linux-newbie-8/need-help-writing-a-script-to-remove-lines-in-which-x-of-the-characters-are-dashes-772780/)

kmkocot 12-01-2009 04:05 PM

Need help writing a script to remove lines in which >X% of the characters are dashes
 
Hi all,

I have another request for scripting help :D

I have a folder with many input files formatted like this:
Code:

>14432|LGIG|186221
MVMISVLAMALNRITAAEKR
>14432|CAP1|21057
MVRVNVLADALKSI-TAEKR
>14432|HROB|156827
--RMNVLADALKSICNAEKR
>14432|NVEC|159589
MVRVNVLNDALNSICNAEKR
>14432|NEOM|short
----------------AEKR

Each pair of lines represents one amino acid sequence where the first line is the header and the second line is the sequence data. Dashes represent missing data (or gaps). I would like to write a script that searches through each file and removes sequences that contain >X% missing data.

My first thought was to 1) count the number of characters per line, 2) count the number of characters per line that aren't dashes, 3) combine those outputs into one file with two columns, 4) divide $2 by $1, and 5) remove lines in the original file in which that output value falls below the threshold. I can't figure out how to do step 5, though. Am I making this harder than it is? Any suggestions would be greatly appreciated.

Here's the code for where I was heading.
Code:

for file in *.fa
do
while read line; do echo "$line" | tr -cd  [:alnum:][:graph:] | wc -c; done < 14432.fa > output1.txt
#prints the number of characters on each line

while read line; do echo "$line" | tr -cd  [:alnum:][:graph:] | sed 's/-//g' | wc -c; done < 14432.fa > output2.txt
#prints the number of characters on each line that aren't -

paste output1.txt output2.txt > output3.txt

done
awk '{ score = $2 / $1; print score }' output3.txt

#5?

Thanks a ton!
Kevin

GrapefruiTgirl 12-01-2009 04:27 PM

Hmm.. Will the data sequences you wish to remove, always have the same number of ----'s in them?

And, do you want to remove the header for the removed data sequence as well? (probably yes)

And that's it? If so, I think you're over-complicating the task, but I commend your progress so far.

I'll fiddle with it in a few minutes, if I'm understanding the problem correctly.

Sasha

pixellany 12-01-2009 05:20 PM

We just MUST feed this into the SED vs. AWK battle......;)

chrism01 12-01-2009 09:43 PM

Soln1 (remove bad seq recs only)
Code:

#!/usr/bin/perl -w
use strict;
my (
    $limit, $len,  $rec, $cnt
  );

# Remove lines with eg 50 % '---'
$limit = .5;
open( FILE, "<", "t.t" ) or
      die "Can't open file: t.t: $!\n";

while ( defined ( $rec = <FILE> ) )
{
  # Remove unwanted chars
  chomp $rec;                # newline
    # if rec starts with '-'
    if( $rec =~ /^-/ )
    {
        $len = length($rec);
        $cnt = ($rec =~ tr/-//);  # count num of '-'s
        next if( $cnt/$len > $limit );  # pcent > limit; skip rec
    }
    print "$rec\n";
}

close(FILE) or die "Can't close file: t.t: $!\n";

Soln 2 (remove bad seq rec and associated hdr rec as per GrapefruiTgirl)
Code:

#!/usr/bin/perl -w
use strict;
my (
    $limit, $len,  $rec, $cnt, $hdr_rec
  );

# Remove lines with eg 50 % '---'
$limit = .5;
open( FILE, "<", "t.t" ) or
      die "Can't open file: t.t: $!\n";
while ( defined ( $rec = <FILE> ) )
{
    # Remove unwanted chars
    chomp $rec;                # newline

    # check for header rec
    if( $rec =~ /^>/ )
    {
        $hdr_rec = $rec;
        next;
    }

    # Check for sequence rec
    if( $rec =~ /^-/ )
    {
        $len = length($rec);
        $cnt = ($rec =~ tr/-//);  # count num of '-'s
        next if( $cnt/$len > $limit );  # pcent > limit; skip rec
    }
    print "$hdr_rec\n";
    print "$rec\n";
}

close(FILE) or die "Can't close file: t.t: $!\n";

based on copy of OP's data

exvor 12-01-2009 09:45 PM

Quote:

Originally Posted by pixellany (Post 3776273)
We just MUST feed this into the SED vs. AWK battle......;)

Stirring up the natives again huh pixel...

GrapefruiTgirl 12-01-2009 10:19 PM

strip dna sequences.
 
LOL.. Battle indeed!

How about something somewhat simple; Here's some tac and sed:

I made the input file called "dna". Into it, I put a bunch of your sample file from post 1, just copied and pasted it in there. The script is called "dnaclean"

dna before operation:

Quote:

sasha@reactor:~/test$ cat dna
>14432|CAP1|21057
MVRVNVLADALKSI-TAEKR
>14432|HROB|156827
--RMNVLADALKSICNAEKR
>14432|NVEC|159589
MVRVNVLNDALNSICNAEKR
>14432|CAP1|21057
MVRVNVLADALKSI-TAEKR
>14432|HROB|156827
--RMNVLADALKSICNAEKR
>14432|NVEC|159589
MVRVNVLNDALNSICNAEKR
>14432|NEOM|short
----------------AEKR
>14432|CAP1|21057
MVRVNVLADALKSI-TAEKR
>14432|HROB|156827
--RMNVLADALKSICNAEKR
>14432|NVEC|159589
MVRVNVLNDALNSICNAEKR
>14432|NEOM|short
----------------AEKR
>14432|CAP1|21057
MVRVNVLADALKSI-TAEKR
>14432|HROB|156827
--RMNVLADALKSICNAEKR
>14432|NVEC|159589
MVRVNVLNDALNSICNAEKR
>14432|NEOM|short
----------------AEKR
dna after operation:

Quote:

sasha@reactor:~/test$ sh dnaclean
sasha@reactor:~/test$ cat dna
>14432|CAP1|21057
MVRVNVLADALKSI-TAEKR
>14432|HROB|156827
--RMNVLADALKSICNAEKR
>14432|NVEC|159589
MVRVNVLNDALNSICNAEKR
>14432|CAP1|21057
MVRVNVLADALKSI-TAEKR
>14432|HROB|156827
--RMNVLADALKSICNAEKR
>14432|NVEC|159589
MVRVNVLNDALNSICNAEKR
>14432|CAP1|21057
MVRVNVLADALKSI-TAEKR
>14432|HROB|156827
--RMNVLADALKSICNAEKR
>14432|NVEC|159589
MVRVNVLNDALNSICNAEKR
>14432|CAP1|21057
MVRVNVLADALKSI-TAEKR
>14432|HROB|156827
--RMNVLADALKSICNAEKR
>14432|NVEC|159589
MVRVNVLNDALNSICNAEKR
sasha@reactor:~/test$
dnaclean:

Code:

#!/bin/bash
# script is called "dnaclean"
# strips headers and sequences with four dashes.

input_filename="dna"

count=0
for i in $(tac $input_filename); do
element[$count]=$i
count=$((count+1))
done

for ((a=0;a<=$count;a++)); do
if [ "$(echo ${element[a]} | grep -- ----)" ]; then
sed -i "/${element[a]}/d;/${element[((a+1))]}/d" "$input_filename";
a=$((a+1))
fi;
done

It should be rather trivial to edit the script to operate on every file in a given directory.

PS - It just occurred to me that the dashes may not be necessarily grouped together, in which case this won't work. Maybe that's why you wanted to count the dashes? Ok, see below:

Code:

#!/bin/bash
# script is called "dnaclean"
# strips headers and sequences with three dashes.

input_filename="dna"
newfile="$input_filename-new"
rm $newfile

dashes=3

fields="$((dashes+1))"
count=0

for i in $(tac $input_filename); do
element[$count]=$i
count=$((count+1))
done

for ((a=0;a<=$count;a++)); do
b=$((a+1))
if [ ! "$(echo "${element[a]}" | gawk -v FS=- '{print NF}')" = "$fields" ]; then
echo ${element[a]} >> $newfile
echo ${element[b]} >> $newfile
fi;
a=$((a+1))
done

tac $newfile > $newfile-final

Sasha

PS - Ghostdog's script below is wonderfully compact -- so much so that I don't know how it works :scratch: -- there's no mention of the 'dash' in there. Hmmm.

ghostdog74 12-01-2009 10:59 PM

gawk
Code:

gawk -vRS="\n>" '!/--+/&&NR>1{print RT,$0}NR==1&&!/--+/' file

Kenhelm 12-02-2009 02:59 AM

This uses the GNU sed 'e' command to execute some bash commands in the sed pattern space.
Code:

X=50      # Allowable missing data %
bashcode='sequ=\1;dashes=${sequ//[^-]/};echo $(( ${#dashes}*100/${#sequ} > '$X' ))'
sed -rn "/>/{N;h; s:.*\n(.*):$bashcode:e; /0/{x;p}}" infile > outfile

EDIT
This should run faster as it restricts the bash code just to those sequences which have a '-' in them instead of doing the calculation for every sequence.
Code:

X=50      # Allowable missing data %
bashcode='sequ=\1;dashes=${sequ//[^-]/};echo $(( ${#dashes}*100/${#sequ} > '$X' ))'
sed -r "/>/{N; /-/{h; s:.*\n(.*):$bashcode:e; /1/d; /0/x}}" infile > outfile


kmkocot 12-02-2009 04:09 PM

I love this site.

Chris: Thank you for taking a whack at it. Solution 2 is close but there are some problems. Try it with the input below. It removes FAKE_100% and FAKE_LAST but that is all. Also, I fully admit I know nothing about perl so don't laugh... is there a way to do something like "for file in *.txt do..." with perl so I can automate the task?
Code:

>FAKE_0%
MVMISVLAMALNRITAAERR
>FAKE_100%
--------------------
>FAKE_60%
MVRMNVLA------------
>FAKE_55%
MVRMNVLAD-----------
>FAKE_50%
MVRMNVLADA----------
>FAKE_45%
MVRMNVLADAL---------
>FAKE_40%
MVMISVLAMALN--------
>FAKE_35%
MVMISVLAMALNR-------
>FAKE_30%
MVMISVLAMALNRI------
>FAKE_first
M-------------------
>FAKE_first_last
M------------------R
>FAKE_last
-------------------R

Sasha: The number of characters on the even numbered lines (lines NOT containing greater-than signs) varies from file to file but within a file, the number of characters on those lines should be the same in each sequence. When I delete a sequence I want to remove both the header (line containing the greater-than sign) and the sequence itself (line containing dashes and A-Z). You are right that the dashes might not always be found together and the number and pattern they make up is variable. The second script is good but it would be nice if I could set it at a percentage cutoff so I could run it on all files at once.

Ghostdog: When I ran your awk script on the example input file above it returned
Code:

>FAKE_0%
MVMISVLAMALNRITAAERR

I will try to wrap my head around this but for now is there an easy way to adjust the % dash cutoff?

Ken: I ran your sed script on the example input file above but it returned this:
Code:

sh: Bad substitution
sh: Bad substitution
sh: Bad substitution
sh: Bad substitution
sh: Bad substitution
sh: Bad substitution
sh: Bad substitution
sh: Bad substitution
sh: Bad substitution
sh: Bad substitution
sh: Bad substitution

I'll work on this one as well but if you know what the problem might be, please let me know.

Thanks!
Kevin

GrapefruiTgirl 12-02-2009 06:12 PM

bash + bc
 
OK, here's the version which deals with the dashes as a % of the string, and removes string pairs with a larger % of dashes than the $percent variable.

Code:

#!/bin/bash
# script is called "dnaclean"
# strips headers and sequences with >X% of dashes.
percent=10

input_filename="dna"
# backup inputfile to .OLD
cp $input_filename "$input_filename.OLD"
# temporary filename:
newfile="$input_filename-new"
# make sure it don't exist yet:
rm $newfile 2>>/dev/null

# stick file data into array:
count=-1
for i in $(tac $input_filename); do
 count=$((count+1))
 element[$count]=$i
done

# run thru array checking percents of dashes:
a=-2
while [ $a -le $count ]; do
 a=$((a+2)); b=$((a+1))
 fields="$(echo "${element[a]}" | gawk -v FS=- '{print NF}')"
 [ "$fields" = "0" ] && fields="1"
 dashes=$((fields - 1)); elemnta=${element[a]}; strlen=${#elemnta}
 [ "$strlen" = "0" ] && break

 percent_of_string_are_dashes="$(echo "$dashes / $strlen * 100" | bc -l)"
 calculate_diff="$(echo "$percent_of_string_are_dashes > $percent" | bc -l)"
 if [ "$calculate_diff" = "0" ]; then
  echo ${element[a]} >> $newfile
  echo ${element[b]} >> $newfile
 fi;
done

# put data back in right order
# and put edited file in place of original
# and remove temp file:
if [ -f $newfile ]; then
 tac $newfile > $newfile-final
 mv -f $newfile-final $input_filename
else
mv -f "$input_filename.OLD" "$input_filename"
fi;

Cheers,
Sasha

chrism01 12-02-2009 06:23 PM

Ah well, that data is significantly different ;)

Try this
Code:

#!/usr/bin/perl -w
use strict;
my (
    $infile, $limit, $len,  $rec, $cnt, $hdr_rec
  );

# Remove lines with eg 50 % '---'
$limit = .5;

for $infile ( @ARGV )
{
    open( FILE, "<", "$infile" ) or
          die "Can't open file: $infile: $!\n";
    while ( defined ( $rec = <FILE> ) )
    {
        # Remove unwanted chars
        chomp $rec;                # newline

        # check for header rec
        if( $rec =~ /^>/ )
        {
            $hdr_rec = $rec;
            next;
        }

        # Check sequence rec for '-'
        $len = length($rec);
        $cnt = ($rec =~ tr/-//);  # count num of '-'s
        next if( $cnt/$len > $limit );  # pcent > limit; skip rec

        print "$hdr_rec\n";
        print "$rec\n";
    }

    close(FILE) or die "Can't close file: $infile: $!\n";
}

called as

./lq2.pl t.t t1.t
or you can use wildcards, remembering not to include the program in the list eh?
:)

ghostdog74 12-02-2009 07:39 PM

Quote:

Originally Posted by kmkocot;3777621

Ghostdog: When I ran your awk script on the example input file above it returned
[CODE
>FAKE_0%
MVMISVLAMALNRITAAERR
[/CODE]

I may have interpreted your output requirement wrongly, but isn't that what you want ? remove those sequences with missing data (dashes)? from your sample, the above doesn't have missing data and the rest is removed.

chrism01 12-02-2009 09:59 PM

@Ghostdog; actually he wants to be able to specify a fraction/percentage of dashes that determines whether a pair of lines is removed. See the $limit var in my code; I've set it to 0.5 (ie 50%).

Kenhelm 12-02-2009 10:53 PM

The code I posted runs without giving error messages when I try it on both these systems:
GNU bash, version 3.00.16(2)-release (i586-mandriva-linux-gnu)
GNU sed version 4.1.4
and
GNU bash, version 4.00.33(2)-release (i586-mandriva-linux-gnu)
GNU sed version 4.2
Which versions are you using?

The 'sh: Bad substitution' message might be from
${sequ//[^-]/}
or
${#dash}*100/${#sequ}

Try replacing
dashes=${sequ//[^-]/}
with
dashes=$(echo "$sequ" | tr -cd "-")

Code:

#!/bin/bash
X=50      # Allowable missing data %
bashcode='sequ=\1;dashes=$(echo "$sequ" | tr -cd "-");echo $(( ${#dashes}*100/${#sequ} > '$X' ))'
sed -r "/>/{N; /-/{h; s:.*\n(.*):$bashcode:e; /1/d; /0/x}}" infile > outfile


kmkocot 12-02-2009 11:27 PM

Proof that there is more than one way to skin a cat. All three of those work wonderfully. Once again, thank you very much for your help.


All times are GMT -5. The time now is 03:59 AM.