LinuxQuestions.org
Latest LQ Deal: Linux Power User Bundle
Go Back   LinuxQuestions.org > Forums > Linux Forums > Linux - Newbie
User Name
Password
Linux - Newbie This Linux forum is for members that are new to Linux.
Just starting out and have a question? If it is not in the man pages or the how-to's this is the place!

Notices


Reply
  Search this Thread
Old 12-01-2009, 04:05 PM   #1
kmkocot
Member
 
Registered: Dec 2007
Location: Queensland, Australia
Posts: 122

Rep: Reputation: 15
Need help writing a script to remove lines in which >X% of the characters are dashes


Hi all,

I have another request for scripting help

I have a folder with many input files formatted like this:
Code:
>14432|LGIG|186221
MVMISVLAMALNRITAAEKR
>14432|CAP1|21057
MVRVNVLADALKSI-TAEKR
>14432|HROB|156827
--RMNVLADALKSICNAEKR
>14432|NVEC|159589
MVRVNVLNDALNSICNAEKR
>14432|NEOM|short
----------------AEKR
Each pair of lines represents one amino acid sequence where the first line is the header and the second line is the sequence data. Dashes represent missing data (or gaps). I would like to write a script that searches through each file and removes sequences that contain >X% missing data.

My first thought was to 1) count the number of characters per line, 2) count the number of characters per line that aren't dashes, 3) combine those outputs into one file with two columns, 4) divide $2 by $1, and 5) remove lines in the original file in which that output value falls below the threshold. I can't figure out how to do step 5, though. Am I making this harder than it is? Any suggestions would be greatly appreciated.

Here's the code for where I was heading.
Code:
for file in *.fa
do
while read line; do echo "$line" | tr -cd  [:alnum:][:graph:] | wc -c; done < 14432.fa > output1.txt
#prints the number of characters on each line

while read line; do echo "$line" | tr -cd  [:alnum:][:graph:] | sed 's/-//g' | wc -c; done < 14432.fa > output2.txt
#prints the number of characters on each line that aren't -

paste output1.txt output2.txt > output3.txt

done
awk '{ score = $2 / $1; print score }' output3.txt

#5?
Thanks a ton!
Kevin

Last edited by kmkocot; 12-01-2009 at 04:07 PM.
 
Old 12-01-2009, 04:27 PM   #2
GrapefruiTgirl
LQ Guru
 
Registered: Dec 2006
Location: underground
Distribution: Slackware64
Posts: 7,594

Rep: Reputation: 551Reputation: 551Reputation: 551Reputation: 551Reputation: 551Reputation: 551
Hmm.. Will the data sequences you wish to remove, always have the same number of ----'s in them?

And, do you want to remove the header for the removed data sequence as well? (probably yes)

And that's it? If so, I think you're over-complicating the task, but I commend your progress so far.

I'll fiddle with it in a few minutes, if I'm understanding the problem correctly.

Sasha
 
Old 12-01-2009, 05:20 PM   #3
pixellany
LQ Veteran
 
Registered: Nov 2005
Location: Annapolis, MD
Distribution: Arch/XFCE
Posts: 17,802

Rep: Reputation: 738Reputation: 738Reputation: 738Reputation: 738Reputation: 738Reputation: 738Reputation: 738
We just MUST feed this into the SED vs. AWK battle......
 
Old 12-01-2009, 09:43 PM   #4
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Centos 6.9, Centos 7.3
Posts: 17,374

Rep: Reputation: 2383Reputation: 2383Reputation: 2383Reputation: 2383Reputation: 2383Reputation: 2383Reputation: 2383Reputation: 2383Reputation: 2383Reputation: 2383Reputation: 2383
Soln1 (remove bad seq recs only)
Code:
#!/usr/bin/perl -w
use strict;
my (
    $limit, $len,  $rec, $cnt
   );

# Remove lines with eg 50 % '---'
$limit = .5;
open( FILE, "<", "t.t" ) or
       die "Can't open file: t.t: $!\n";

while ( defined ( $rec = <FILE> ) )
{
   # Remove unwanted chars
   chomp $rec;                 # newline
    # if rec starts with '-'
    if( $rec =~ /^-/ )
    {
        $len = length($rec);
        $cnt = ($rec =~ tr/-//);   # count num of '-'s
        next if( $cnt/$len > $limit );  # pcent > limit; skip rec
    }
    print "$rec\n";
}

close(FILE) or die "Can't close file: t.t: $!\n";
Soln 2 (remove bad seq rec and associated hdr rec as per GrapefruiTgirl)
Code:
#!/usr/bin/perl -w
use strict;
my (
    $limit, $len,  $rec, $cnt, $hdr_rec
   );

# Remove lines with eg 50 % '---'
$limit = .5;
open( FILE, "<", "t.t" ) or
       die "Can't open file: t.t: $!\n";
while ( defined ( $rec = <FILE> ) )
{
    # Remove unwanted chars
    chomp $rec;                 # newline

    # check for header rec
    if( $rec =~ /^>/ )
    {
        $hdr_rec = $rec;
        next;
    }

    # Check for sequence rec
    if( $rec =~ /^-/ )
    {
        $len = length($rec);
        $cnt = ($rec =~ tr/-//);   # count num of '-'s
        next if( $cnt/$len > $limit );  # pcent > limit; skip rec
    }
    print "$hdr_rec\n";
    print "$rec\n";
}

close(FILE) or die "Can't close file: t.t: $!\n";
based on copy of OP's data
 
Old 12-01-2009, 09:45 PM   #5
exvor
Senior Member
 
Registered: Jul 2004
Location: Phoenix, Arizona
Distribution: Gentoo, LFS, Debian,Ubuntu
Posts: 1,537

Rep: Reputation: 87
Quote:
Originally Posted by pixellany View Post
We just MUST feed this into the SED vs. AWK battle......
Stirring up the natives again huh pixel...
 
Old 12-01-2009, 10:19 PM   #6
GrapefruiTgirl
LQ Guru
 
Registered: Dec 2006
Location: underground
Distribution: Slackware64
Posts: 7,594

Rep: Reputation: 551Reputation: 551Reputation: 551Reputation: 551Reputation: 551Reputation: 551
strip dna sequences.

LOL.. Battle indeed!

How about something somewhat simple; Here's some tac and sed:

I made the input file called "dna". Into it, I put a bunch of your sample file from post 1, just copied and pasted it in there. The script is called "dnaclean"

dna before operation:

Quote:
sasha@reactor:~/test$ cat dna
>14432|CAP1|21057
MVRVNVLADALKSI-TAEKR
>14432|HROB|156827
--RMNVLADALKSICNAEKR
>14432|NVEC|159589
MVRVNVLNDALNSICNAEKR
>14432|CAP1|21057
MVRVNVLADALKSI-TAEKR
>14432|HROB|156827
--RMNVLADALKSICNAEKR
>14432|NVEC|159589
MVRVNVLNDALNSICNAEKR
>14432|NEOM|short
----------------AEKR
>14432|CAP1|21057
MVRVNVLADALKSI-TAEKR
>14432|HROB|156827
--RMNVLADALKSICNAEKR
>14432|NVEC|159589
MVRVNVLNDALNSICNAEKR
>14432|NEOM|short
----------------AEKR
>14432|CAP1|21057
MVRVNVLADALKSI-TAEKR
>14432|HROB|156827
--RMNVLADALKSICNAEKR
>14432|NVEC|159589
MVRVNVLNDALNSICNAEKR
>14432|NEOM|short
----------------AEKR
dna after operation:

Quote:
sasha@reactor:~/test$ sh dnaclean
sasha@reactor:~/test$ cat dna
>14432|CAP1|21057
MVRVNVLADALKSI-TAEKR
>14432|HROB|156827
--RMNVLADALKSICNAEKR
>14432|NVEC|159589
MVRVNVLNDALNSICNAEKR
>14432|CAP1|21057
MVRVNVLADALKSI-TAEKR
>14432|HROB|156827
--RMNVLADALKSICNAEKR
>14432|NVEC|159589
MVRVNVLNDALNSICNAEKR
>14432|CAP1|21057
MVRVNVLADALKSI-TAEKR
>14432|HROB|156827
--RMNVLADALKSICNAEKR
>14432|NVEC|159589
MVRVNVLNDALNSICNAEKR
>14432|CAP1|21057
MVRVNVLADALKSI-TAEKR
>14432|HROB|156827
--RMNVLADALKSICNAEKR
>14432|NVEC|159589
MVRVNVLNDALNSICNAEKR
sasha@reactor:~/test$
dnaclean:

Code:
#!/bin/bash
# script is called "dnaclean"
# strips headers and sequences with four dashes.

input_filename="dna"

count=0
for i in $(tac $input_filename); do
element[$count]=$i
count=$((count+1))
done

for ((a=0;a<=$count;a++)); do
if [ "$(echo ${element[a]} | grep -- ----)" ]; then
sed -i "/${element[a]}/d;/${element[((a+1))]}/d" "$input_filename";
a=$((a+1))
fi;
done
It should be rather trivial to edit the script to operate on every file in a given directory.

PS - It just occurred to me that the dashes may not be necessarily grouped together, in which case this won't work. Maybe that's why you wanted to count the dashes? Ok, see below:

Code:
#!/bin/bash
# script is called "dnaclean"
# strips headers and sequences with three dashes.

input_filename="dna"
newfile="$input_filename-new"
rm $newfile

dashes=3

fields="$((dashes+1))"
count=0

for i in $(tac $input_filename); do
element[$count]=$i
count=$((count+1))
done

for ((a=0;a<=$count;a++)); do
b=$((a+1))
if [ ! "$(echo "${element[a]}" | gawk -v FS=- '{print NF}')" = "$fields" ]; then
echo ${element[a]} >> $newfile
echo ${element[b]} >> $newfile
fi;
a=$((a+1))
done

tac $newfile > $newfile-final
Sasha

PS - Ghostdog's script below is wonderfully compact -- so much so that I don't know how it works -- there's no mention of the 'dash' in there. Hmmm.

Last edited by GrapefruiTgirl; 12-02-2009 at 12:09 AM. Reason: add more silly code :-)
 
Old 12-01-2009, 10:59 PM   #7
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 244Reputation: 244Reputation: 244
gawk
Code:
 gawk -vRS="\n>" '!/--+/&&NR>1{print RT,$0}NR==1&&!/--+/' file

Last edited by ghostdog74; 12-02-2009 at 03:37 AM.
 
Old 12-02-2009, 02:59 AM   #8
Kenhelm
Member
 
Registered: Mar 2008
Location: N. W. England
Distribution: Mandriva
Posts: 333

Rep: Reputation: 141Reputation: 141
This uses the GNU sed 'e' command to execute some bash commands in the sed pattern space.
Code:
X=50       # Allowable missing data %
bashcode='sequ=\1;dashes=${sequ//[^-]/};echo $(( ${#dashes}*100/${#sequ} > '$X' ))'
sed -rn "/>/{N;h; s:.*\n(.*):$bashcode:e; /0/{x;p}}" infile > outfile
EDIT
This should run faster as it restricts the bash code just to those sequences which have a '-' in them instead of doing the calculation for every sequence.
Code:
X=50       # Allowable missing data %
bashcode='sequ=\1;dashes=${sequ//[^-]/};echo $(( ${#dashes}*100/${#sequ} > '$X' ))'
sed -r "/>/{N; /-/{h; s:.*\n(.*):$bashcode:e; /1/d; /0/x}}" infile > outfile

Last edited by Kenhelm; 12-02-2009 at 03:57 AM.
 
Old 12-02-2009, 04:09 PM   #9
kmkocot
Member
 
Registered: Dec 2007
Location: Queensland, Australia
Posts: 122

Original Poster
Rep: Reputation: 15
I love this site.

Chris: Thank you for taking a whack at it. Solution 2 is close but there are some problems. Try it with the input below. It removes FAKE_100% and FAKE_LAST but that is all. Also, I fully admit I know nothing about perl so don't laugh... is there a way to do something like "for file in *.txt do..." with perl so I can automate the task?
Code:
>FAKE_0%
MVMISVLAMALNRITAAERR
>FAKE_100%
--------------------
>FAKE_60%
MVRMNVLA------------
>FAKE_55%
MVRMNVLAD-----------
>FAKE_50%
MVRMNVLADA----------
>FAKE_45%
MVRMNVLADAL---------
>FAKE_40%
MVMISVLAMALN--------
>FAKE_35%
MVMISVLAMALNR-------
>FAKE_30%
MVMISVLAMALNRI------
>FAKE_first
M-------------------
>FAKE_first_last
M------------------R
>FAKE_last
-------------------R
Sasha: The number of characters on the even numbered lines (lines NOT containing greater-than signs) varies from file to file but within a file, the number of characters on those lines should be the same in each sequence. When I delete a sequence I want to remove both the header (line containing the greater-than sign) and the sequence itself (line containing dashes and A-Z). You are right that the dashes might not always be found together and the number and pattern they make up is variable. The second script is good but it would be nice if I could set it at a percentage cutoff so I could run it on all files at once.

Ghostdog: When I ran your awk script on the example input file above it returned
Code:
>FAKE_0%
MVMISVLAMALNRITAAERR
I will try to wrap my head around this but for now is there an easy way to adjust the % dash cutoff?

Ken: I ran your sed script on the example input file above but it returned this:
Code:
sh: Bad substitution
sh: Bad substitution
sh: Bad substitution
sh: Bad substitution
sh: Bad substitution
sh: Bad substitution
sh: Bad substitution
sh: Bad substitution
sh: Bad substitution
sh: Bad substitution
sh: Bad substitution
I'll work on this one as well but if you know what the problem might be, please let me know.

Thanks!
Kevin

Last edited by kmkocot; 12-02-2009 at 04:10 PM.
 
Old 12-02-2009, 06:12 PM   #10
GrapefruiTgirl
LQ Guru
 
Registered: Dec 2006
Location: underground
Distribution: Slackware64
Posts: 7,594

Rep: Reputation: 551Reputation: 551Reputation: 551Reputation: 551Reputation: 551Reputation: 551
bash + bc

OK, here's the version which deals with the dashes as a % of the string, and removes string pairs with a larger % of dashes than the $percent variable.

Code:
#!/bin/bash
# script is called "dnaclean"
# strips headers and sequences with >X% of dashes.
percent=10

input_filename="dna"
# backup inputfile to .OLD
cp $input_filename "$input_filename.OLD"
# temporary filename:
newfile="$input_filename-new"
# make sure it don't exist yet:
rm $newfile 2>>/dev/null

# stick file data into array:
count=-1
for i in $(tac $input_filename); do
 count=$((count+1))
 element[$count]=$i
done

# run thru array checking percents of dashes:
a=-2
while [ $a -le $count ]; do
 a=$((a+2)); b=$((a+1))
 fields="$(echo "${element[a]}" | gawk -v FS=- '{print NF}')"
 [ "$fields" = "0" ] && fields="1"
 dashes=$((fields - 1)); elemnta=${element[a]}; strlen=${#elemnta}
 [ "$strlen" = "0" ] && break

 percent_of_string_are_dashes="$(echo "$dashes / $strlen * 100" | bc -l)"
 calculate_diff="$(echo "$percent_of_string_are_dashes > $percent" | bc -l)"
 if [ "$calculate_diff" = "0" ]; then
  echo ${element[a]} >> $newfile
  echo ${element[b]} >> $newfile
 fi;
done

# put data back in right order
# and put edited file in place of original
# and remove temp file:
if [ -f $newfile ]; then
 tac $newfile > $newfile-final
 mv -f $newfile-final $input_filename
else
mv -f "$input_filename.OLD" "$input_filename"
fi;
Cheers,
Sasha

Last edited by GrapefruiTgirl; 12-02-2009 at 06:14 PM.
 
Old 12-02-2009, 06:23 PM   #11
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Centos 6.9, Centos 7.3
Posts: 17,374

Rep: Reputation: 2383Reputation: 2383Reputation: 2383Reputation: 2383Reputation: 2383Reputation: 2383Reputation: 2383Reputation: 2383Reputation: 2383Reputation: 2383Reputation: 2383
Ah well, that data is significantly different

Try this
Code:
#!/usr/bin/perl -w
use strict;
my (
    $infile, $limit, $len,  $rec, $cnt, $hdr_rec
   );

# Remove lines with eg 50 % '---'
$limit = .5;

for $infile ( @ARGV )
{
    open( FILE, "<", "$infile" ) or
           die "Can't open file: $infile: $!\n";
    while ( defined ( $rec = <FILE> ) )
    {
        # Remove unwanted chars
        chomp $rec;                 # newline

        # check for header rec
        if( $rec =~ /^>/ )
        {
            $hdr_rec = $rec;
            next;
        }

        # Check sequence rec for '-'
        $len = length($rec);
        $cnt = ($rec =~ tr/-//);   # count num of '-'s
        next if( $cnt/$len > $limit );  # pcent > limit; skip rec

        print "$hdr_rec\n";
        print "$rec\n";
    }

    close(FILE) or die "Can't close file: $infile: $!\n";
}
called as

./lq2.pl t.t t1.t
or you can use wildcards, remembering not to include the program in the list eh?
 
Old 12-02-2009, 07:39 PM   #12
ghostdog74
Senior Member
 
Registered: Aug 2006
Posts: 2,697
Blog Entries: 5

Rep: Reputation: 244Reputation: 244Reputation: 244
Quote:
Originally Posted by kmkocot;3777621

Ghostdog: When I ran your awk script on the example input file above it returned
[CODE
>FAKE_0%
MVMISVLAMALNRITAAERR
[/CODE]
I may have interpreted your output requirement wrongly, but isn't that what you want ? remove those sequences with missing data (dashes)? from your sample, the above doesn't have missing data and the rest is removed.
 
Old 12-02-2009, 09:59 PM   #13
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Centos 6.9, Centos 7.3
Posts: 17,374

Rep: Reputation: 2383Reputation: 2383Reputation: 2383Reputation: 2383Reputation: 2383Reputation: 2383Reputation: 2383Reputation: 2383Reputation: 2383Reputation: 2383Reputation: 2383
@Ghostdog; actually he wants to be able to specify a fraction/percentage of dashes that determines whether a pair of lines is removed. See the $limit var in my code; I've set it to 0.5 (ie 50%).
 
Old 12-02-2009, 10:53 PM   #14
Kenhelm
Member
 
Registered: Mar 2008
Location: N. W. England
Distribution: Mandriva
Posts: 333

Rep: Reputation: 141Reputation: 141
The code I posted runs without giving error messages when I try it on both these systems:
GNU bash, version 3.00.16(2)-release (i586-mandriva-linux-gnu)
GNU sed version 4.1.4
and
GNU bash, version 4.00.33(2)-release (i586-mandriva-linux-gnu)
GNU sed version 4.2
Which versions are you using?

The 'sh: Bad substitution' message might be from
${sequ//[^-]/}
or
${#dash}*100/${#sequ}

Try replacing
dashes=${sequ//[^-]/}
with
dashes=$(echo "$sequ" | tr -cd "-")

Code:
#!/bin/bash
X=50      # Allowable missing data %
bashcode='sequ=\1;dashes=$(echo "$sequ" | tr -cd "-");echo $(( ${#dashes}*100/${#sequ} > '$X' ))'
sed -r "/>/{N; /-/{h; s:.*\n(.*):$bashcode:e; /1/d; /0/x}}" infile > outfile
 
Old 12-02-2009, 11:27 PM   #15
kmkocot
Member
 
Registered: Dec 2007
Location: Queensland, Australia
Posts: 122

Original Poster
Rep: Reputation: 15
Proof that there is more than one way to skin a cat. All three of those work wonderfully. Once again, thank you very much for your help.
 
  


Reply


Thread Tools Search this Thread
Search this Thread:

Advanced Search

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off



Similar Threads
Thread Thread Starter Forum Replies Last Post
bash script to remove first characters from every line (00) Linux - General 8 08-01-2011 10:28 AM
script to remove the lines which are having the duplicate value in 2 fields ajcapri Linux - Newbie 10 11-29-2009 09:38 PM
How to remove first 2 lines of a file in a script nazs Programming 16 02-19-2007 07:08 AM
Console gives dashes instead of lines rob207 Linux - General 1 10-14-2006 10:30 AM
Make a script remove lines from a file? spiffytech Linux - Software 5 12-29-2005 11:50 AM

LinuxQuestions.org > Forums > Linux Forums > Linux - Newbie

All times are GMT -5. The time now is 05:43 PM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Facebook: linuxquestions Google+: linuxquestions
Open Source Consulting | Domain Registration