[SOLVED] Need help writing a script to remove lines in which >X% of the characters are dashes
Linux - NewbieThis Linux forum is for members that are new to Linux.
Just starting out and have a question?
If it is not in the man pages or the how-to's this is the place!
Notices
Welcome to LinuxQuestions.org, a friendly and active Linux Community.
You are currently viewing LQ as a guest. By joining our community you will have the ability to post topics, receive our newsletter, use the advanced search, subscribe to threads and access many other special features. Registration is quick, simple and absolutely free. Join our community today!
Note that registered members see fewer ads, and ContentLink is completely disabled once you log in.
If you have any problems with the registration process or your account login, please contact us. If you need to reset your password, click here.
Having a problem logging in? Please visit this page to clear all LQ-related cookies.
Get a virtual cloud desktop with the Linux distro that you want in less than five minutes with Shells! With over 10 pre-installed distros to choose from, the worry-free installation life is here! Whether you are a digital nomad or just looking for flexibility, Shells can put your Linux machine on the device that you want to use.
Exclusive for LQ members, get up to 45% off per month. Click here for more info.
Each pair of lines represents one amino acid sequence where the first line is the header and the second line is the sequence data. Dashes represent missing data (or gaps). I would like to write a script that searches through each file and removes sequences that contain >X% missing data.
My first thought was to 1) count the number of characters per line, 2) count the number of characters per line that aren't dashes, 3) combine those outputs into one file with two columns, 4) divide $2 by $1, and 5) remove lines in the original file in which that output value falls below the threshold. I can't figure out how to do step 5, though. Am I making this harder than it is? Any suggestions would be greatly appreciated.
Here's the code for where I was heading.
Code:
for file in *.fa
do
while read line; do echo "$line" | tr -cd [:alnum:][:graph:] | wc -c; done < 14432.fa > output1.txt
#prints the number of characters on each line
while read line; do echo "$line" | tr -cd [:alnum:][:graph:] | sed 's/-//g' | wc -c; done < 14432.fa > output2.txt
#prints the number of characters on each line that aren't -
paste output1.txt output2.txt > output3.txt
done
awk '{ score = $2 / $1; print score }' output3.txt
#5?
How about something somewhat simple; Here's some tac and sed:
I made the input file called "dna". Into it, I put a bunch of your sample file from post 1, just copied and pasted it in there. The script is called "dnaclean"
#!/bin/bash
# script is called "dnaclean"
# strips headers and sequences with four dashes.
input_filename="dna"
count=0
for i in $(tac $input_filename); do
element[$count]=$i
count=$((count+1))
done
for ((a=0;a<=$count;a++)); do
if [ "$(echo ${element[a]} | grep -- ----)" ]; then
sed -i "/${element[a]}/d;/${element[((a+1))]}/d" "$input_filename";
a=$((a+1))
fi;
done
It should be rather trivial to edit the script to operate on every file in a given directory.
PS - It just occurred to me that the dashes may not be necessarily grouped together, in which case this won't work. Maybe that's why you wanted to count the dashes? Ok, see below:
Code:
#!/bin/bash
# script is called "dnaclean"
# strips headers and sequences with three dashes.
input_filename="dna"
newfile="$input_filename-new"
rm $newfile
dashes=3
fields="$((dashes+1))"
count=0
for i in $(tac $input_filename); do
element[$count]=$i
count=$((count+1))
done
for ((a=0;a<=$count;a++)); do
b=$((a+1))
if [ ! "$(echo "${element[a]}" | gawk -v FS=- '{print NF}')" = "$fields" ]; then
echo ${element[a]} >> $newfile
echo ${element[b]} >> $newfile
fi;
a=$((a+1))
done
tac $newfile > $newfile-final
Sasha
PS - Ghostdog's script below is wonderfully compact -- so much so that I don't know how it works -- there's no mention of the 'dash' in there. Hmmm.
Last edited by GrapefruiTgirl; 12-02-2009 at 12:09 AM.
Reason: add more silly code :-)
This uses the GNU sed 'e' command to execute some bash commands in the sed pattern space.
Code:
X=50 # Allowable missing data %
bashcode='sequ=\1;dashes=${sequ//[^-]/};echo $(( ${#dashes}*100/${#sequ} > '$X' ))'
sed -rn "/>/{N;h; s:.*\n(.*):$bashcode:e; /0/{x;p}}" infile > outfile
EDIT
This should run faster as it restricts the bash code just to those sequences which have a '-' in them instead of doing the calculation for every sequence.
Chris: Thank you for taking a whack at it. Solution 2 is close but there are some problems. Try it with the input below. It removes FAKE_100% and FAKE_LAST but that is all. Also, I fully admit I know nothing about perl so don't laugh... is there a way to do something like "for file in *.txt do..." with perl so I can automate the task?
Sasha: The number of characters on the even numbered lines (lines NOT containing greater-than signs) varies from file to file but within a file, the number of characters on those lines should be the same in each sequence. When I delete a sequence I want to remove both the header (line containing the greater-than sign) and the sequence itself (line containing dashes and A-Z). You are right that the dashes might not always be found together and the number and pattern they make up is variable. The second script is good but it would be nice if I could set it at a percentage cutoff so I could run it on all files at once.
Ghostdog: When I ran your awk script on the example input file above it returned
Code:
>FAKE_0%
MVMISVLAMALNRITAAERR
I will try to wrap my head around this but for now is there an easy way to adjust the % dash cutoff?
Ken: I ran your sed script on the example input file above but it returned this:
Code:
sh: Bad substitution
sh: Bad substitution
sh: Bad substitution
sh: Bad substitution
sh: Bad substitution
sh: Bad substitution
sh: Bad substitution
sh: Bad substitution
sh: Bad substitution
sh: Bad substitution
sh: Bad substitution
I'll work on this one as well but if you know what the problem might be, please let me know.
OK, here's the version which deals with the dashes as a % of the string, and removes string pairs with a larger % of dashes than the $percent variable.
Code:
#!/bin/bash
# script is called "dnaclean"
# strips headers and sequences with >X% of dashes.
percent=10
input_filename="dna"
# backup inputfile to .OLD
cp $input_filename "$input_filename.OLD"
# temporary filename:
newfile="$input_filename-new"
# make sure it don't exist yet:
rm $newfile 2>>/dev/null
# stick file data into array:
count=-1
for i in $(tac $input_filename); do
count=$((count+1))
element[$count]=$i
done
# run thru array checking percents of dashes:
a=-2
while [ $a -le $count ]; do
a=$((a+2)); b=$((a+1))
fields="$(echo "${element[a]}" | gawk -v FS=- '{print NF}')"
[ "$fields" = "0" ] && fields="1"
dashes=$((fields - 1)); elemnta=${element[a]}; strlen=${#elemnta}
[ "$strlen" = "0" ] && break
percent_of_string_are_dashes="$(echo "$dashes / $strlen * 100" | bc -l)"
calculate_diff="$(echo "$percent_of_string_are_dashes > $percent" | bc -l)"
if [ "$calculate_diff" = "0" ]; then
echo ${element[a]} >> $newfile
echo ${element[b]} >> $newfile
fi;
done
# put data back in right order
# and put edited file in place of original
# and remove temp file:
if [ -f $newfile ]; then
tac $newfile > $newfile-final
mv -f $newfile-final $input_filename
else
mv -f "$input_filename.OLD" "$input_filename"
fi;
Cheers,
Sasha
Last edited by GrapefruiTgirl; 12-02-2009 at 06:14 PM.
Ghostdog: When I ran your awk script on the example input file above it returned
[CODE
>FAKE_0%
MVMISVLAMALNRITAAERR
[/CODE]
I may have interpreted your output requirement wrongly, but isn't that what you want ? remove those sequences with missing data (dashes)? from your sample, the above doesn't have missing data and the rest is removed.
@Ghostdog; actually he wants to be able to specify a fraction/percentage of dashes that determines whether a pair of lines is removed. See the $limit var in my code; I've set it to 0.5 (ie 50%).
The code I posted runs without giving error messages when I try it on both these systems:
GNU bash, version 3.00.16(2)-release (i586-mandriva-linux-gnu)
GNU sed version 4.1.4
and
GNU bash, version 4.00.33(2)-release (i586-mandriva-linux-gnu)
GNU sed version 4.2
Which versions are you using?
The 'sh: Bad substitution' message might be from
${sequ//[^-]/}
or
${#dash}*100/${#sequ}
Try replacing
dashes=${sequ//[^-]/}
with
dashes=$(echo "$sequ" | tr -cd "-")
LinuxQuestions.org is looking for people interested in writing
Editorials, Articles, Reviews, and more. If you'd like to contribute
content, let us know.