Need help writing a script to remove lines in which >X% of the characters are dashes
Hi all,
I have another request for scripting help :D I have a folder with many input files formatted like this: Code:
>14432|LGIG|186221 My first thought was to 1) count the number of characters per line, 2) count the number of characters per line that aren't dashes, 3) combine those outputs into one file with two columns, 4) divide $2 by $1, and 5) remove lines in the original file in which that output value falls below the threshold. I can't figure out how to do step 5, though. Am I making this harder than it is? Any suggestions would be greatly appreciated. Here's the code for where I was heading. Code:
for file in *.fa Kevin |
Hmm.. Will the data sequences you wish to remove, always have the same number of ----'s in them?
And, do you want to remove the header for the removed data sequence as well? (probably yes) And that's it? If so, I think you're over-complicating the task, but I commend your progress so far. I'll fiddle with it in a few minutes, if I'm understanding the problem correctly. Sasha |
We just MUST feed this into the SED vs. AWK battle......;)
|
Soln1 (remove bad seq recs only)
Code:
#!/usr/bin/perl -w Code:
#!/usr/bin/perl -w |
Quote:
|
strip dna sequences.
LOL.. Battle indeed!
How about something somewhat simple; Here's some tac and sed: I made the input file called "dna". Into it, I put a bunch of your sample file from post 1, just copied and pasted it in there. The script is called "dnaclean" dna before operation: Quote:
Quote:
Code:
#!/bin/bash PS - It just occurred to me that the dashes may not be necessarily grouped together, in which case this won't work. Maybe that's why you wanted to count the dashes? Ok, see below: Code:
#!/bin/bash PS - Ghostdog's script below is wonderfully compact -- so much so that I don't know how it works :scratch: -- there's no mention of the 'dash' in there. Hmmm. |
gawk
Code:
gawk -vRS="\n>" '!/--+/&&NR>1{print RT,$0}NR==1&&!/--+/' file |
This uses the GNU sed 'e' command to execute some bash commands in the sed pattern space.
Code:
X=50 # Allowable missing data % This should run faster as it restricts the bash code just to those sequences which have a '-' in them instead of doing the calculation for every sequence. Code:
X=50 # Allowable missing data % |
I love this site.
Chris: Thank you for taking a whack at it. Solution 2 is close but there are some problems. Try it with the input below. It removes FAKE_100% and FAKE_LAST but that is all. Also, I fully admit I know nothing about perl so don't laugh... is there a way to do something like "for file in *.txt do..." with perl so I can automate the task? Code:
>FAKE_0% Ghostdog: When I ran your awk script on the example input file above it returned Code:
>FAKE_0% Ken: I ran your sed script on the example input file above but it returned this: Code:
sh: Bad substitution Thanks! Kevin |
bash + bc
OK, here's the version which deals with the dashes as a % of the string, and removes string pairs with a larger % of dashes than the $percent variable.
Code:
#!/bin/bash Sasha |
Ah well, that data is significantly different ;)
Try this Code:
#!/usr/bin/perl -w ./lq2.pl t.t t1.t or you can use wildcards, remembering not to include the program in the list eh? :) |
Quote:
|
@Ghostdog; actually he wants to be able to specify a fraction/percentage of dashes that determines whether a pair of lines is removed. See the $limit var in my code; I've set it to 0.5 (ie 50%).
|
The code I posted runs without giving error messages when I try it on both these systems:
GNU bash, version 3.00.16(2)-release (i586-mandriva-linux-gnu) GNU sed version 4.1.4 and GNU bash, version 4.00.33(2)-release (i586-mandriva-linux-gnu) GNU sed version 4.2 Which versions are you using? The 'sh: Bad substitution' message might be from ${sequ//[^-]/} or ${#dash}*100/${#sequ} Try replacing dashes=${sequ//[^-]/} with dashes=$(echo "$sequ" | tr -cd "-") Code:
#!/bin/bash |
Proof that there is more than one way to skin a cat. All three of those work wonderfully. Once again, thank you very much for your help.
|
All times are GMT -5. The time now is 03:59 AM. |