LinuxQuestions.org


tgarvin 04-22-2011 03:51 PM

How to count and eliminate a repeating char (-) leading up to a needed sequence
 
I have a file that contains a number of lines of DNA sequences like the single (yet very long) line below. There are far more leading and trailing dashes than are needed in the file (however, some are needed, so I can't just delete all leading and trailing dashes).

Therefore, I want to count the number of dashes leading up to the first base (an A, C, T, or G) and the number of dashes following the last base, for every line in the file. I then want to find the smallest leading count and the smallest trailing count among all the lines, and remove that many leading and trailing dashes from each line in the file.

So basically, let's say the smallest number of trailing dashes among all lines in the file is 300 and the smallest number of leading dashes among all lines in the file is 500. I then want to remove 500 dashes from the beginning of every line and 300 dashes from the end of every line. I hope this explanation is clear.
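To make the two counts concrete, here is a minimal sketch for a single line, using shell parameter expansion on toy data. The function names `count_leading` and `count_trailing` are made up for illustration, and the sketch assumes the gap character is a plain ASCII hyphen.

```shell
#!/bin/bash
# Hypothetical helpers; the names are illustrative and not from the thread.

# Print the number of dashes before the first non-dash character.
count_leading() {
    local line lead
    line=$1
    lead=${line%%[!-]*}     # remove everything from the first non-dash onward
    echo "${#lead}"
}

# Print the number of dashes after the last non-dash character.
count_trailing() {
    local line trail
    line=$1
    trail=${line##*[!-]}    # remove everything up to the last non-dash
    echo "${#trail}"
}

count_leading  '---AC-GT--'    # prints 3
count_trailing '---AC-GT--'    # prints 2
```

Dashes inside the sequence (like the one between `AC` and `GT`) are not counted, only the runs at either end.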

Thanks in advance,
T

Code:

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------CTCTCCC-TCTCCC-----TCTCCC-CCTCCC----CC---TCCCCCTC---TCCCTC-TC-C-------CCACGGTCTCCCTCTGATGCCC---------------------------------AGCCGAAGCTGGACGGTACTGCTGCCATCTCG--------GCTCACTGCAACCTCCCTGCCT-----------GATTCTCCTGCCTCA-GCCTGCCGAGTGCC
TGCGA---TT-GCAGGCGCGCGCCGCCACGCCTGACTGG-TTTTCATATTTTTTT-----GGTGGAGACGGGGTTTCCCCGTGTTGGCCGGGCTGGTCTCCAGCTCCTAACCGCGAGTGATCC-GCCAGCCTTGGCCTCCCG-AGGTGCCAGGATT-GCAGACGGAGTCTCGT-----TCACTC----AGTGCTCAA--TGGTG---CCCAGG-CTGGAGTGCAGTGGCA-TGATCTCGGCTG-GCTACAACCTCCACCT-------CCCAGCAGCCTG-CCTTGGCCTCCC-AAAGTGCCGAGATT-GCA-----------GCCTCTGCCCGGCCGCCACCCCGTCTGGGAAGTGA-----GGAGCGTCTCCGCCTGGCCACCCA-TCG-------TCTG-GGATGTGAGGAG---------CCCC-TCTGCCTGGCTGC---CCA--GTCTGGAAA----------------------------------------GTGAGGAGCGTCTCTGCCCGG-CCGCCATCCCATCTAGGAAGTGAGGAGC---------------------------------------------GCCTCTTCCCAGCCG----CCATCACATCTGGGAAGTGAGGA--------GCGTCTCTGCCCGGCCGC---CCATCGTCTGAGAGGTGGGGAGCACCTCTGCCCTGCCGC---------------------------------------------------CCCATCTGGGATGTGAGGAGCGTCTCTGCCCGGC----------------------------------------------------------------------------------------------------------------------------------CGCCCCATCTGAGAA----GTGAGGAGCC------CCTCCG-CCTGGCAGCCGCCCCGTCTGAGAAG----TGAGG---------AGCCCCTCCGCC---------------------------------------------------CAGC-AGCCACCCCGTCTGGGAAGT----------------------------------GAGGAGCGTCTCCGCCTG--------GC-AGCCACCTC---------------------------------------------------------------------------------------------------------------------------------------------------------------------------GTCCGCGAGGGAGGTAGGGGGG-------------------------TCAGCC-----------------------------CCCCGCC--CGGCCAGCCGCCCCGTCC-AGGAGG----------------------------------------------------------------------------------------------------T--GAGGGGC---GCCTCTG---CC--CGGCT--GCC-------CCTTCT--G--GGAAGTGAGGAGC------CCCTCTGCCCGGCCAGCCGCC-------------------------C-------------------------CGTCTGGGAGGGAGGTGGGGGGG--TCAGC-CCCCCG---CCCGGCCAGCCGCCCCGTCC--------GGGAGGGAGGGAGGTGGGGGGGTCAGCCCCCCGCCC-------------------------------------------------GGCCAGCCGCCCCATCCGGGAGGGAGGTGGGGGGGTC-----------------------------------------------AGCCCC-CC-GCCCGGCCAGCCGCCCCATCCGGGAAG------------------------------
---TGAGG-----GGCGCCTCT-GCCCGGC----CACCCCTACT----GGGAAGTGAGG-----------------AGCCCCTCTG-------------------------------------------------CCCGGCCAG---------------------CCGC-----------CCC-ATCCGGGAGGGAGGTGGGGGGG-------TCAGCCCCCCG-CCCGGCCAGCCGCCCCGTCCGGGAGG-GAG-GTG----------------------------GG---GAGGGGGTCAGAC--CCCCGCCCGGCCAG-------CCGCCCCTTCTGGGAGGGAGGGAGGTGGGGGGGT------CAGCCCCCC-GCCCGGCCAGCCGCCCCGTCCGGGAGGGAGGTGGGGGGGTCAGCCCCCCCACCTGGCCAGCCGCCCCGTCCGGG----AGG---GAGGTGGGGGGTCAGCCCCCC---------------------------------GCCTGGCCAGCCGCCCCGTCCGGGAGGG-----------------------------------------AGGTGGGGGGT---------CAGCC-----CCCCA-CCTGGCCAGCCGCC-CCGTACGGGAGGTGAG-GGGCGCCTT----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------TGCCCGGCC----GCCCC-TACTGGAAAGTGAGGAGCCCCTCTGCCCGGCCA-CCACCCCGTC-----------------------------------------------------------------------TGGGAGGTGTACCCAACAGCTCATTGAGAACGGGCCATGATGACAATGGCGGTTTTGTGGAATAGAAAGGGGGG-------------------------------------------------------AAAGG-TGGGGAAAA-GATTGAGAAATCGGATGGTTGCCGT-GTCTGTGTAGAAAG--AGGTAGACGTGGGAGACTTTTCATTTTGTTC---TGCACTAAGAAAAATTCTTCTGCCTTGGGATCCTGTT-----------GATC-TGTGACCTTACCCCCCAACCC-----TGTGCTCTC-TGAAACATGTGCTGTATCCA-CTCAGGG-TTG--AATGGATTAAGGGCGGTGCAAGATGTGCTTTGTTAAA-CAGATGCTTGAA-GGCAGCATGCTCTTT------AAGAGTCAT--CACCACTCCCTAATCTCAA-GT-AC-CCAGGGACACAAACACTGCG--GAAGGCCG--------CAGGGTCCTCTG-CCTAGGAAAACCAGA-GACCTTTGTTCACTTGTTT----------------ATCTGCT-----------GACCTTCCCTC--------CACTA-TTGTCCTGTGACCCTGCCAAATC----CCCC--TCTGCGAGAAACACCCAAGAATGATCAATAAAAAAAGAAAATGCAAACATAAAAAATAAAAATAAAAATAAAATGTCATCCAACATAATTAGGAGAGTACTGATTAAAGATTTCTTCTTCTGGTCAGTTGCAGTGGATCAATGCCTGTAATCCCAGCACTTTAGGAGGCTGAGGTGGGCTATTCACTTGAGGCCAGGAGTCTGAGACCAGCCTGGCCAACATGGTGAAACCCCGTCTCTACTAAAAATACAAAAAATTAGCTGGGCCTGGGGCGCAAGCTGTAATCC
CAGCTACTTGGGAGGCTGAGACAGGAGAATCACTTGAACCCGGGAGGTGAAGTTGCAGTGGGCTGAGATCGTGCCACTGCACTCCAGCCTGGGTGACAGAGAGATACTGTGTCAAAAAAAA-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

crts 04-22-2011 06:07 PM

Hi,

try this one:
Code:

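#!/bin/bash

# Smallest leading/trailing dash counts seen so far; -1 means "not set yet".
MIN_LEAD=-1
MIN_TRAIL=-1
CUR=0
# IFS= and -r keep each line exactly as it is in the file.
while IFS= read -r line; do
        # Skip lines without any base (empty or only dashes); they would
        # otherwise produce a trailing count of -1.
        [[ $line == *[^-]* ]] || continue
        # Number of dashes at the start of the line.
        CUR=$(expr "$line" : '-*')
        if [[ $MIN_LEAD = -1 ]];then
                MIN_LEAD=$CUR
        elif (( CUR < MIN_LEAD ));then
                MIN_LEAD=$CUR
        fi
        # Last non-dash character plus the trailing dashes,
        # so the trailing dash count is the length minus 1.
        CUR=$(expr "$line" : '.*\([^-]-*$\)')
        if [[ $MIN_TRAIL = -1 ]];then
                MIN_TRAIL=$(( ${#CUR} - 1 ))
        elif (( ${#CUR} - 1 < MIN_TRAIL ));then
                MIN_TRAIL=$(( ${#CUR} - 1 ))
        fi
done < file
# Strip MIN_LEAD characters from the start and MIN_TRAIL from the end of every line.
sed -r "s/^.{$MIN_LEAD}(.*).{$MIN_TRAIL}$/\1/" file

exit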

I tested it with this sample and it identified the second line correctly.
Code:

----CTCTCCC-TCTCCC-----TCTCCC-CCTCCC----CC---TCCCCCTC-----
---CLINE2-TCTCCC-----TCTCCC-CCTCCC----CC-----TCCCTGGC----

You might wanna run some more tests with shorter, dummy data before applying it to your real data.
Let me know if it worked.
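For comparison, here is a rough sketch of the same two-pass idea wrapped in a function, so it can be tried on small dummy files. It is not the script above: `trim_common_gaps` is a hypothetical name, the counting uses parameter expansion instead of `expr`, and the trimming uses awk's `substr` instead of the sed interval expression. Skipping lines that contain only dashes when computing the minima is an assumption.

```shell
#!/bin/bash
# Sketch only: trim the smallest leading/trailing dash run found in FILE
# from every line. trim_common_gaps is a made-up name.
trim_common_gaps() {
    local file min_lead min_trail line lead trail
    file=$1
    min_lead=-1
    min_trail=-1
    # Pass 1: find the smallest leading and trailing dash counts.
    while IFS= read -r line || [ -n "$line" ]; do
        case $line in
            *[!-]*) ;;          # contains at least one base: measure it
            *) continue ;;      # empty or only dashes: ignored (assumption)
        esac
        lead=${line%%[!-]*}
        trail=${line##*[!-]}
        if [ "$min_lead" -lt 0 ] || [ "${#lead}" -lt "$min_lead" ]; then
            min_lead=${#lead}
        fi
        if [ "$min_trail" -lt 0 ] || [ "${#trail}" -lt "$min_trail" ]; then
            min_trail=${#trail}
        fi
    done < "$file"
    # Nothing measurable: print the file unchanged.
    if [ "$min_lead" -lt 0 ]; then
        cat "$file"
        return
    fi
    # Pass 2: drop min_lead characters from the front and min_trail from the end.
    awk -v l="$min_lead" -v t="$min_trail" \
        '{ print substr($0, l + 1, length($0) - l - t) }' "$file"
}

# Toy data: minima are 3 leading and 4 trailing dashes.
tmp=$(mktemp)
printf '%s\n' '----ACGT-----' '---AC-GT----' > "$tmp"
trim_common_gaps "$tmp"     # prints -ACGT- and AC-GT
rm -f "$tmp"
```

Using `substr` avoids putting large repeat counts into a regular expression, which may matter when the counts run into the hundreds and the regex implementation caps interval bounds.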

tgarvin 04-22-2011 06:16 PM

Thanks for the quick response.

Your script does a good job of removing the leading and trailing dashes but I still need to keep certain leading and trailing dashes intact. What I really need is a count of the leading dashes and of the trailing dashes. Then I can pull out the smallest leading and trailing count and remove that number of dashes from each line in the file.

T

crts 04-22-2011 06:34 PM

Well, that is what the script does. It counts the leading dashes on each line, and if this number is smaller than MIN_LEAD,
then MIN_LEAD is set to the new smallest number of leading dashes. The same procedure is done for the trailing dashes.
The sed command then strips those minimal leading and trailing dashes from every line, i.e. at least one line will be stripped
of all its leading dashes and at least one line will be stripped of all its trailing dashes.
If you do not want 'naked' lines, e.g. each line has to keep at least a certain number of leading and trailing dashes,
then you can simply take out the sed and instead have it like this:
Code:

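#!/bin/bash

# Smallest leading/trailing dash counts seen so far; -1 means "not set yet".
MIN_LEAD=-1
MIN_TRAIL=-1
CUR=0
# IFS= and -r keep each line exactly as it is in the file.
while IFS= read -r line; do
        # Skip lines without any base (empty or only dashes); they would
        # otherwise produce a trailing count of -1.
        [[ $line == *[^-]* ]] || continue
        # Number of dashes at the start of the line.
        CUR=$(expr "$line" : '-*')
        if [[ $MIN_LEAD = -1 ]];then
                MIN_LEAD=$CUR
        elif (( CUR < MIN_LEAD ));then
                MIN_LEAD=$CUR
        fi
        # Last non-dash character plus the trailing dashes,
        # so the trailing dash count is the length minus 1.
        CUR=$(expr "$line" : '.*\([^-]-*$\)')
        if [[ $MIN_TRAIL = -1 ]];then
                MIN_TRAIL=$(( ${#CUR} - 1 ))
        elif (( ${#CUR} - 1 < MIN_TRAIL ));then
                MIN_TRAIL=$(( ${#CUR} - 1 ))
        fi
done < file
# Print the two minima instead of editing the file.
echo "${MIN_LEAD}"
echo "${MIN_TRAIL}"
exit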

This will output only the minimal number of leading and trailing dashes. You can also have it output the number of
leading/trailing dashes for every line:
Code:

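#!/bin/bash
# Print "<leading> <trailing>" dash counts for every line of the file.
while IFS= read -r line; do
        echo -n "$(expr "$line" : '-*')"
        # Last non-dash character plus trailing dashes; length - 1 is the dash count.
        CUR=$(expr "$line" : '.*\([^-]-*$\)')
        echo " $(( ${#CUR} - 1 ))"
done < file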

The first number is the count of leading dashes, the second one is the count of trailing dashes.
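If every line has to keep some padding, the minimum counts can be reduced by that margin before trimming. A small sketch of the arithmetic follows; the function name `trim_amount` and the margin values are hypothetical, since the thread does not say how many dashes must stay.

```shell
#!/bin/bash
# trim_amount MIN KEEP: how many dashes can be cut from one end of every line
# if each line must still have at least KEEP dashes there afterwards.
# The name and the KEEP margin are assumptions for illustration.
trim_amount() {
    local cut_n
    cut_n=$(( $1 - $2 ))
    # Never trim a negative amount.
    if [ "$cut_n" -lt 0 ]; then
        cut_n=0
    fi
    echo "$cut_n"
}

trim_amount 500 20   # prints 480
trim_amount 3 10     # prints 0
```

The result would replace MIN_LEAD or MIN_TRAIL in the trimming step, so the line with the fewest dashes ends up with exactly the margin and all other lines keep more.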

tgarvin 04-26-2011 02:18 PM

Thanks =)
It's all working

