How to count and eliminate a repeating char (-) leading up to a needed sequence
I have a file that contains a number of lines of DNA sequences like the single (yet very long) line below. There are far more leading and trailing dashes than are needed in the file (however, some are needed, so I can't just delete all leading and trailing dashes).
Therefore, I want to count the number of dashes leading up to the first base (A, C, T, or G) for every line in the file, and likewise the number of dashes after the last base. I then want to remove the smallest number of leading and trailing dashes found among all the lines from each line in the file. So, let's say the smallest number of trailing dashes among all lines in the file is 300 and the smallest number of leading dashes is 500. I then want to remove 500 dashes from the beginning of every line and 300 dashes from the end of every line. I hope this explanation is clear.
Thanks in advance,
T
Code:
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------CTCTCCC-TCTCCC-----TCTCCC-CCTCCC----CC---TCCCCCTC---TCCCTC-TC-C-------CCACGGTCTCCCTCTGATGCCC---------------------------------AGCCGAAGCTGGACGGTACTGCTGCCATCTCG--------GCTCACTGCAACCTCCCTGCCT-----------GATTCTCCTGCCTCA-GCCTGCCGAGTGCC
TGCGA---TT-GCAGGCGCGCGCCGCCACGCCTGACTGG-TTTTCATATTTTTTT-----GGTGGAGACGGGGTTTCCCCGTGTTGGCCGGGCTGGTCTCCAGCTCCTAACCGCGAGTGATCC-GCCAGCCTTGGCCTCCCG-AGGTGCCAGGATT-GCAGACGGAGTCTCGT-----TCACTC----AGTGCTCAA--TGGTG---CCCAGG-CTGGAGTGCAGTGGCA-TGATCTCGGCTG-GCTACAACCTCCACCT-------CCCAGCAGCCTG-CCTTGGCCTCCC-AAAGTGCCGAGATT-GCA-----------GCCTCTGCCCGGCCGCCACCCCGTCTGGGAAGTGA-----GGAGCGTCTCCGCCTGGCCACCCA-TCG-------TCTG-GGATGTGAGGAG---------CCCC-TCTGCCTGGCTGC---CCA--GTCTGGAAA----------------------------------------GTGAGGAGCGTCTCTGCCCGG-CCGCCATCCCATCTAGGAAGTGAGGAGC---------------------------------------------GCCTCTTCCCAGCCG----CCATCACATCTGGGAAGTGAGGA--------GCGTCTCTGCCCGGCCGC---CCATCGTCTGAGAGGTGGGGAGCACCTCTGCCCTGCCGC---------------------------------------------------CCCATCTGGGATGTGAGGAGCGTCTCTGCCCGGC----------------------------------------------------------------------------------------------------------------------------------CGCCCCATCTGAGAA----GTGAGGAGCC------CCTCCG-CCTGGCAGCCGCCCCGTCTGAGAAG----TGAGG---------AGCCCCTCCGCC---------------------------------------------------CAGC-AGCCACCCCGTCTGGGAAGT----------------------------------GAGGAGCGTCTCCGCCTG--------GC-AGCCACCTC---------------------------------------------------------------------------------------------------------------------------------------------------------------------------GTCCGCGAGGGAGGTAGGGGGG-------------------------TCAGCC-----------------------------CCCCGCC--CGGCCAGCCGCCCCGTCC-AGGAGG----------------------------------------------------------------------------------------------------T--GAGGGGC---GCCTCTG---CC--CGGCT--GCC-------CCTTCT--G--GGAAGTGAGGAGC------CCCTCTGCCCGGCCAGCCGCC-------------------------C-------------------------CGTCTGGGAGGGAGGTGGGGGGG--TCAGC-CCCCCG---CCCGGCCAGCCGCCCCGTCC--------GGGAGGGAGGGAGGTGGGGGGGTCAGCCCCCCGCCC-------------------------------------------------GGCCAGCCGCCCCATCCGGGAGGGAGGTGGGGGGGTC-----------------------------------------------AGCCCC-CC-GCCCGGCCAGCCGCCCCATCCGGGAAG------------------------------
---TGAGG-----GGCGCCTCT-GCCCGGC----CACCCCTACT----GGGAAGTGAGG-----------------AGCCCCTCTG-------------------------------------------------CCCGGCCAG---------------------CCGC-----------CCC-ATCCGGGAGGGAGGTGGGGGGG-------TCAGCCCCCCG-CCCGGCCAGCCGCCCCGTCCGGGAGG-GAG-GTG----------------------------GG---GAGGGGGTCAGAC--CCCCGCCCGGCCAG-------CCGCCCCTTCTGGGAGGGAGGGAGGTGGGGGGGT------CAGCCCCCC-GCCCGGCCAGCCGCCCCGTCCGGGAGGGAGGTGGGGGGGTCAGCCCCCCCACCTGGCCAGCCGCCCCGTCCGGG----AGG---GAGGTGGGGGGTCAGCCCCCC---------------------------------GCCTGGCCAGCCGCCCCGTCCGGGAGGG-----------------------------------------AGGTGGGGGGT---------CAGCC-----CCCCA-CCTGGCCAGCCGCC-CCGTACGGGAGGTGAG-GGGCGCCTT----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------TGCCCGGCC----GCCCC-TACTGGAAAGTGAGGAGCCCCTCTGCCCGGCCA-CCACCCCGTC-----------------------------------------------------------------------TGGGAGGTGTACCCAACAGCTCATTGAGAACGGGCCATGATGACAATGGCGGTTTTGTGGAATAGAAAGGGGGG-------------------------------------------------------AAAGG-TGGGGAAAA-GATTGAGAAATCGGATGGTTGCCGT-GTCTGTGTAGAAAG--AGGTAGACGTGGGAGACTTTTCATTTTGTTC---TGCACTAAGAAAAATTCTTCTGCCTTGGGATCCTGTT-----------GATC-TGTGACCTTACCCCCCAACCC-----TGTGCTCTC-TGAAACATGTGCTGTATCCA-CTCAGGG-TTG--AATGGATTAAGGGCGGTGCAAGATGTGCTTTGTTAAA-CAGATGCTTGAA-GGCAGCATGCTCTTT------AAGAGTCAT--CACCACTCCCTAATCTCAA-GT-AC-CCAGGGACACAAACACTGCG--GAAGGCCG--------CAGGGTCCTCTG-CCTAGGAAAACCAGA-GACCTTTGTTCACTTGTTT----------------ATCTGCT-----------GACCTTCCCTC--------CACTA-TTGTCCTGTGACCCTGCCAAATC----CCCC--TCTGCGAGAAACACCCAAGAATGATCAATAAAAAAAGAAAATGCAAACATAAAAAATAAAAATAAAAATAAAATGTCATCCAACATAATTAGGAGAGTACTGATTAAAGATTTCTTCTTCTGGTCAGTTGCAGTGGATCAATGCCTGTAATCCCAGCACTTTAGGAGGCTGAGGTGGGCTATTCACTTGAGGCCAGGAGTCTGAGACCAGCCTGGCCAACATGGTGAAACCCCGTCTCTACTAAAAATACAAAAAATTAGCTGGGCCTGGGGCGCAAGCTGTAATCC
CAGCTACTTGGGAGGCTGAGACAGGAGAATCACTTGAACCCGGGAGGTGAAGTTGCAGTGGGCTGAGATCGTGCCACTGCACTCCAGCCTGGGTGACAGAGAGATACTGTGTCAAAAAAAA-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
Hi,
try this one: Code:
#!/bin/bash
Code:
----CTCTCCC-TCTCCC-----TCTCCC-CCTCCC----CC---TCCCCCTC-----
Let me know if it worked. |
Thanks for the quick response.
Your script does a good job of removing the leading and trailing dashes but I still need to keep certain leading and trailing dashes intact. What I really need is a count of the leading dashes and of the trailing dashes. Then I can pull out the smallest leading and trailing count and remove that number of dashes from each line in the file. T |
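The counting asked for here can be sketched in plain bash. This is only a minimal illustration, not the script posted in this thread: `count_dashes` is a made-up helper name, and it assumes the sequences arrive one per line on standard input. For each line it prints the leading count and the trailing count, separated by a space.

```shell
#!/bin/bash
# Sketch only: count_dashes is a hypothetical helper, not from the thread.
# Reads lines on stdin and prints "<leading> <trailing>" for each one.
count_dashes() {
    local line lead trail
    while IFS= read -r line || [ -n "$line" ]; do
        # Drop everything from the first non-dash character onward;
        # what remains is the run of leading dashes.
        lead=${line%%[!-]*}
        # Drop everything up to and including the last non-dash character;
        # what remains is the run of trailing dashes.
        trail=${line##*[!-]}
        printf '%d %d\n' "${#lead}" "${#trail}"
    done
}
```

It could be called as `count_dashes < yourfile`, where `yourfile` stands in for the actual file name. Note that a line made up only of dashes reports its full length for both counts, since there is no base to stop at.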
Quote:
is set to the new smallest number of leading dashes. The same procedure is done for the trailing dashes. The sed command then strips those minimal leading and trailing dashes from every line, i.e. at least one line will be stripped of all its leading dashes and at least one line will be stripped of all its trailing dashes. If you do not want 'naked' lines, i.e. each line has to keep at least a certain number of leading and trailing dashes, then you can simply take out the sed and instead have it like this: Code:
#!/bin/bash
dashes for every line: Code:
#!/bin/bash |
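For anyone reading along, the procedure described above (track the smallest leading and trailing counts, then strip that many dashes from every line) might look roughly like the sketch below. It is not the script from this post: `trim_common_dashes` is a made-up name, the file is read twice, and bash substring expansion is used in place of the sed command mentioned above.

```shell
#!/bin/bash
# Sketch only: trim_common_dashes is a hypothetical name, not from the thread.
# Usage: trim_common_dashes FILE  (the trimmed lines go to stdout)
trim_common_dashes() {
    local file=$1 line lead trail keep min_lead=-1 min_trail=-1

    # Pass 1: find the smallest leading and trailing dash counts.
    while IFS= read -r line || [ -n "$line" ]; do
        [ -z "$line" ] && continue            # ignore empty lines
        lead=${line%%[!-]*}                   # run of leading dashes
        trail=${line##*[!-]}                  # run of trailing dashes
        if [ "$min_lead" -lt 0 ] || [ "${#lead}" -lt "$min_lead" ]; then
            min_lead=${#lead}
        fi
        if [ "$min_trail" -lt 0 ] || [ "${#trail}" -lt "$min_trail" ]; then
            min_trail=${#trail}
        fi
    done < "$file"

    # No non-empty lines were seen, so there is nothing to print.
    [ "$min_lead" -lt 0 ] && return 0

    # Pass 2: cut min_lead characters from the front and min_trail
    # characters from the end of every line.
    while IFS= read -r line || [ -n "$line" ]; do
        keep=$(( ${#line} - min_lead - min_trail ))
        [ "$keep" -lt 0 ] && keep=0           # guard for very short lines
        printf '%s\n' "${line:min_lead:keep}"
    done < "$file"
}
```

To leave a margin of dashes on every line instead of stripping down to the minimum, one option is to subtract the desired margin from `min_lead` and `min_trail` (not letting them go below zero) before the second pass.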
Thanks =)
It's all working |