LinuxQuestions.org
Old 04-22-2011, 04:51 PM   #1
tgarvin
LQ Newbie
 
Registered: Apr 2011
Posts: 3

How to count and eliminate a repeating char (-) leading up to a needed sequence


I have a file that contains a number of lines of DNA sequences like the single (yet very long) line below. There are far more leading and trailing dashes than are needed in the file (however, some are needed, so I can't just delete all leading and trailing dashes).

Therefore, I want to count the number of dashes leading up to the first base (either an A, C, T, or G) for every line in the file. I then want to remove from each line the smallest number of leading and trailing dashes found among all the lines.

So basically, let's say the smallest number of trailing dashes among all lines in the file is 300 and the smallest number of leading dashes among all lines in the file is 500. I then want to remove 500 dashes from the beginning of every line and 300 dashes from the end of every line. I hope this explanation is clear.

Thanks in advance,
T

Code:
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------CTCTCCC-TCTCCC-----TCTCCC-CCTCCC----CC---TCCCCCTC---TCCCTC-TC-C-------CCACGGTCTCCCTCTGATGCCC---------------------------------AGCCGAAGCTGGACGGTACTGCTGCCATCTCG--------GCTCACTGCAACCTCCCTGCCT-----------GATTCTCCTGCCTCA-GCCTGCCGAGTGCC
TGCGA---TT-GCAGGCGCGCGCCGCCACGCCTGACTGG-TTTTCATATTTTTTT-----GGTGGAGACGGGGTTTCCCCGTGTTGGCCGGGCTGGTCTCCAGCTCCTAACCGCGAGTGATCC-GCCAGCCTTGGCCTCCCG-AGGTGCCAGGATT-GCAGACGGAGTCTCGT-----TCACTC----AGTGCTCAA--TGGTG---CCCAGG-CTGGAGTGCAGTGGCA-TGATCTCGGCTG-GCTACAACCTCCACCT-------CCCAGCAGCCTG-CCTTGGCCTCCC-AAAGTGCCGAGATT-GCA-----------GCCTCTGCCCGGCCGCCACCCCGTCTGGGAAGTGA-----GGAGCGTCTCCGCCTGGCCACCCA-TCG-------TCTG-GGATGTGAGGAG---------CCCC-TCTGCCTGGCTGC---CCA--GTCTGGAAA----------------------------------------GTGAGGAGCGTCTCTGCCCGG-CCGCCATCCCATCTAGGAAGTGAGGAGC---------------------------------------------GCCTCTTCCCAGCCG----CCATCACATCTGGGAAGTGAGGA--------GCGTCTCTGCCCGGCCGC---CCATCGTCTGAGAGGTGGGGAGCACCTCTGCCCTGCCGC---------------------------------------------------CCCATCTGGGATGTGAGGAGCGTCTCTGCCCGGC----------------------------------------------------------------------------------------------------------------------------------CGCCCCATCTGAGAA----GTGAGGAGCC------CCTCCG-CCTGGCAGCCGCCCCGTCTGAGAAG----TGAGG---------AGCCCCTCCGCC---------------------------------------------------CAGC-AGCCACCCCGTCTGGGAAGT----------------------------------GAGGAGCGTCTCCGCCTG--------GC-AGCCACCTC---------------------------------------------------------------------------------------------------------------------------------------------------------------------------GTCCGCGAGGGAGGTAGGGGGG-------------------------TCAGCC-----------------------------CCCCGCC--CGGCCAGCCGCCCCGTCC-AGGAGG----------------------------------------------------------------------------------------------------T--GAGGGGC---GCCTCTG---CC--CGGCT--GCC-------CCTTCT--G--GGAAGTGAGGAGC------CCCTCTGCCCGGCCAGCCGCC-------------------------C-------------------------CGTCTGGGAGGGAGGTGGGGGGG--TCAGC-CCCCCG---CCCGGCCAGCCGCCCCGTCC--------GGGAGGGAGGGAGGTGGGGGGGTCAGCCCCCCGCCC-------------------------------------------------GGCCAGCCGCCCCATCCGGGAGGGAGGTGGGGGGGTC-----------------------------------------------AGCCCC-CC-GCCCGGCCAGCCGCCCCATCCGGGAAG------------------------------
---TGAGG-----GGCGCCTCT-GCCCGGC----CACCCCTACT----GGGAAGTGAGG-----------------AGCCCCTCTG-------------------------------------------------CCCGGCCAG---------------------CCGC-----------CCC-ATCCGGGAGGGAGGTGGGGGGG-------TCAGCCCCCCG-CCCGGCCAGCCGCCCCGTCCGGGAGG-GAG-GTG----------------------------GG---GAGGGGGTCAGAC--CCCCGCCCGGCCAG-------CCGCCCCTTCTGGGAGGGAGGGAGGTGGGGGGGT------CAGCCCCCC-GCCCGGCCAGCCGCCCCGTCCGGGAGGGAGGTGGGGGGGTCAGCCCCCCCACCTGGCCAGCCGCCCCGTCCGGG----AGG---GAGGTGGGGGGTCAGCCCCCC---------------------------------GCCTGGCCAGCCGCCCCGTCCGGGAGGG-----------------------------------------AGGTGGGGGGT---------CAGCC-----CCCCA-CCTGGCCAGCCGCC-CCGTACGGGAGGTGAG-GGGCGCCTT----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------TGCCCGGCC----GCCCC-TACTGGAAAGTGAGGAGCCCCTCTGCCCGGCCA-CCACCCCGTC-----------------------------------------------------------------------TGGGAGGTGTACCCAACAGCTCATTGAGAACGGGCCATGATGACAATGGCGGTTTTGTGGAATAGAAAGGGGGG-------------------------------------------------------AAAGG-TGGGGAAAA-GATTGAGAAATCGGATGGTTGCCGT-GTCTGTGTAGAAAG--AGGTAGACGTGGGAGACTTTTCATTTTGTTC---TGCACTAAGAAAAATTCTTCTGCCTTGGGATCCTGTT-----------GATC-TGTGACCTTACCCCCCAACCC-----TGTGCTCTC-TGAAACATGTGCTGTATCCA-CTCAGGG-TTG--AATGGATTAAGGGCGGTGCAAGATGTGCTTTGTTAAA-CAGATGCTTGAA-GGCAGCATGCTCTTT------AAGAGTCAT--CACCACTCCCTAATCTCAA-GT-AC-CCAGGGACACAAACACTGCG--GAAGGCCG--------CAGGGTCCTCTG-CCTAGGAAAACCAGA-GACCTTTGTTCACTTGTTT----------------ATCTGCT-----------GACCTTCCCTC--------CACTA-TTGTCCTGTGACCCTGCCAAATC----CCCC--TCTGCGAGAAACACCCAAGAATGATCAATAAAAAAAGAAAATGCAAACATAAAAAATAAAAATAAAAATAAAATGTCATCCAACATAATTAGGAGAGTACTGATTAAAGATTTCTTCTTCTGGTCAGTTGCAGTGGATCAATGCCTGTAATCCCAGCACTTTAGGAGGCTGAGGTGGGCTATTCACTTGAGGCCAGGAGTCTGAGACCAGCCTGGCCAACATGGTGAAACCCCGTCTCTACTAAAAATACAAAAAATTAGCTGGGCCTGGGGCGCAAGCTGTAATCC
CAGCTACTTGGGAGGCTGAGACAGGAGAATCACTTGAACCCGGGAGGTGAAGTTGCAGTGGGCTGAGATCGTGCCACTGCACTCCAGCCTGGGTGACAGAGAGATACTGTGTCAAAAAAAA-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Last edited by tgarvin; 04-26-2011 at 03:19 PM. Reason: [SOLVED]
 
Old 04-22-2011, 07:07 PM   #2
crts
Senior Member
 
Registered: Jan 2010
Posts: 1,845

Hi,

try this one:
Code:
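#!/bin/bash
# Find the smallest number of leading and of trailing dashes over all lines
# of "file", then strip that many characters from both ends of every line.

MIN_LEAD=-1
MIN_TRAIL=-1
CUR=0
while IFS= read -r line; do
	# expr prints the length of the match, i.e. the number of leading dashes
	CUR=$(expr "$line" : '-*')
	if [[ $MIN_LEAD = -1 ]];then
		MIN_LEAD=$CUR
	elif (( CUR < MIN_LEAD ));then
		MIN_LEAD=$CUR
	fi
	# capture the last non-dash character plus the trailing dashes;
	# the capture length minus 1 is the number of trailing dashes
	CUR=$(expr "$line" : '.*\([^-]-*$\)')
	if [[ $MIN_TRAIL = -1 ]];then
		MIN_TRAIL=$(( ${#CUR} - 1 ))
	elif (( ${#CUR} - 1 < MIN_TRAIL ));then
		MIN_TRAIL=$(( ${#CUR} - 1 ))
	fi
done < file
# remove MIN_LEAD characters from the start and MIN_TRAIL from the end
sed -r "s/^.{$MIN_LEAD}(.*).{$MIN_TRAIL}$/\1/" file

exit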
I tested it with this sample and it correctly identified the second line as the one with the fewest leading and trailing dashes.
Code:
----CTCTCCC-TCTCCC-----TCTCCC-CCTCCC----CC---TCCCCCTC-----
---CLINE2-TCTCCC-----TCTCCC-CCTCCC----CC-----TCCCTGGC----
You might wanna run some more tests with shorter, dummy data before applying it to your real data.
Let me know if it worked.
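For quick tests on dummy data like the sample above, the counting step can also be sketched with bash parameter expansion instead of expr. The helper names count_lead and count_trail are made up for this illustration and are not part of the script above. One difference worth noting: the expr pattern for trailing dashes needs at least one non-dash character, so a line consisting only of dashes would give -1 there, whereas this sketch reports the full line length for such a line.

```shell
#!/bin/bash
# Hypothetical helpers (names invented for this sketch).

# Print the number of leading dashes in $1.
count_lead() {
	local s=$1
	local lead=${s%%[!-]*}    # cut everything from the first non-dash onward
	printf '%s\n' "${#lead}"
}

# Print the number of trailing dashes in $1.
count_trail() {
	local s=$1
	local trail=${s##*[!-]}   # cut everything up to the last non-dash
	printf '%s\n' "${#trail}"
}

# Demo on the two sample lines from this post.
while IFS= read -r line; do
	printf '%s %s\n' "$(count_lead "$line")" "$(count_trail "$line")"
done <<'EOF'
----CTCTCCC-TCTCCC-----TCTCCC-CCTCCC----CC---TCCCCCTC-----
---CLINE2-TCTCCC-----TCTCCC-CCTCCC----CC-----TCCCTGGC----
EOF
# prints:
# 4 5
# 3 4
```

The second line has the smaller count on both sides (3 and 4), which matches the test result reported above.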
 
Old 04-22-2011, 07:16 PM   #3
tgarvin
LQ Newbie
 
Registered: Apr 2011
Posts: 3

Original Poster
Thanks for the quick response.

Your script does a good job of removing the leading and trailing dashes but I still need to keep certain leading and trailing dashes intact. What I really need is a count of the leading dashes and of the trailing dashes. Then I can pull out the smallest leading and trailing count and remove that number of dashes from each line in the file.

T
 
Old 04-22-2011, 07:34 PM   #4
crts
Senior Member
 
Registered: Jan 2010
Posts: 1,845

Quote:
Originally Posted by tgarvin
Thanks for the quick response.

Your script does a good job of removing the leading and trailing dashes but I still need to keep certain leading and trailing dashes intact. What I really need is a count of the leading dashes and of the trailing dashes. Then I can pull out the smallest leading and trailing count and remove that number of dashes from each line in the file.

T
Well, that is what the script does. It counts the leading dashes and if this number is smaller than MIN_LEAD then MIN_LEAD
is set to the new smallest number of leading dashes. The same procedure is done for the trailing dashes.
The sed command then strips those minimal leading and trailing dashes from every line, i.e. at least one line will be stripped
of all its leading dashes and at least one line will be stripped of all its trailing dashes.
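To make that behaviour concrete, here is a minimal two-pass sketch of the same idea on a tiny input. It is not the script from this thread: the function name strip_common is invented, it uses bash parameter expansion and substring slicing rather than expr and sed, and it assumes every line contains at least one base.

```shell
#!/bin/bash
# Sketch only; strip_common is a hypothetical name.
# Assumes every line of the file contains at least one non-dash character.
strip_common() {
	local file=$1 line lead trail min_lead=-1 min_trail=-1
	# Pass 1: find the smallest leading and trailing dash counts.
	while IFS= read -r line || [[ -n $line ]]; do
		lead=${line%%[!-]*}      # leading run of dashes
		trail=${line##*[!-]}     # trailing run of dashes
		(( min_lead  < 0 || ${#lead}  < min_lead  )) && min_lead=${#lead}
		(( min_trail < 0 || ${#trail} < min_trail )) && min_trail=${#trail}
	done < "$file"
	# Pass 2: cut min_lead characters off the front and min_trail off the end.
	while IFS= read -r line || [[ -n $line ]]; do
		printf '%s\n' "${line:min_lead:${#line}-min_lead-min_trail}"
	done < "$file"
}
```

For a file holding the two lines `--AC-G---` and `---TT--`, the minima are 2 leading and 2 trailing dashes, so `strip_common` prints `AC-G-` and `-TT`: the first line loses all of its leading dashes and the second loses all of its trailing dashes, while the extra dashes on the other side are kept.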
If you do not want 'naked' lines, i.e. each line has to keep at least a certain number of leading and trailing dashes,
then you can simply take out the sed and instead have it like this:
Code:
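#!/bin/bash
# Print only the smallest number of leading and of trailing dashes
# found over all lines of "file"; the file itself is not modified.

MIN_LEAD=-1
MIN_TRAIL=-1
CUR=0
while IFS= read -r line; do
	# expr prints the length of the match, i.e. the number of leading dashes
	CUR=$(expr "$line" : '-*')
	if [[ $MIN_LEAD = -1 ]];then
		MIN_LEAD=$CUR
	elif (( CUR < MIN_LEAD ));then
		MIN_LEAD=$CUR
	fi
	# capture the last non-dash character plus the trailing dashes;
	# the capture length minus 1 is the number of trailing dashes
	CUR=$(expr "$line" : '.*\([^-]-*$\)')
	if [[ $MIN_TRAIL = -1 ]];then
		MIN_TRAIL=$(( ${#CUR} - 1 ))
	elif (( ${#CUR} - 1 < MIN_TRAIL ));then
		MIN_TRAIL=$(( ${#CUR} - 1 ))
	fi
done < file
echo "${MIN_LEAD}"
echo "${MIN_TRAIL}"
exit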
This will output only the minimal number of leading and trailing dashes. You can also have it output the leading/trailing
dash counts for every line:
Code:
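#!/bin/bash
# Print "<leading dashes> <trailing dashes>" for every line of "file".
while IFS= read -r line; do
	echo -n "$(expr "$line" : '-*')"
	# last non-dash character plus trailing dashes; length minus 1 = trailing count
	CUR=$(expr "$line" : '.*\([^-]-*$\)')
	echo " $(( ${#CUR} - 1 ))"
done < file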
The first number on each line is the count of leading dashes, the second is the count of trailing dashes.
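If the per-line counts are wanted first and the minima only afterwards, the two-column output can be reduced with awk. This is a small sketch under that assumption; min_counts is a hypothetical helper name, and it expects one "lead trail" pair per input line.

```shell
#!/bin/bash
# min_counts (hypothetical helper): read "lead trail" pairs on stdin and
# print the smallest value seen in each column.
min_counts() {
	awk 'NR == 1            { lead = $1 + 0; trail = $2 + 0 }
	     $1 + 0 < lead      { lead = $1 + 0 }
	     $2 + 0 < trail     { trail = $2 + 0 }
	     END                { if (NR) print lead, trail }'
}
```

For example, `printf '4 5\n3 4\n10 2\n' | min_counts` prints `3 2`. The per-line script above could be piped into it in the same way.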
 
Old 04-26-2011, 03:18 PM   #5
tgarvin
LQ Newbie
 
Registered: Apr 2011
Posts: 3

Original Poster
Thanks =)
It's all working
 
  

