I have a set of files containing DNA or amino acid sequences from various organisms:
The odd-numbered lines beginning with a greater-than symbol are headers with three fields separated by pipe symbols. The number in the first field is the gene ID, the four-letter abbreviation in the second field is the organism ID, and the third field is a unique sequence ID. An organism may have two or more sequences (e.g., NEOM and TEST). The lines containing only A-Z characters, question marks, and dashes are amino acid sequences (-, ?, and X represent gaps or missing data). These example sequences are very short; in reality most sequences are several hundred amino acids long.
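To make the header layout concrete, here is a minimal Python sketch that splits one header line into its three fields. The names `Header` and `parse_header` and the sample header `>1|NEOM|seq001` are hypothetical illustrations, not taken from the real files.

```python
from typing import NamedTuple


class Header(NamedTuple):
    gene_id: str
    organism_id: str
    sequence_id: str


def parse_header(line: str) -> Header:
    """Split a '>gene|organism|sequence' header line into its three fields."""
    line = line.strip()
    if not line.startswith(">"):
        raise ValueError(f"not a header line: {line!r}")
    fields = line[1:].split("|")
    if len(fields) != 3:
        raise ValueError(f"expected 3 pipe-separated fields, got {len(fields)}")
    return Header(*fields)


# Hypothetical header, for illustration only.
print(parse_header(">1|NEOM|seq001"))
```

Keeping the gene and organism IDs as strings avoids guessing at their numeric format.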
I would like to write a script to print the range of character positions between the first and last non-missing data characters on the amino acid sequence lines. Internal missing data characters flanked by A-Z characters are OK. Here's what I have in mind for the desired output:
Can anyone help me get the position of the first and last non-missing data characters (while allowing missing data characters in the middle of the sequence)? I'm sure it is a simple sed or awk command but I can't figure it out. I think I can produce the output file I want once I have figured those commands out.
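For comparison with a sed or awk one-liner, the same logic can be sketched in Python. This is only a rough sketch under two assumptions: each sequence sits on a single line, and `-`, `?`, and `X` are the only missing-data characters. The function name `data_range` and the sample sequence are hypothetical.

```python
MISSING = "-?X"  # assumed missing-data characters, per the description above


def data_range(seq):
    """Return 1-based (first, last) positions of the non-missing span.

    Missing-data characters inside the span are allowed; only leading and
    trailing ones are ignored. Returns None if the sequence has no data.
    """
    seq = seq.strip()  # drop the newline and any surrounding whitespace
    trimmed_left = seq.lstrip(MISSING)
    if not trimmed_left:
        return None
    first = len(seq) - len(trimmed_left) + 1
    last = len(seq.rstrip(MISSING))
    return first, last


# Made-up sequence, for illustration only.
print(data_range("--?MKV-?LX--"))  # → (4, 9)
```

Because `X` is treated as missing data here, an `X` at either end of the data is trimmed along with the dashes and question marks.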
My ultimate goal is to write a script that can make composite sequences from two or more non-overlapping sequences (e.g., the two sequences from NEOM). I may also want to merge sequences that partially overlap (e.g., those from TEST) but that would complicate things. Is this a logical first step for such a script or would you do it differently?
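As a rough illustration of the composite step, the Python sketch below merges aligned sequences column by column. It assumes all sequences from one organism are already aligned to the same length and that `-`, `?`, and `X` mark missing data. The name `composite` is hypothetical, and raising an error when overlapping residues disagree is just one possible policy.

```python
MISSING = "-?X"  # assumed missing-data characters


def composite(seqs):
    """Merge aligned sequences, taking the non-missing residue in each column."""
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    merged = []
    for position, column in enumerate(zip(*seqs), start=1):
        residues = {c for c in column if c not in MISSING}
        if len(residues) > 1:
            # Overlapping sequences disagree at this column.
            raise ValueError(f"conflict at position {position}: {sorted(residues)}")
        # Keep the residue if there is one, else the first missing-data character.
        merged.append(residues.pop() if residues else column[0])
    return "".join(merged)


# Made-up sequences, for illustration only.
print(composite(["MKV---", "---LTA"]))  # → MKVLTA
```

Partially overlapping sequences also merge under this scheme as long as they agree wherever both have data, so the first/last positions are mainly useful for reporting which sequences overlap.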