Need help with sed to modify only lines of text meeting certain criteria
Hi all,
I have a large text file with the format below: >MCAL_43C14_r_00 872 1 872 GGCCCTTATGGCCTTTTTTTTTTTTTTTTCAAACTTTATAAAAGCTTTAA TTGGTAGTTTGCTCCTTTAAATGGTAAAATCACAGATAAATTTATTGTGA TAATTGTCTAGATGATTTTACAAGCAGTATAAATACATAATTGTAAACTC AGTATATCTGCACAGAGAACAGAATAATTATACTTTTCGCAACTCGTTTC GACGGTAAGAATGCACCAATTATATCGTCTATGCATGGTTCTCTTTCAAA ATCTTAAAAATTGTGGGTAACTTTATTGTGTGCACGCCTGAAAGCTAGAA TGAACTAATTTCTATTGTCCATAAATTTCCTTCAAAATATAGTAGTATAT TGTAGTGACTTAAATTGGTCGTATTACATGACGTAATTGACGCCACTCCA TTGGTTGGTAATCCATTTTCAGGATGATGTTGTCCAATCACACGTTTCGG TCAGCACTTTTGGGAAATATTTCCCAGAATGCATCACATTCTTAAACGAT TAATTGATATAGACAGATGTTCTTTTTGTTCTTGCTGCAAATAATGATTC ATGAGACTATAATAATTATACATAGAACATCTTTAAATAAATGAAATTCA TGAAAATCAAACAGCAGCAACCCGCGGAGTAAAGTGCATTCTCGTCATAT TTCATACTTTGTCAGATTTATAAACTTTACTGGTATATTTGAGTTCAGTG TAGATTTTCCATCTTAGCAGTAACGATTTGCTAAATAACATAAATGAGAC ATATAAAAGCTTAATAAACGCCAACTACCAACAGATATATCTTTAAAAGC GAAAGCCAACTCTTTTGCCATTTCATCAGTTGAAATCAGCATTTCAGAGG CACTTATGTTCATGAAAAAATT >MCAL_52K01_f_00 766 1 766 TAAAAAGAAAAATATCTTGAAGTCTAAAGGTAACTTGAAACACATTTGTT GGAAAAAGTTCCTTCTTGGAGATCAGTTACCAACAGGTTTTCCAGGGATA TATGACAAATATTTCATTTGTCAGCGCTTTGCACAGGATGATACGTATAT AGGATCCACGTCTAGGAAGGCTTTCTATGGTACTGTTTATGATACAAAAC AACGTGTTCCTCTGATGTCTTTTGGAAGACTAAGAAACCTCTCAGATACT TCAAAACCACTAATGAAGTTTATGATTGAAAAGGGTTTGGTGTCAACTAA AAAGCATAAAAGTGTGGTATCGACAGTATACAACTGGCTGAATGGTGCAG AAGGAAAAGGAATGTTCTACGACAATGGTGAAATTTCAGTCTGTAATCTA GGTCAATTTCAGGCTGTAAACACCGATTATGATACCTCAGAGTACAAAAT GCAACAACTTTTACCACACAGTCTCACAGGAAATGACGTAAGAGAGAAAA TAGCAACATACACACTGACTAATACGGCGCCAATTCACACATCACTACAT GGAATGTGGGAAACTGCTTTGTCAACTGCGCGTACTTTCGCTGTCGAAAA GTGTGGAATTCCAGTACTTTTAAATCCCGTGAGGAGACAACGAAACAGAG TATCACGTGACCATCCAGAGATGTATGTAATATCAGGTGCGGTATCATTA AACGATGCTGATAGCACAATAGGGAATGGGGTAGCTGTTCCATATCTATT TTGGTTCGCAGGATGC I am trying to remove the trailing 3 sets of spaces and numbers (e.g., " 872 1 872") such that the first one would be renamed >MCAL_43C14_r_00 and the second one would be renamed >MCAL_52K01_f_00. 
I know sed is at least in part the tool for the job, but I'm stuck. I must not have the search pattern formatted correctly, but I can't figure out what is wrong with it. Also, how do I ask sed to leave the characters I like untouched?
Code:
sed -n 's/>MCAL_[0-9][0-9][A-Z][0-9][0-9]_[fr]_[0-9][0-9] [0-9][0-9][0-9] 1 [0-9][0-9][0-9]/>MCAL_[0-9][0-9][A-Z][0-9][0-9]_[fr]_[0-9][0-9]/g' Mytilus_californianus.txt
Thanks!!!
Kevin |
Provided that all the lines are formatted as above, this should work:
Code:
sed -e 's/ .\+//' Mytilus_californianus.txt
HTH
Forrest |
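Does this do what you need?
Code:
awk '{print $1}' file > output
edit: or even (cut splits on tabs by default, so the space delimiter has to be given with -d)
Code:
cut -d' ' -f1 < file > output |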
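A quick way to check both commands is to run them on a small sample first. The sketch below assumes a made-up two-record file called sample.txt with shortened sequences, in which only the header lines contain spaces, as in the original post.

```shell
# Build a small sample in a scratch directory (file name and sequences are made up).
workdir=$(mktemp -d)
printf '%s\n' \
  '>MCAL_43C14_r_00 872 1 872' \
  'GGCCCTTATGGCC' \
  '>MCAL_52K01_f_00 766 1 766' \
  'TAAAAAGAAAAAT' > "$workdir/sample.txt"

# awk splits each line on runs of whitespace; $1 is the first field.
awk_out=$(awk '{print $1}' "$workdir/sample.txt")

# cut needs the space delimiter spelled out; lines without a space pass through whole.
cut_out=$(cut -d' ' -f1 < "$workdir/sample.txt")

printf 'awk:\n%s\ncut:\n%s\n' "$awk_out" "$cut_out"
```

Both variables should hold the headers without the trailing numbers, with the sequence lines unchanged.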
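Several things: the sed command you have there won't work as written. The "-n" option suppresses all output unless a command explicitly prints it, so drop it. The replacement side of s/// is literal text, so bracket expressions such as [0-9] placed there are inserted verbatim instead of being matched; it is simpler to match only the part you want to remove. The "-e" parameter just before the regular expression is optional when there is only one expression, though it does no harm. Also, your sed command will just print out the changes, but not actually commit them. For that you will need to supply the "-i" parameter ("edit in-place"), and definitely with a filename suffix so that a backup of the original is kept. Your sed command should look something like this:
Code:
sed -i.orig -e 's/ [0-9][0-9][0-9] 1 [0-9][0-9][0-9]$//' Mytilus_californianus.txt |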
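Because -i rewrites the file, it can be worth trying the idea on a throwaway copy first. The sketch below assumes GNU sed and a made-up file name, demo.txt. It deletes only the trailing " NNN 1 NNN" part, then checks that the edited file has a clean header and that the .orig backup still holds the original text.

```shell
# Scratch copy with one shortened, made-up record.
workdir=$(mktemp -d)
printf '%s\n' '>MCAL_43C14_r_00 872 1 872' 'GGCCCTTATGGCC' > "$workdir/demo.txt"

# Edit in place; the untouched original is saved next to it as demo.txt.orig.
sed -i.orig -e 's/ [0-9][0-9][0-9] 1 [0-9][0-9][0-9]$//' "$workdir/demo.txt"

edited=$(cat "$workdir/demo.txt")
backup=$(cat "$workdir/demo.txt.orig")
printf 'edited:\n%s\nbackup:\n%s\n' "$edited" "$backup"
```

If the edited file looks wrong, the backup can simply be moved back over it.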
First, be sure that you identify what the general pattern is---either the one to keep or the one to discard.
Let's suppose that we use this for the pattern to KEEP: ">MCAL_", any two digits, an uppercase letter, any two digits, "_", "f" or "r", "_", any two digits.
You can use a "backreference" to find any line containing the pattern and then discard everything except the pattern. The general form (assumes the pattern is at the beginning of the line) matches the pattern plus everything following it, and replaces all of that with the pattern alone:
Code:
sed 's/\(pattern\).*/\1/' filename > newfilename
With the pattern for this file filled in:
Code:
sed 's/\(>MCAL_[0-9][0-9][A-Z][0-9][0-9]_[fr]_[0-9][0-9]\).*/\1/' filename > newfilename |
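To see the backreference at work, the sketch below runs the same substitution on a made-up sample (shortened sequences, hypothetical file names). The ">OTHER_seq" header is invented for illustration, to show that a line not matching the captured pattern passes through unchanged.

```shell
# Made-up sample: two matching headers, one header that does not match the pattern.
workdir=$(mktemp -d)
printf '%s\n' \
  '>MCAL_43C14_r_00 872 1 872' \
  'GGCCCTTATGGCC' \
  '>OTHER_seq 10 1 10' \
  '>MCAL_52K01_f_00 766 1 766' \
  'TAAAAAGAAAAAT' > "$workdir/headers.txt"

# \( ... \) captures the part to keep, .* swallows the rest of the line,
# and \1 in the replacement puts the captured text back.
sed 's/\(>MCAL_[0-9][0-9][A-Z][0-9][0-9]_[fr]_[0-9][0-9]\).*/\1/' \
  "$workdir/headers.txt" > "$workdir/headers.clean.txt"

cleaned=$(cat "$workdir/headers.clean.txt")
printf '%s\n' "$cleaned"
```

Being this specific is the trade-off against the shorter "everything after the first space" approach: only headers of the expected shape are touched.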
There's a couple of comments in this thread about the -e flag in sed. Note that -e is needed only when passing multiple expressions as separate arguments within the same sed invocation; a single script, or several commands joined with ";", works without it.
From the Grymoire tutorial: http://www.grymoire.com/Unix/Sed.html#uh-13 |
Quote:
Code:
sed ' |
Quote:
Code:
sed 's/xxx/yyy/;s/zzz/www/' |
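A small check of the point about -e: the three invocations below apply the same two substitutions to the same made-up input, once with two -e options, once with ";" and once with a newline inside a single script. This is only a sketch; the variable names are arbitrary.

```shell
input='xxx zzz'

# Two expressions passed as separate -e arguments.
with_e=$(printf '%s\n' "$input" | sed -e 's/xxx/yyy/' -e 's/zzz/www/')

# The same two commands in one script, separated by ';' (no -e needed).
with_semicolon=$(printf '%s\n' "$input" | sed 's/xxx/yyy/;s/zzz/www/')

# The same two commands in one script, separated by a newline (no -e needed).
with_newline=$(printf '%s\n' "$input" | sed 's/xxx/yyy/
s/zzz/www/')

printf '%s\n' "$with_e" "$with_semicolon" "$with_newline"
```

All three should print the same line, with both words replaced.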
Thank you all for your help!
Forrestt, can you explain what the pattern in the search term of the sed command you gave means? I realize that it translates to "everything after the space" but I don't understand how. The command was:
Code:
sed -e 's/ .\+//' |
The s/// means substitute. The "." means match any character. The "+" means one or more times, but it must be escaped so that it isn't read as an actual "+" sign (the "\+" form is a GNU sed extension to basic regular expressions). You could have also used a "*" instead of the "+". It means zero or more times.
So, you get: substitute a space followed by any character one or more times, and replace it with an empty string. Because the match starts at the first space and ".\+" takes as much as it can, this removes the first space and everything after it.
HTH
Forrest |
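The difference between "\+" and "*" only shows up when nothing follows the space. A minimal sketch, assuming GNU sed (where "\+" is available in basic regular expressions) and a made-up second input, "name " with a trailing space:

```shell
header='>MCAL_43C14_r_00 872 1 872'

# A header with fields after the name: both patterns remove " 872 1 872".
plus_full=$(printf '%s\n' "$header" | sed 's/ .\+//')
star_full=$(printf '%s\n' "$header" | sed 's/ .*//')

# A line ending in a single space: ".\+" needs at least one character after
# the space, so nothing matches; ".*" accepts zero characters and removes the space.
plus_bare=$(printf '%s\n' 'name ' | sed 's/ .\+//')
star_bare=$(printf '%s\n' 'name ' | sed 's/ .*//')

# Brackets make the surviving trailing space visible.
printf '[%s] [%s] [%s] [%s]\n' "$plus_full" "$star_full" "$plus_bare" "$star_bare"
```

For the headers in this thread the two forms behave the same, since every header has text after the first space.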
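If you are confused by the escape (\), you can also turn on extended regular expressions using the -r flag (GNU sed; -E does the same and is the more portable spelling), where "+" needs no backslash. The flag has to come before -e, because -e takes the very next argument as its script.
Thus, this works:
Code:
sed -r -e 's/ .+//'
So, I guess there really is no way to escape the escape.......;)
NO--WAIT: There is: If you escape an escape, then it's not an escape......e.g. "\\" means a literal "\"
Are you confused yet?......;) |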
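The sketch below, assuming GNU sed, checks that the escaped basic-regex form and the extended-regex form give the same result, and that "\\" in a pattern matches one literal backslash. The input strings and variable names are made up for the test.

```shell
line='>MCAL_52K01_f_00 766 1 766'

# Basic regular expressions: "+" has to be written "\+" (a GNU extension).
bre=$(printf '%s\n' "$line" | sed 's/ .\+//')

# Extended regular expressions: "+" is a quantifier without the backslash.
ere=$(printf '%s\n' "$line" | sed -E 's/ .+//')

# "\\" in the pattern stands for a single literal backslash.
unslashed=$(printf '%s\n' 'a\b' | sed 's/\\/-/')

printf '%s\n' "$bre" "$ere" "$unslashed"
```

The first two lines of output should be identical; the third shows the backslash replaced by a hyphen.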