LinuxQuestions.org


kmkocot 08-10-2009 02:03 PM

Need help with sed to modify only lines of text meeting certain criteria
 
Hi all,

I have a large text file with the format below:

>MCAL_43C14_r_00 872 1 872
GGCCCTTATGGCCTTTTTTTTTTTTTTTTCAAACTTTATAAAAGCTTTAA
TTGGTAGTTTGCTCCTTTAAATGGTAAAATCACAGATAAATTTATTGTGA
TAATTGTCTAGATGATTTTACAAGCAGTATAAATACATAATTGTAAACTC
AGTATATCTGCACAGAGAACAGAATAATTATACTTTTCGCAACTCGTTTC
GACGGTAAGAATGCACCAATTATATCGTCTATGCATGGTTCTCTTTCAAA
ATCTTAAAAATTGTGGGTAACTTTATTGTGTGCACGCCTGAAAGCTAGAA
TGAACTAATTTCTATTGTCCATAAATTTCCTTCAAAATATAGTAGTATAT
TGTAGTGACTTAAATTGGTCGTATTACATGACGTAATTGACGCCACTCCA
TTGGTTGGTAATCCATTTTCAGGATGATGTTGTCCAATCACACGTTTCGG
TCAGCACTTTTGGGAAATATTTCCCAGAATGCATCACATTCTTAAACGAT
TAATTGATATAGACAGATGTTCTTTTTGTTCTTGCTGCAAATAATGATTC
ATGAGACTATAATAATTATACATAGAACATCTTTAAATAAATGAAATTCA
TGAAAATCAAACAGCAGCAACCCGCGGAGTAAAGTGCATTCTCGTCATAT
TTCATACTTTGTCAGATTTATAAACTTTACTGGTATATTTGAGTTCAGTG
TAGATTTTCCATCTTAGCAGTAACGATTTGCTAAATAACATAAATGAGAC
ATATAAAAGCTTAATAAACGCCAACTACCAACAGATATATCTTTAAAAGC
GAAAGCCAACTCTTTTGCCATTTCATCAGTTGAAATCAGCATTTCAGAGG
CACTTATGTTCATGAAAAAATT
>MCAL_52K01_f_00 766 1 766
TAAAAAGAAAAATATCTTGAAGTCTAAAGGTAACTTGAAACACATTTGTT
GGAAAAAGTTCCTTCTTGGAGATCAGTTACCAACAGGTTTTCCAGGGATA
TATGACAAATATTTCATTTGTCAGCGCTTTGCACAGGATGATACGTATAT
AGGATCCACGTCTAGGAAGGCTTTCTATGGTACTGTTTATGATACAAAAC
AACGTGTTCCTCTGATGTCTTTTGGAAGACTAAGAAACCTCTCAGATACT
TCAAAACCACTAATGAAGTTTATGATTGAAAAGGGTTTGGTGTCAACTAA
AAAGCATAAAAGTGTGGTATCGACAGTATACAACTGGCTGAATGGTGCAG
AAGGAAAAGGAATGTTCTACGACAATGGTGAAATTTCAGTCTGTAATCTA
GGTCAATTTCAGGCTGTAAACACCGATTATGATACCTCAGAGTACAAAAT
GCAACAACTTTTACCACACAGTCTCACAGGAAATGACGTAAGAGAGAAAA
TAGCAACATACACACTGACTAATACGGCGCCAATTCACACATCACTACAT
GGAATGTGGGAAACTGCTTTGTCAACTGCGCGTACTTTCGCTGTCGAAAA
GTGTGGAATTCCAGTACTTTTAAATCCCGTGAGGAGACAACGAAACAGAG
TATCACGTGACCATCCAGAGATGTATGTAATATCAGGTGCGGTATCATTA
AACGATGCTGATAGCACAATAGGGAATGGGGTAGCTGTTCCATATCTATT
TTGGTTCGCAGGATGC

I am trying to remove the trailing 3 sets of spaces and numbers (e.g., " 872 1 872") such that the first one would be renamed >MCAL_43C14_r_00 and the second one would be renamed >MCAL_52K01_f_00. I know sed is at least in part the tool for the job but I'm stuck. I must not have the search pattern formatted correctly but I can't figure out what is wrong with it. Also, how do I ask sed to leave the characters I like untouched?

sed -n 's/>MCAL_[0-9][0-9][A-Z][0-9][0-9]_[fr]_[0-9][0-9] [0-9][0-9][0-9] 1 [0-9][0-9][0-9]/>MCAL_[0-9][0-9][A-Z][0-9][0-9]_[fr]_[0-9][0-9]/g' Mytilus_californianus.txt

Thanks!!!
Kevin

forrestt 08-10-2009 02:10 PM

Provided that all the lines are formatted as above, the code:

Code:

sed -e 's/ .\+//' Mytilus_californianus.txt
should do the trick.

HTH

Forrest

pwc101 08-10-2009 02:11 PM

Does
Code:

awk '{print $1}' file > output
not work?

edit: or even
Code:

cut -d' ' -f1 < file > output
?
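
Both suggestions can be checked on a two-line sample before running them on the full file. This is only a sketch: the sample lines come from the first post, and the variable names are made up for illustration. Note that cut splits on tabs unless a delimiter is given, so the space has to be passed with -d:

```shell
# Sample input: one header line and the start of one sequence line.
sample='>MCAL_43C14_r_00 872 1 872
GGCCCTTATGGCC'

# awk splits on whitespace by default, so $1 is the name without the numbers.
awk_out=$(printf '%s\n' "$sample" | awk '{print $1}')

# cut needs an explicit space delimiter; lines without a space pass through whole.
cut_out=$(printf '%s\n' "$sample" | cut -d' ' -f1)

printf '%s\n' "$awk_out"
printf '%s\n' "$cut_out"
```

Both commands leave the sequence lines alone here because those lines contain no spaces.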

indienick 08-10-2009 02:15 PM

Several things: the sed command you have there won't work as written. With "-n" and no "p" command, sed prints nothing at all, and the bracket expressions in the replacement are inserted literally instead of being matched, so the part you want to keep has to be captured with \( \) and put back with \1. I would also supply the "-e" parameter just before the regular expression. Also, sed on its own just prints the changes and does not commit them to the file. For that you need the "-i" parameter ("edit in-place"), and definitely with a filename suffix so the original is kept as a backup. Your sed command should look something like this:
Code:

sed -i.orig -e 's/\(>MCAL_[0-9][0-9][A-Z][0-9][0-9]_[fr]_[0-9][0-9]\) [0-9][0-9][0-9] 1 [0-9][0-9][0-9]/\1/' Mytilus_californianus.txt
Also, you might want to look into awk. I would highly suggest passing your text file to awk as an argument to see how it splits each line into fields. Also check out cut.
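
To see what in-place editing with a backup suffix does without risking the real data, here is a small sketch on a scratch file. The file name demo.txt and the shorter header pattern are assumptions for illustration, and -n is left out because, combined with -i and no p command, it would leave the file empty:

```shell
# Work in a throwaway directory so nothing real is modified.
tmpdir=$(mktemp -d)
printf '%s\n' '>MCAL_43C14_r_00 872 1 872' 'GGCCCTTATGGCC' > "$tmpdir/demo.txt"

# -i.orig edits demo.txt in place and keeps the untouched file as demo.txt.orig.
sed -i.orig 's/^\(>[^ ]*\) .*/\1/' "$tmpdir/demo.txt"

edited=$(cat "$tmpdir/demo.txt")
backup=$(cat "$tmpdir/demo.txt.orig")
printf '%s\n' "$edited"

# Clean up the scratch directory.
rm -r "$tmpdir"
```

If the edited file looks wrong, the .orig copy can simply be moved back over it.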

pixellany 08-10-2009 02:25 PM

First, be sure that you identify what the general pattern is---either the one to keep or the one to discard.

Let's suppose that we use this for the pattern to KEEP:
">MCAL_", any two digits, a capital letter, any two digits, "_", "f" or "r", "_", any two digits

You can use a "backreference" to find any line containing the pattern and then discard everything except the pattern.

The general form (assumes pattern at the beginning of the line):

Code:

sed 's/\(pattern\).*/\1/' filename > newfilename      ##matches pattern plus everything following, and replaces with pattern
With the pattern that you already defined:

Code:

sed 's/\(>MCAL_[0-9][0-9][A-Z][0-9][0-9]_[fr]_[0-9][0-9]\).*/\1/' filename > newfilename
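
The backreference form can be tried on a few sample lines first. This is a minimal sketch: the sample text is from the first post, and the variable names are only for illustration:

```shell
sample='>MCAL_43C14_r_00 872 1 872
GGCCCTTATGGCC
>MCAL_52K01_f_00 766 1 766'

# \( \) captures the name, .* swallows the rest of the line, \1 puts the name back.
result=$(printf '%s\n' "$sample" |
  sed 's/\(>MCAL_[0-9][0-9][A-Z][0-9][0-9]_[fr]_[0-9][0-9]\).*/\1/')

printf '%s\n' "$result"
```

Lines that do not contain the pattern, such as the sequence lines, pass through unchanged.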

pixellany 08-11-2009 06:02 AM

There's a couple of comments in this thread about the -e flag in SED. Note that this is required only when using multiple commands within the same sed invocation.

From the Grymoire tutorial:
http://www.grymoire.com/Unix/Sed.html#uh-13

mrtiller 08-11-2009 12:09 PM

Quote:

Originally Posted by pixellany (Post 3639052)
There's a couple of comments in this thread about the -e flag in SED. Note that this is required only when using multiple commands within the same sed invocation.

From the Grymoire tutorial:
http://www.grymoire.com/Unix/Sed.html#uh-13

Yes, and another alternative for multiple commands is to use newlines to separate the commands. In this case, you also do not need the -e. example:

Code:

sed '
s/xxx/yyy/
s/zzz/www/
...
'


pwc101 08-11-2009 12:10 PM

Quote:

Originally Posted by mrtiller (Post 3639514)
Yes, and another alternative for multiple commands is to use newlines to separate the commands. In this case, you also do not need the -e. example:

Code:

sed '
s/xxx/yyy/
s/zzz/www/
...
'


Or semi-colons, I believe:
Code:

sed 's/xxx/yyy/;s/zzz/www/'

pixellany 08-11-2009 01:13 PM

Quote:

Originally Posted by pwc101 (Post 3639516)
Or semi-colons, I believe:
Code:

sed 's/xxx/yyy/;s/zzz/www/'

How about that!! So why do we have the -e flag? Maybe for readability?

pwc101 08-11-2009 03:11 PM

Quote:

Originally Posted by pixellany (Post 3639579)
How about that!! So why do we have the -e flag? Maybe for readability?

No idea, I'm afraid, but I remember seeing it once, and thought it made sense :)
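
For comparison, the three ways of giving sed several commands mentioned in this thread can be run side by side. This is a small sketch with a made-up input string, and all three should produce the same line:

```shell
input='xxx zzz'

# Two -e options, each carrying one command.
with_e=$(printf '%s\n' "$input" | sed -e 's/xxx/yyy/' -e 's/zzz/www/')

# One script with the commands on separate lines.
with_newlines=$(printf '%s\n' "$input" | sed '
s/xxx/yyy/
s/zzz/www/
')

# One script with the commands separated by a semicolon.
with_semicolon=$(printf '%s\n' "$input" | sed 's/xxx/yyy/;s/zzz/www/')

printf '%s\n' "$with_e" "$with_newlines" "$with_semicolon"
```

One place where -e is needed is when an inline command is combined with a script file given by -f. Once -e or -f is used, sed no longer treats the first plain argument as the script.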

kmkocot 08-17-2009 10:24 AM

Thank you all for your help!

Forrestt, can you explain what the pattern in the search term of the sed command you gave means? I realize that it translates to "everything after the space" but I don't understand how.

The command was:
Code:

sed -e 's/ .\+//'

forrestt 08-17-2009 11:08 AM

The s/// means substitute. The "." means match any character. The "+" means one or more times, but it must be escaped so that it isn't read as an actual "+" sign. You could also have used a "*" instead of the "+"; it means zero or more times.

So you get: substitute a space followed by any character one or more times with an empty string. This removes the first space and everything after it.

HTH

Forrest
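
The difference between "+" and "*" only shows up when nothing follows the space. Here is a small sketch assuming GNU sed (the \+ form is a GNU extension) and a made-up input line:

```shell
# A line with a trailing space and nothing after it.
line='name '

# " .\+" needs at least one character after the space, so nothing matches here.
plus_out=$(printf '%s\n' "$line" | sed 's/ .\+//')

# " .*" also accepts zero characters after the space, so the space itself is removed.
star_out=$(printf '%s\n' "$line" | sed 's/ .*//')

# Brackets make the trailing space visible.
printf '[%s]\n' "$plus_out" "$star_out"
```

For the header lines in this thread both forms give the same result, since the space is always followed by the numbers.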

pixellany 08-17-2009 11:50 AM

If you are confused by the escape (\), you can also turn on extended regular expressions using the -r flag.

Thus, this works:
Code:

sed -r -e 's/ .+//'
With the -r flag, a literal "+" would then require the escape.

So, I guess there really is no way to escape the escape.......;) NO--WAIT: There is:
If you escape an escape, then it's not an escape......e.g. "\\" means a literal "\"

Are you confused yet?......;)
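
The two syntaxes can be compared directly on one header line. This is a sketch assuming GNU sed, where -r turns on extended regular expressions (GNU sed also accepts -E for the same thing). The variable names are only for illustration:

```shell
header='>MCAL_43C14_r_00 872 1 872'

# Basic regular expressions (the default): + must be written \+ to mean "one or more".
bre_out=$(printf '%s\n' "$header" | sed 's/ .\+//')

# Extended regular expressions: a bare + already means "one or more".
ere_out=$(printf '%s\n' "$header" | sed -r 's/ .+//')

# With -r, \+ is a literal plus sign, so only the "+" in "a+b" is replaced.
literal_out=$(printf '%s\n' 'a+b' | sed -r 's/\+/ plus /')

printf '%s\n' "$bre_out" "$ere_out" "$literal_out"
```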


All times are GMT -5.