LinuxQuestions.org - [SOLVED] Need help with sed to modify only lines of text meeting certain criteria

Need help with sed to modify only lines of text meeting certain criteria

Hi all,

I have a large text file with the format below:

>MCAL_43C14_r_00 872 1 872
GGCCCTTATGGCCTTTTTTTTTTTTTTTTCAAACTTTATAAAAGCTTTAA
TTGGTAGTTTGCTCCTTTAAATGGTAAAATCACAGATAAATTTATTGTGA
TAATTGTCTAGATGATTTTACAAGCAGTATAAATACATAATTGTAAACTC
AGTATATCTGCACAGAGAACAGAATAATTATACTTTTCGCAACTCGTTTC
GACGGTAAGAATGCACCAATTATATCGTCTATGCATGGTTCTCTTTCAAA
ATCTTAAAAATTGTGGGTAACTTTATTGTGTGCACGCCTGAAAGCTAGAA
TGAACTAATTTCTATTGTCCATAAATTTCCTTCAAAATATAGTAGTATAT
TGTAGTGACTTAAATTGGTCGTATTACATGACGTAATTGACGCCACTCCA
TTGGTTGGTAATCCATTTTCAGGATGATGTTGTCCAATCACACGTTTCGG
TCAGCACTTTTGGGAAATATTTCCCAGAATGCATCACATTCTTAAACGAT
TAATTGATATAGACAGATGTTCTTTTTGTTCTTGCTGCAAATAATGATTC
ATGAGACTATAATAATTATACATAGAACATCTTTAAATAAATGAAATTCA
TGAAAATCAAACAGCAGCAACCCGCGGAGTAAAGTGCATTCTCGTCATAT
TTCATACTTTGTCAGATTTATAAACTTTACTGGTATATTTGAGTTCAGTG
TAGATTTTCCATCTTAGCAGTAACGATTTGCTAAATAACATAAATGAGAC
ATATAAAAGCTTAATAAACGCCAACTACCAACAGATATATCTTTAAAAGC
GAAAGCCAACTCTTTTGCCATTTCATCAGTTGAAATCAGCATTTCAGAGG
CACTTATGTTCATGAAAAAATT
>MCAL_52K01_f_00 766 1 766
TAAAAAGAAAAATATCTTGAAGTCTAAAGGTAACTTGAAACACATTTGTT
GGAAAAAGTTCCTTCTTGGAGATCAGTTACCAACAGGTTTTCCAGGGATA
TATGACAAATATTTCATTTGTCAGCGCTTTGCACAGGATGATACGTATAT
AGGATCCACGTCTAGGAAGGCTTTCTATGGTACTGTTTATGATACAAAAC
AACGTGTTCCTCTGATGTCTTTTGGAAGACTAAGAAACCTCTCAGATACT
TCAAAACCACTAATGAAGTTTATGATTGAAAAGGGTTTGGTGTCAACTAA
AAAGCATAAAAGTGTGGTATCGACAGTATACAACTGGCTGAATGGTGCAG
AAGGAAAAGGAATGTTCTACGACAATGGTGAAATTTCAGTCTGTAATCTA
GGTCAATTTCAGGCTGTAAACACCGATTATGATACCTCAGAGTACAAAAT
GCAACAACTTTTACCACACAGTCTCACAGGAAATGACGTAAGAGAGAAAA
TAGCAACATACACACTGACTAATACGGCGCCAATTCACACATCACTACAT
GGAATGTGGGAAACTGCTTTGTCAACTGCGCGTACTTTCGCTGTCGAAAA
GTGTGGAATTCCAGTACTTTTAAATCCCGTGAGGAGACAACGAAACAGAG
TATCACGTGACCATCCAGAGATGTATGTAATATCAGGTGCGGTATCATTA
AACGATGCTGATAGCACAATAGGGAATGGGGTAGCTGTTCCATATCTATT
TTGGTTCGCAGGATGC

I am trying to remove the trailing 3 sets of spaces and numbers (e.g., " 872 1 872") such that the first one would be renamed >MCAL_43C14_r_00 and the second one would be renamed >MCAL_52K01_f_00. I know sed is at least in part the tool for the job but I'm stuck. I must not have the search pattern formatted correctly but I can't figure out what is wrong with it. Also, how do I ask sed to leave the characters I like untouched?

sed -n 's/>MCAL_[0-9][0-9][A-Z][0-9][0-9]_[fr]_[0-9][0-9] [0-9][0-9][0-9] 1 [0-9][0-9][0-9]/>MCAL_[0-9][0-9][A-Z][0-9][0-9]_[fr]_[0-9][0-9]/g' Mytilus_californianus.txt

Thanks!!!
Kevin

Provided that all the lines are formatted as above, the code:

Code:

sed -e 's/ .\+//' Mytilus_californianus.txt

should do the trick.

HTH

Forrest

Does

Code:

awk '{print $1}' file > output

not work?

edit: or even

Code:

cut -d' ' -f1 < file > output
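
A quick way to compare the two suggestions is to run both on a couple of lines modeled on the file above. This is only a sketch: the `sample` text and the variable names are made up for illustration. Note that cut splits on tabs by default, so the space delimiter has to be given with -d, whereas awk splits on whitespace without being told.

```shell
# Two made-up lines: one header and one (shortened) sequence line.
sample='>MCAL_43C14_r_00 872 1 872
GGCCCTTATGGCC'

# awk splits each line on runs of whitespace; $1 is the first field.
awk_result=$(printf '%s\n' "$sample" | awk '{print $1}')

# cut needs -d ' ' because its default field delimiter is a tab.
cut_result=$(printf '%s\n' "$sample" | cut -d ' ' -f1)

printf '%s\n' "$awk_result"
printf '%s\n' "$cut_result"
```

Sequence lines contain no spaces, so both tools pass them through unchanged.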

Several things. The sed command you have there won't work as written: the "-n" parameter suppresses all output unless a command prints explicitly, and the character classes in the replacement are inserted literally, so the part you want to keep has to be captured with \( \) and put back with \1. I would also put the "-e" parameter just before the regular expression. Also, sed only prints the changes; it does not commit them to the file. For that you need the "-i" parameter ("edit in-place"), and definitely with a filename suffix so the original is kept as a backup. Your sed command should look something like this:

Code:

sed -i.orig -e 's/\(>MCAL_[0-9][0-9][A-Z][0-9][0-9]_[fr]_[0-9][0-9]\) [0-9][0-9][0-9] 1 [0-9][0-9][0-9]/\1/' Mytilus_californianus.txt
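
To try in-place editing safely before touching the real file, something like the following could be run in a scratch directory. It is a minimal sketch: `workdir` and `sample.txt` are made-up names, and the simple ` .*` pattern stands in for the longer one. Note that -n is left out on purpose; with -n and no p flag, sed prints nothing, so -i would leave the file empty.

```shell
# Scratch directory and a made-up two-line sample file.
workdir=$(mktemp -d)
printf '%s\n' '>MCAL_43C14_r_00 872 1 872' 'GGCCCTTATGGCC' > "$workdir/sample.txt"

# -i.orig rewrites sample.txt and keeps the untouched copy as sample.txt.orig.
sed -i.orig 's/ .*//' "$workdir/sample.txt"

edited=$(cat "$workdir/sample.txt")
backup=$(cat "$workdir/sample.txt.orig")
printf '%s\n' "$edited"

# Remove the scratch directory again.
rm -rf "$workdir"
```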

You might also want to look into awk. I would highly suggest passing your text file to awk as an argument to see how it splits each line into fields. Also check out cut.

First, be sure that you identify what the general pattern is---either the one to keep or the one to discard.

Let's suppose that we use this for the pattern to KEEP:
">MCAL_", any two digits, a capital letter, any two digits, "_", "f" or "r", "_", any two digits

You can use a "backreference" to find any line containing the pattern and then discard everything except the pattern

The general form (assumes pattern at the beginning of the line):

Code:

sed 's/\(pattern\).*/\1/' filename > newfilename      ##matches pattern plus everything following, and replaces with pattern

With the pattern that you already defined:

Code:

sed 's/\(>MCAL_[0-9][0-9][A-Z][0-9][0-9]_[fr]_[0-9][0-9]\).*/\1/' filename > newfilename
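
Here is the backreference approach run on a few made-up lines in the same format, as a rough check that header names survive and sequence lines are left alone. The `sample` text and the `result` variable are illustrative only, and the leading `^` is an addition that anchors the pattern to the start of the line.

```shell
# Made-up input: two headers with a sequence line in between.
sample='>MCAL_43C14_r_00 872 1 872
GGCCCTTATGGCC
>MCAL_52K01_f_00 766 1 766'

# \( ... \) captures the header name, .* matches the rest of the line,
# and \1 in the replacement puts only the captured name back.
result=$(printf '%s\n' "$sample" |
  sed 's/^\(>MCAL_[0-9][0-9][A-Z][0-9][0-9]_[fr]_[0-9][0-9]\).*/\1/')

printf '%s\n' "$result"
```

Lines that do not match the pattern, such as the sequence data, are printed unchanged.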

There are a couple of comments in this thread about the -e flag in sed. Note that -e is only needed when you pass more than one expression to a single sed invocation; with a single expression it is optional.

From the Grymoire tutorial:
http://www.grymoire.com/Unix/Sed.html#uh-13

Quote:

Originally Posted by pixellany (Post 3639052)

Yes, and another alternative for multiple commands is to use newlines to separate the commands. In this case, you also do not need the -e. example:

Code:

sed '
s/xxx/yyy/
s/zzz/www/
...
'

Or semi-colons, I believe:

Code:

sed 's/xxx/yyy/;s/zzz/www/'
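
For what it's worth, the three ways of giving sed several commands can be checked side by side. This is just a sketch on a made-up input string; the variable names are illustrative.

```shell
input='xxx zzz'

# Form 1: one -e option per command.
with_e=$(printf '%s\n' "$input" | sed -e 's/xxx/yyy/' -e 's/zzz/www/')

# Form 2: commands separated by newlines inside a single script.
with_newlines=$(printf '%s\n' "$input" | sed '
s/xxx/yyy/
s/zzz/www/
')

# Form 3: commands separated by semicolons.
with_semicolons=$(printf '%s\n' "$input" | sed 's/xxx/yyy/;s/zzz/www/')

printf '%s\n' "$with_e" "$with_newlines" "$with_semicolons"
```

All three print "yyy www" for this input.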

How about that!! So why do we have the -e flag? Maybe for readability?

No idea, I'm afraid, but I remember seeing it once, and thought it made sense :)

Thank you all for your help!

Forrestt, can you explain what the pattern in the search term of the sed command you gave means? I realize that it translates to "everything after the space" but I don't understand how.

The command was:

Code:

sed -e 's/ .\+//'

The s/// means substitute. The "." means match any character. The "+" means one or more times, but it must be escaped so that it isn't read as an actual "+" sign (the "\+" form is a GNU sed extension). You could also have used a "*" instead of the "\+"; it means zero or more times.

So you get: substitute a space followed by one or more of any character with an empty string. This removes the first space and everything after it.
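
To see where "\+" and "*" actually differ, here is a small sketch on made-up input. It assumes GNU sed, since "\+" is not part of standard basic regular expressions; the variable names are illustrative.

```shell
header='>MCAL_43C14_r_00 872 1 872'

# On a header line both forms remove the first space and everything after it.
plus_result=$(printf '%s\n' "$header" | sed 's/ .\+//')
star_result=$(printf '%s\n' "$header" | sed 's/ .*//')

# They differ only when nothing follows the space: "\+" needs at least one
# character after it, while "*" is satisfied with zero.
trailing='abc '
plus_trailing=$(printf '%s\n' "$trailing" | sed 's/ .\+//')
star_trailing=$(printf '%s\n' "$trailing" | sed 's/ .*//')

printf '[%s]\n' "$plus_result" "$star_result" "$plus_trailing" "$star_trailing"
```

For the file in this thread the two forms give the same result, since every header has text after the first space.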

HTH

Forrest

If you are confused by the escape (\), you can also turn on extended regular expressions using the -r flag.

Thus, this works:

Code:

sed -r -e 's/ .+//'

With the -r flag, a literal "+" would then require the escape.
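
A short sketch of how the escaping flips between the two modes, assuming GNU sed (where -r selects extended regular expressions). The inputs and variable names are made up.

```shell
line='>MCAL_43C14_r_00 872 1 872'

# Basic regular expressions: "+" must be escaped to mean "one or more".
bre_result=$(printf '%s\n' "$line" | sed 's/ .\+//')

# Extended regular expressions (-r): a bare "+" already means "one or more".
ere_result=$(printf '%s\n' "$line" | sed -r 's/ .+//')

# With -r, the escaped form "\+" matches a literal plus sign instead.
literal_plus=$(printf '%s\n' 'a+b' | sed -r 's/\+/ plus /')

printf '%s\n' "$bre_result" "$ere_result" "$literal_plus"
```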

So, I guess there really is no way to escape the escape.......;) NO--WAIT: There is:
If you escape an escape, then it's not an escape......e.g. "\\" means a literal "\"

Are you confused yet?......;)
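
For anyone who wants to see the escaped escape in action, here is a tiny sketch on a made-up string. It uses "|" as the s command delimiter so that the forward slash in the replacement needs no escaping of its own.

```shell
# A made-up string containing one literal backslash.
path='C:\temp'

# Inside the sed script, "\\" matches a single literal backslash.
converted=$(printf '%s\n' "$path" | sed 's|\\|/|')

printf '%s\n' "$converted"
```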