LinuxQuestions.org


Lokelo 11-15-2011 08:27 PM

sed command to replace special character /
 
Hi All,

I'm a biochemist/geneticist at heart, but I have been dabbling with our local high performance computer a bit to assemble some sequence data.

I'm using the following command to replace "2:N:0:ACAGT" with "/2"

sed -i 's/2:N:0:ACAGT//2/g' NQ001_R2t.fastq > NQ001_R2tr.fastq &

Naturally, the command doesn't like the additional "/". Any ideas on how to fix this, or an alternative solution?

The file I'm working on is quite large... there are about 30 million entries that need to be changed.

evo2 11-15-2011 09:14 PM

Hi,

an often overlooked feature of sed is that you don't have to use / as the "delimiter" (sorry don't remember the correct name). You can use any character. So just use something else and then / should be treated correctly in your search string.

Eg, using "$" instead of "/"
Code:

sed 's$2:N:0:ACAGT$/2$g' NQ001_R2t.fastq  > NQ001_R2tr.fastq
Note also that since you are redirecting the output to a different file, it makes no sense to specify an in-place edit (-i).
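
Both points can be checked on a throwaway file. This is only a sketch: the scratch directory, the file names and the sample header line are made up for illustration, and -i with no suffix assumes GNU sed.

```shell
# Hypothetical scratch directory and sample read header, for illustration only.
tmp=$(mktemp -d)
printf '%s\n' '@READ_1 2:N:0:ACAGT' > "$tmp/sample.fastq"

# '$' is the delimiter, so the '/' in the replacement needs no escaping.
# Without -i, sed writes the edited text to stdout, which the shell redirects.
sed 's$2:N:0:ACAGT$/2$g' "$tmp/sample.fastq" > "$tmp/sample_out.fastq"
result=$(cat "$tmp/sample_out.fastq")
echo "$result"

# With -i (GNU sed syntax assumed), sed rewrites the input file itself and
# prints nothing, so the redirected file ends up empty.
sed -i 's$2:N:0:ACAGT$/2$g' "$tmp/sample.fastq" > "$tmp/empty_out.fastq"
empty_size=$(wc -c < "$tmp/empty_out.fastq" | tr -d ' ')
echo "bytes written via redirect with -i: $empty_size"

rm -r "$tmp"
```

On a 30-million-entry file it is probably worth trying the command on the first few lines (for example via head) before running it on the whole thing.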

HTH,

Evo2.

Lokelo 11-15-2011 09:21 PM

That's great, thanks!!

As I said, I'm completely new to this. My main programming experience was 17 years ago in very beginner basic.
I'm mainly copying stuff and trying to figure out how I can get it to work for my own needs.
I was wondering why redirecting the output didn't work; it just produced an empty file. I will remove "-i".

chrism01 11-16-2011 01:03 AM

Here's a good sed HOWTO http://www.grymoire.com/Unix/Sed.html#uh-0. More generally, this is good http://rute.2038bug.com/index.html.gz

Lokelo 11-16-2011 01:39 AM

Thanks for the links. So much to read up on!

linuxwin2 11-16-2011 04:01 AM

Or


sed 's/2:N:0:ACAGT/\/2/g' NQ001_R2t.fastq > NQ001_R2tr.fastq &

fortran 11-16-2011 04:11 AM

sed 's/2:N:0:ACAGT/\/2/g' "path of old file.txt" > "path of new file.txt"

David the H. 11-16-2011 11:36 AM

By the way, "any delimiter" really means almost any ASCII character. I've done some tests, and the only ones I found you cannot use are null and newline (the POSIX specification also rules out backslash). You can even use non-printing control characters with the help of various shell features that let you insert their literal values.

Code:

I=$'\a'        #sets variable I to "system bell" using ansi-c style quoting (bash/ksh)

sed "s${I}foo${I}bar${I}" file

Using the full ${var} form makes it more readable, IMO.
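
In shells without $'...' quoting, printf should be able to produce the same character. A small sketch, using a made-up header line, that compares the bell-delimited form with the escaped-slash form posted earlier:

```shell
line='@READ_1 2:N:0:ACAGT'   # made-up sample header, not real data

# printf is a portable way to get the bell character without $'...' quoting.
I=$(printf '\a')

# Escaped slash with the default delimiter.
escaped=$(printf '%s\n' "$line" | sed 's/2:N:0:ACAGT/\/2/g')
# Bell character as the delimiter, so '/' needs no escaping.
bell=$(printf '%s\n' "$line" | sed "s${I}2:N:0:ACAGT${I}/2${I}g")

[ "$escaped" = "$bell" ] && echo "same result: $escaped"
```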

Lokelo 11-22-2011 04:01 AM

Hi All,

I have an additional problem. Turns out that there is some variation in the string I want to replace (I have 5x 30 million entries to go through).

I thought the string is 1:N:0:ACAGTG
However, the N can also be a Y, and the 0 can also be a number with 2 to 4 digits.

I tried to use wildcards (i.e. 1:*:*:ACAGTG, 1:*:**:ACAGTG, 1:*:***:ACAGTG, 1:*:****:ACAGTG in four different sed commands) but it didn't work. Any ideas how I can replace them all? There could be about a hundred variations of the numbers and I don't want to replace them individually.

David the H. 11-23-2011 06:13 AM

It would likely help more if you could post some actual example lines, and show us where they need to be changed. Include examples of lines that should not be matched, if there's any risk of catching the wrong ones.

And please use [code][/code] tags around your code and data, to preserve formatting and to improve readability.

sed doesn't use "wildcards" (traditionally called globbing in computer-ese), it uses regular expressions, which are more complex and powerful, and have a different syntax.

In regex, * means "zero or more of the previous character" (or expression), so ":*" matches a string of colons of any length (including none). Also note that * is what we call "greedy", which means that it will always try to match the longest string possible. In a simple case like this it may not be a problem, but it can be hard to control with more complex patterns.
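
To make that concrete, here is a small sketch on made-up input strings. It shows what the glob-style pattern from the question actually means as a regex, and how a greedy ".*" differs from a bracket expression that cannot cross a colon.

```shell
# A glob-style pattern read as a regex: "1", any run of colons (twice), ":ACAGTG".
glob_like=$(printf '%s\n' '1:N:0:ACAGTG' | sed 's/1:*:*:ACAGTG/X/')
only_colons=$(printf '%s\n' '1:::ACAGTG' | sed 's/1:*:*:ACAGTG/X/')
echo "$glob_like"    # line is left unchanged: the N and 0 are not colons
echo "$only_colons"  # the whole line is replaced

# Greedy: '.*' stretches to the last colon it can reach on the line.
greedy=$(printf '%s\n' 'a:b:c:d' | sed 's/:.*:/-/')
# '[^:]*' cannot cross a colon, so the match stops at the next one.
bounded=$(printf '%s\n' 'a:b:c:d' | sed 's/:[^:]*:/-/')
echo "$greedy"
echo "$bounded"
```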

Many, many computer tools support and use regex, so I highly suggest you get online, find yourself a good regular expressions tutorial, and work your way through it. At the very least learn the basic level stuff. Here's a fairly decent guide to start you off: http://www.grymoire.com/Unix/Regular.html


Anyway, to do what you want you first need to apply sed's "extended" regex option (-r in GNU sed; -E is accepted by GNU sed as well and is the spelling BSD sed uses). Taking the description you gave, here are two possibilities, depending on how accurate you need it to be.

Code:

sed -r 's^1:[NY]:[0-9]{1,4}:ACAGTG^/2^'
I decided to use ^ as the delimiter.

[NY] means to match either a single N or Y. [..] is used to specify a list of possible characters that can exist at a position. Similarly...

[0-9]{1,4} means one to four digits. [0-9] means any digit, and {m,n} specifies the number of repeats of the previous match that are allowed (this is the part that requires the use of ext. regex).

So the above will match any number up to four digits long in the third field. However your description appears to state that the field can be either a single 0, or 2-4 digits of any kind. If this means that we have to avoid matching a single digit other than 0, then we have to be more cautious.

Code:

sed -r 's^1:[NY]:(0|[0-9]{2,4}):ACAGTG^/2^'
For this, I've grouped two possible choices together using (|). That position can now be either 0, or it can be 2-4 digits. But it can't be a single non-zero digit.
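
The difference between the two patterns can be seen on a few sample lines. This is a sketch only: the lines are made up from the description in the question, and -E is used on the assumption that it behaves like -r (which it does in GNU sed).

```shell
# Made-up third-field values: 0, a single non-zero digit, 2 digits, 4 digits.
samples='1:N:0:ACAGTG
1:Y:5:ACAGTG
1:N:12:ACAGTG
1:Y:1234:ACAGTG'

# First pattern: any one to four digits in the third field.
loose=$(printf '%s\n' "$samples" | sed -E 's^1:[NY]:[0-9]{1,4}:ACAGTG^/2^')
# Second pattern: a lone 0, or two to four digits.
strict=$(printf '%s\n' "$samples" | sed -E 's^1:[NY]:(0|[0-9]{2,4}):ACAGTG^/2^')

echo "loose:";  echo "$loose"
echo "strict:"; echo "$strict"
```

Only the second line (the single non-zero digit) should come out differently between the two runs.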


Notice how effective use of regular expressions depends on you being able to clearly define the pattern to match. In particular you need to be able to state what exactly makes the section you want different from the sections you don't want. So if what I gave doesn't suit your purposes, you'll need to come back with a more detailed explanation of what needs to be matched.

grail 11-23-2011 08:26 AM

It might also help if you obeyed LQ rules and only asked your question in one place (http://www.linuxquestions.org/questi...ommand-914971/)

Lokelo 11-23-2011 08:59 AM

Yes, I'm sorry about that. I hope it is OK if I mark this thread as solved and answer in the other thread.

Thank you for your very extensive answer, David; it was very helpful. I will read up on the use of regular expressions.


All times are GMT -5.