Using wildcards in a sed command
Hi All,
I have an additional problem with the sed command. I would like to replace the string "1:N:0:ACAGTG" with /1. However, I recently found out that the N can also be a Y, and the 0 can also be a number with 1 to 4 digits. The basic command I'm using is:
sed -i 's! 1:N:0:ACAGTG!/1!g' NQ001/NQ001_R1r.fastq
I tried to use wildcards (i.e. 1:*:*:ACAGTG, 1:*:**:ACAGTG, 1:*:***:ACAGTG, 1:*:****:ACAGTG in four different sed commands) but it didn't work. Any ideas how I can replace them all? There could be about a hundred variations of the numbers in about 30 million entries per file and I don't want to replace them individually. |
I think you're confusing the shell's wildcard * with regular-expressions' 0-n repeat * operator. Your 1:*:*:ACAGTG (etc.) specifies a 1 followed by any number of colons followed by any other number of colons followed by a single colon .... but nowhere in there are you searching for anything between the colons. The only text your expression can match is " 1" followed by at least one colon followed by ACAGTG.
What I think you want is sed -i 's, 1:[^:]*:[^:]*:ACAGTG,/1,g'. That'll match " 1:Flew:OvertheCuckoo'sNest:ACAGTG" so you may want to hunt up how to restrict the matches a bit better. |
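To see the difference between the two patterns without touching a real file, here is a minimal sketch that pipes two header lines through sed. The first header is the one posted later in this thread; the second is made up to cover the Y and multi-digit case.

```shell
# Two sample read headers; the second one is invented for illustration.
headers='@HWI-ST261:396:B0D48ABXX:8:1101:15630:3112 1:N:0:ACAGTG
@HWI-ST261:396:B0D48ABXX:8:1101:15630:3113 1:Y:1234:ACAGTG'

# ":*" means "zero or more colons", so nothing between the colons can match
# and both lines come back unchanged.
wrong=$(printf '%s\n' "$headers" | sed 's, 1:*:*:ACAGTG,/1,g')

# "[^:]*" means "zero or more characters that are not a colon".
right=$(printf '%s\n' "$headers" | sed 's, 1:[^:]*:[^:]*:ACAGTG,/1,g')

printf '%s\n' "$wrong"
printf '%s\n' "$right"
```

Testing on a pipe like this first is cheap insurance before running sed -i on a 5 GB file.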
That's great, thanks!
I'm pretty sure that the start and end are quite unique and there is no need to restrict the matches better. |
Well if you did want to:
Code:
sed -ri 's!1:[NY]:[0-9]{1,4}:ACAGTG!/1!g' NQ001/NQ001_R1r.fastq |
In both cases I get the following error message:
[jc167987@login NQ017]$ sed -i 's! 1:[^:]*:[^:]*:GCCAAT!/1!g' NQ017_R1r.fastq
/1!g: Event not found.
[jc167987@login NQ017]$ sed -ri 's! 1:[NY]:[0-9]{1,4}:GCCAAT!/1!g' NQ017_R1r.fastq
/1!g: Event not found.
Edit: actually, if I run it in a script it seems to work, but I have only tested the first command so far. |
I am curious where you are running this as 'Event not found' is not an error message I have ever seen before from sed??
I have tested and the solutions seem to work ok for me. |
I ran them on our high performance computer running Linux by using putty.
I'm new to all of this, so I'm not sure what other information I could give you. |
What version of sed are you running?
Were the lines you showed and the respective errors typed by hand, or copied and pasted from the terminal? |
Quote:
the ! runs the last command that matches the following letters (history expansion). Example:
Code:
$ echo hello /1g
Since there is no such command, it gives the 'event not found' error. Are you sure that you used single quotes? Those should prevent this kind of error. |
Thanks for all the answers and sorry about the double posting of my question.
I couldn't get the version of sed. I'm working remotely on a high performance computer and sed -V (or -v) didn't come up with a version number. I just started working with Linux a month ago, so I'm not 100% sure yet what I'm doing all the time. The lines I posted above containing the error message were directly copied and pasted from the terminal. Crts, you said if I run it with double quotes it will try to redo a previous command, but I ran it with single quotes. The thing is, it runs perfectly fine if I use it within a script (see below), but I got the error message when using them directly in the terminal. The data looks like this:
Code:
@HWI-ST261:396:B0D48ABXX:8:1101:15630:3112 1:N:0:ACAGTG
Code:
#!/bin/bash |
No short option for version so you would need --version.
I would ask, does the above actually work, ie have the changes been made in the file(s)? Are you pushing all the commands into the background because the file(s) are so large? |
The files are about 5 Gb each. Am I correct in the understanding that if I didn't use &, they would just be carried out in sequence? And with the & they are done in parallel?
The above works and the changes were made in the files. I checked by grepping the adaptor sequences (i.e. the strings of A, C, T and G at the end of the ID) and the searches came up with nothing. Each file has about 30 million entries containing the four lines shown above. That's why I didn't find the variation in the numbers until I learned about grep, since they are comparably rare. I have a limited time to assemble my transcriptomes from scratch and, as much as I would love to read up on everything in detail, I'm just focussing on what I need for the time being. Hence I use a lot of copying with just enough understanding to make it work. However, I'm highly fascinated by this experience (when I was 16 I had the choice to go into chemistry or IT, and chemistry won, even though I still like using computers on a higher level) and certainly will get as much Linux knowledge as I can over time. I will get the version number tomorrow morning. |
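On the & question: yes, without & the commands run one after another, and with & each one is started in the background so they run side by side. A minimal sketch of that pattern, followed by the grep check described above, might look like this. The file names and the two tiny inputs are placeholders, and the output is written to new files rather than edited in place so the sketch is safe to try.

```shell
# Hypothetical stand-ins for two large FASTQ files.
tmp=$(mktemp -d)
printf '@read1 1:N:0:ACAGTG\n' > "$tmp/a.fastq"
printf '@read2 1:Y:17:GCCAAT\n' > "$tmp/b.fastq"

# Each trailing & starts the job in the background, so both run at once.
sed 's, 1:[^:]*:[^:]*:ACAGTG,/1,g' "$tmp/a.fastq" > "$tmp/a.out" &
sed 's, 1:[^:]*:[^:]*:GCCAAT,/1,g' "$tmp/b.fastq" > "$tmp/b.out" &

# wait blocks until every background job of this shell has finished.
wait

a_result=$(cat "$tmp/a.out")
b_result=$(cat "$tmp/b.out")

# grep -c counts matching lines; it exits non-zero when the count is 0,
# hence the "|| true".
leftover=$(grep -c 'ACAGTG$' "$tmp/a.out" || true)

printf '%s\n%s\nleftover adaptor lines: %s\n' "$a_result" "$b_result" "$leftover"
rm -r "$tmp"
```

Whether running several jobs in parallel is actually faster depends on how many cores and how much disk bandwidth the machine gives you.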
No need for the version number: crts nailed it. Lose the bangs ("!").
Instead of s!this!that!g use s,this,that,g or s`this`that`g or whatever. I like commas or backticks myself, they make a visible break. Until you have time to get better acquainted with shell syntax and its interactive assists, get in the habit of single-quoting any argument that has anything but alphanumerics or +-_/,. You'll gradually find more safe ones, but bang ("!") is high-priority metasyntax for interactively constructing command lines, fast, from pieces of earlier ones. gtg, sorry if this was too elliptical, happy thanksgiving, Jim |
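For what it's worth, here is a small sketch showing that the delimiter is purely cosmetic as far as sed is concerned: all three spellings below produce the same result when run from a script. The header line is the one posted earlier in the thread.

```shell
line='@HWI-ST261:396:B0D48ABXX:8:1101:15630:3112 1:N:0:ACAGTG'

# Same substitution, three delimiters. The "!" form is fine in a script,
# but an interactive shell with history expansion may try to expand it.
with_bang=$(printf '%s\n' "$line" | sed 's! 1:[^:]*:[^:]*:ACAGTG!/1!g')
with_comma=$(printf '%s\n' "$line" | sed 's, 1:[^:]*:[^:]*:ACAGTG,/1,g')
with_pipe=$(printf '%s\n' "$line" | sed 's| 1:[^:]*:[^:]*:ACAGTG|/1|g')

printf '%s\n' "$with_bang" "$with_comma" "$with_pipe"
```

So the only reason to avoid the bang is the interactive shell, not sed itself.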
Quote:
Another alternative would be to deactivate history expansion. In bash you can do it with
Code:
set +H
Can you post the name of the system and the shell you are logged in to? And how do you log in (ssh, telnet ...)? This is definitely not a 'sed' issue. |
That seems to have done the trick. Thanks for all your help.
I agree, I like the look of commas. I will make sure that I acknowledge this forum in my thesis for your continued help! Just to be complete: Code:
/bin/tcsh |
Code:
/bin/tcsh
You're still very new to shells, so may I recommend you switch to bash, if only to get yourself on a shell most people will unthinkingly presume you're using? It'll be a little different but shouldn't be too hard. The command to do it is "chsh -s /bin/bash". Use "cat /etc/shells" to get a list of the options it'll accept on your system if you're curious. |
Thanks for the additional information. I'm not sure if I can actually change the shell on the high performance computer.
Oh, and I think I forgot to answer this before, I log in using putty. If I start my scripts with /bin/bash I should at least be safe there? It's just the active command lines that run on tcsh? Code:
[jc167987@login]$ cat /etc/shells |
Say 'man ypchsh', it'll tell you how to run it. Seems everyone uses putty, so it must be at least pretty good. Yes to the /bin/bash part. From the command line you can also just say bash, like "bash myscript", but /bin/bash is explicit for security. I still recommend changing; it won't be long at all before you start composing one-liners at the keyboard, because that's pretty much what unix is made for. It's a toolmaker's paradise. He who made kittens put snakes in the grass, natch :-)
|
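As a rough illustration of the "bash myscript" route, the sketch below writes a tiny script to a temporary directory and runs it through bash explicitly. The script name fix_headers.sh and the one-line input are made up for the example.

```shell
# Write a tiny script whose first line asks for bash explicitly.
tmp=$(mktemp -d)
cat > "$tmp/fix_headers.sh" <<'EOF'
#!/bin/bash
# Scripts run non-interactively, so "!" is not treated as history syntax here.
printf '%s\n' '@read1 1:N:0:ACAGTG' | sed 's! 1:[^:]*:[^:]*:ACAGTG!/1!g'
EOF

# Running it as "bash script" works regardless of the login shell.
script_out=$(bash "$tmp/fix_headers.sh")

printf '%s\n' "$script_out"
rm -r "$tmp"
```

This matches what was reported earlier in the thread: the same sed command that failed at the tcsh prompt ran fine from inside a script.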