[SOLVED] Using wildcards in a sed command

Lokelo · 11-22-2011, 05:14 PM

Hi All,

I have an additional problem with the sed command.

I would like to replace the string "1:N:0:ACAGTG" with /1. However, I recently found out that the N can also be a Y, and the 0 can also be a number with 1 to 4 digits.

The basic command I'm using is: sed -i 's! 1:N:0:ACAGTG!/1!g' NQ001/NQ001_R1r.fastq

I tried to use wildcards (i.e. 1:*:*:ACAGTG, 1:*:**:ACAGTG, 1:*:***:ACAGTG, 1:*:****:ACAGTG in four different sed commands) but it didn't work. Any ideas how I can replace them all? There could be about a hundred variations of the numbers in about 30 million entries per file and I don't want to replace them individually.

jthill · 11-22-2011, 06:23 PM

I think you're confusing the shell's wildcard * with regular-expressions' 0-n repeat * operator. Your 1:*:*:ACAGTG (etc.) specifies a 1 followed by any number of colons followed by any other number of colons followed by a single colon .... but nowhere in there are you searching for anything between the colons. The only text your expression can match is " 1" followed by at least one colon followed by ACAGTG.

What I think you want is sed -i 's, 1:[^:]*:[^:]*:ACAGTG,/1,g'. That'll match " 1:Flew:OvertheCuckoo'sNest:ACAGTG" so you may want to hunt up how to restrict the matches a bit better.
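To make the difference concrete, here is a small sketch that pipes one header line through both patterns. The header is copied from the sample data posted further down the thread; the two function names are made up for illustration.

```shell
# One FASTQ header line, as posted later in the thread.
header='@HWI-ST261:396:B0D48ABXX:8:1101:15630:3112 1:N:0:ACAGTG'

# Glob-style attempt: in a regex ":*" only means "zero or more colons",
# so nothing can match the N or the 0 and the line comes back unchanged.
glob_style() { printf '%s\n' "$1" | sed 's, 1:*:*:ACAGTG,/1,g'; }

# "[^:]*" means "any run of characters that are not colons",
# which covers both variable fields.
field_style() { printf '%s\n' "$1" | sed 's, 1:[^:]*:[^:]*:ACAGTG,/1,g'; }

glob_style "$header"    # prints the header unchanged
field_style "$header"   # prints the header with the tag replaced by /1
```

The second call should print `@HWI-ST261:396:B0D48ABXX:8:1101:15630:3112/1`.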

Lokelo · 11-22-2011, 06:36 PM

That's great, thanks!

I'm pretty sure that the start and end are quite unique and there is no need to restrict the matches better.

grail · 11-22-2011, 06:51 PM

Well if you did want to:

Code:

sed -ri 's! 1:[NY]:[0-9]{1,4}:ACAGTG!/1!g' NQ001/NQ001_R1r.fastq

Lokelo · 11-22-2011, 07:07 PM

In both cases I get the following error message:

[jc167987@login NQ017]$ sed -i 's! 1:[^:]*:[^:]*:GCCAAT!/1!g' NQ017_R1r.fastq
/1!g: Event not found.
[jc167987@login NQ017]$ sed -ri 's! 1:[NY]:[0-9]{1,4}:GCCAAT!/1!g' NQ017_R1r.fastq
/1!g: Event not found.

Edit: actually, if I run it in a script it seems to work, but I have only tested the first command so far.

grail · 11-23-2011, 01:22 AM

I am curious where you are running this as 'Event not found' is not an error message I have ever seen before from sed??
I have tested and the solutions seem to work ok for me.

Lokelo · 11-23-2011, 01:51 AM

I ran them on our high performance computer running Linux by using putty.

I'm new to all of this, so I'm not sure what other information I could give you.

grail · 11-23-2011, 02:21 AM

What version of sed are you running?
Were the lines you have shown and the respective errors typed by hand, or did you copy and paste them from the terminal?

crts · 11-23-2011, 02:34 AM

Quote:

Originally Posted by Lokelo

In both cases I get the following error message:

[jc167987@login NQ017]$ sed -i 's! 1:[^:]*:[^:]*:GCCAAT!/1!g' NQ017_R1r.fastq
/1!g: Event not found.
[jc167987@login NQ017]$ sed -ri 's! 1:[NY]:[0-9]{1,4}:GCCAAT!/1!g' NQ017_R1r.fastq
/1!g: Event not found.

Edit: actually, if I run it in a script it seems to work, but I have only tested the first command so far.

Hi,

the ! runs the last command that matches the following letters (history expansion). Example:

Code:

$ echo hello
hello
$ !ec
echo hello
hello
$

If you run the above 'sed' with double-quotes instead of single-quotes then bash will try to match a command that you issued earlier that starts with
/1!g

Since there is no such command it gives the 'event not found' error. Are you sure that you used single-quotes? Those should prevent this kind of error.

Lokelo · 11-23-2011, 08:15 AM

Thanks for all the answers and sorry about the double posting of my question.

I couldn't get the version of sed. I'm working remotely on a high performance computer and sed -V (or v) didn't come up with a version number.
I just started working with linux a month ago, so I'm not 100% sure yet what I'm doing all the time.

The lines I posted above containing the error message were directly copy/pasted from the terminal.

Crts, you said if I run it with double quotes it will try to redo a previous command, but I ran it with single quotes.

The thing is, it runs perfectly fine if I use it within a script (see below), but I got the error message when using them directly in the terminal.

The data looks like this:

Code:

@HWI-ST261:396:B0D48ABXX:8:1101:15630:3112 1:N:0:ACAGTG
TCAGGGGTGAATGGATGCACTGTTCTGGATGGTGGTGCTGAACTAGCACCGGGTGCTTGTGGATGTGCCAGAGAAGCATACAAGGGGACGGTGGAAGGAT
+
BCCFFFF@FHGHHJJJJJJJJJJJJJIJEGIJCAH@GHGIHHIIEHE?FHIJJ5AHHHGBDD@DA;;AC;;5;(;?>,:@C:>:ABD555>0?<+49@<:
@HWI-ST261:396:B0D48ABXX:8:1101:15519:3117 1:N:0:ACAGTG
CCGCGATATGCCGTCTCGACGCCGACAACGAGCATCATCAAGATAATCGACCACTTCTATGATCTGAAGCTCGGTTGTTGCCTCTTCTCTCCTCCAGTCT
+
@CCFFFFFHHHHHJJJJJJJJJJJJJJIJJJJJHHHHHHFFFEFFEEECEDDDDDDDD>DCCCA@CDDDDDDDDB@B@BA>@ACDDDD@CDD@<C#####
@HWI-ST261:396:B0D48ABXX:8:1101:15632:3220 1:Y:2016:ACAGTG
CGGAGAGGGAGTAGACGAGCTGCGGCAGCACCTCGTTCGAGACGACCGCCTCAGCGAGCTCGTCGTTGTAGTTGGCGAGGCGCCCGAGGGCGAGCGCTGC
+
@CCFFFFFHFHFHIJIJJJJJJJGJJGHEDAHIJGGIGIGHGFAD8>BDBBDDDA'5057@-&8;;@C?:>(4:>CB5<B9>-5@B##############
@HWI-ST261:396:B0D48ABXX:8:1103:3693:192960 1:N:514:ACAGTG
CTCGCCAACATCGCCGCCCCTATTTTGATGGAGTAGTACGCCCCTCGCCTCCGAACACAACTCATCCGATGGCATCACGTCGTTGGGCACTTGAGACCGG
+
@@@BFFDDHHHHHJJIIJJIDFHGIIE=GIJ3BFHHGIJCGHHHHHFFDBDDD8;?BCB@BACCDCCB<?9-<3@?C@0+8>BBC###############

As I said, I managed to replace the bit I wanted using the following script:

Code:

#!/bin/bash

sed -i 's! 1:[^:]*:[^:]*:GCCAAT!/1!g' /home/11/jc167987/NGSdata/Data/NQ017/NQ017_R1r.fastq &
sed -i 's! 1:[^:]*:[^:]*:CAGATC!/1!g' /home/11/jc167987/NGSdata/Data/NQ040/NQ040_R1r.fastq &
sed -i 's! 1:[^:]*:[^:]*:ACTTGA!/1!g' /home/11/jc167987/NGSdata/Data/NQ136/NQ136_R1r.fastq &
sed -i 's! 1:[^:]*:[^:]*:GATCAG!/1!g' /home/11/jc167987/NGSdata/Data/NQ283/NQ283_R1r.fastq &

sed -i 's! 2:[^:]*:[^:]*:GCCAAT!/2!g' /home/11/jc167987/NGSdata/Data/NQ017/NQ017_R2r.fastq &
sed -i 's! 2:[^:]*:[^:]*:CAGATC!/2!g' /home/11/jc167987/NGSdata/Data/NQ040/NQ040_R2r.fastq &
sed -i 's! 2:[^:]*:[^:]*:ACTTGA!/2!g' /home/11/jc167987/NGSdata/Data/NQ136/NQ136_R2r.fastq &
sed -i 's! 2:[^:]*:[^:]*:GATCAG!/2!g' /home/11/jc167987/NGSdata/Data/NQ283/NQ283_R2r.fastq &

I only tried the second command once in a script, which didn't work. But since the command above worked I didn't follow it up further, although the other command is a bit more elegant.
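Since the eight commands differ only in sample name, adaptor and read number, the same job could be written as a loop. The following is a sketch rather than a drop-in replacement: the sample and adaptor pairs are the ones from the script above, but the function name strip_tags, the PAIRS variable and the scratch directory with tiny stand-in files are assumptions made for the demo. It uses commas as the sed delimiter and relies on GNU sed's -i.

```shell
#!/bin/bash
# Sample:adaptor pairs taken from the script above.
PAIRS='NQ017:GCCAAT NQ040:CAGATC NQ136:ACTTGA NQ283:GATCAG'

# strip_tags DIR: start one background sed per file under DIR,
# then wait until all of them have finished.
strip_tags() {
  local dir=$1 pair sample adaptor read
  for pair in $PAIRS; do
    sample=${pair%%:*}     # text before the colon
    adaptor=${pair##*:}    # text after the colon
    for read in 1 2; do
      sed -i "s, ${read}:[^:]*:[^:]*:${adaptor},/${read},g" \
        "$dir/$sample/${sample}_R${read}r.fastq" &
    done
  done
  wait   # returns once every background sed has exited
}

# Demo on throwaway files in a scratch directory, not the real data tree.
demo=$(mktemp -d)
for pair in $PAIRS; do
  sample=${pair%%:*}
  mkdir -p "$demo/$sample"
  for read in 1 2; do
    printf '@r1 %s:Y:2016:%s\nACGT\n+\nFFFF\n' "$read" "${pair##*:}" \
      > "$demo/$sample/${sample}_R${read}r.fastq"
  done
done

strip_tags "$demo"
head -n 1 "$demo/NQ017/NQ017_R2r.fastq"
```

The trailing & puts each sed in the background so the files are processed side by side, and wait keeps the script from exiting before they are done. Without the &, the commands would simply run one after another.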

grail · 11-23-2011, 08:28 AM

No short option for version so you would need --version.

I would ask, does the above actually work, ie have the changes been made in the file(s)?

Are you pushing all the commands into the background because the file(s) are so large?

Lokelo · 11-23-2011, 09:14 AM

The files are about 5 GB each. Am I correct in understanding that if I didn't use &, they would just be carried out in sequence? And with the & they are done in parallel?

The above works and the changes were made in the files. I checked by grepping the adaptor sequences (i.e. the strings of A, C, T and G at the end of the ID) and the searches came up with nothing. Each file has about 30 million entries containing the four lines shown above. That's why I didn't find the variation in the numbers until I learned about grep, since they are comparatively rare.

I have limited time to assemble my transcriptomes from scratch, and as much as I would love to read up on everything in detail, I'm just focussing on what I need for the time being. Hence I use a lot of copying with just enough understanding to make it work.
However, I'm highly fascinated by this experience (when I was 16 I had the choice to go into chemistry or IT, and chemistry won, even though I still like using computers on a higher level) and certainly will get as much Linux knowledge as I can over time.

I will get the version number tomorrow morning.

jthill · 11-23-2011, 10:40 AM

No need for the version number: crts nailed it. Lose the bangs ("!").

Instead of s!this!that!g use s,this,that,g or s`this`that`g or whatever.

I like commas or backticks myself, they make a visible break. Until you have time to get better acquainted with shell syntax and its interactive assists, get in the habit of single-quoting any argument that has anything but alphanumerics or +-_/,. You'll gradually find more safe ones, but bang ("!") is high-priority metasyntax for interactively constructing command lines, fast, from pieces of earlier ones.
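As a small illustration of the delimiter choice, the sketch below runs the same substitution three ways on one header line from the posted data. The function names are invented for the example. The point is that the replacement /1 contains a slash, so the default slash delimiter forces an escape, while any other delimiter that does not occur in the pattern avoids it.

```shell
line='@HWI-ST261:396:B0D48ABXX:8:1101:15632:3220 1:Y:2016:ACAGTG'

# Default slash delimiter: the / in the replacement must be written as \/.
with_slashes() { printf '%s\n' "$1" | sed 's/ 1:[^:]*:[^:]*:ACAGTG/\/1/g'; }

# Comma delimiter: no escaping needed, and no ! for the shell to trip over.
with_commas() { printf '%s\n' "$1" | sed 's, 1:[^:]*:[^:]*:ACAGTG,/1,g'; }

# Pipe delimiter: works the same way.
with_pipes() { printf '%s\n' "$1" | sed 's| 1:[^:]*:[^:]*:ACAGTG|/1|g'; }

with_slashes "$line"
with_commas "$line"
with_pipes "$line"
```

All three calls should print the same line, ending in 3220/1.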

gtg, sorry if this was too elliptical, happy thanksgiving,
Jim

crts · 11-23-2011, 10:56 AM

Quote:

Originally Posted by Lokelo

I couldn't get the version of sed. I'm working remotely on a high performance computer and sed -V (or v) didn't come up with a version number.
I just started working with linux a month ago, so I'm not 100% sure yet what I'm doing all the time.

The lines I posted above containing the error message were directly copy/pasted from the terminal.

Crts, you said if I run it with double quotes it will try to redo a previous command, but I ran it with single quotes.

The thing is, it runs perfectly fine if I use it within a script (see below), but I got the error message when using them directly in the terminal.

Hmm, this is strange. But since you are working remotely I wonder which shell you are using on the remote system.
Another alternative would be to deactivate history expansion. In bash you can do it with

Code:

set +H

The reason why it works inside a script is because history expansion does not work inside a script.
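For anyone who wants to check this on their own system, here is a minimal sketch for bash. It inspects the special parameter $-, which lists the option flags of the current shell; an H in that list means history expansion is switched on. The function name histexpand_state is made up for the example.

```shell
#!/bin/bash
# Report whether history expansion is active in the shell running this code.
histexpand_state() {
  case $- in
    *H*) echo on ;;
    *)   echo off ;;
  esac
}

echo "history expansion: $(histexpand_state)"

# With expansion off, a ! is just an ordinary character.
printf '%s\n' 's! 1:N:0:ACAGTG!/1!g'
```

Run as a script this should report "off", whereas typing the case statement at an interactive bash prompt would normally report "on" unless set +H has been issued. One caveat, stated as general shell behaviour rather than anything confirmed for this machine: csh-family shells such as tcsh perform history substitution even inside single quotes, so there a literal ! needs a backslash in front of it, or a different sed delimiter.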

Can you post the name of the system and the shell you are logged in with? And how do you log in (ssh, telnet ...)? This is definitely not a 'sed' issue.

Lokelo · 11-23-2011, 06:15 PM

That seems to have done the trick. Thanks for all your help.
I agree, I like the look of commas. I will make sure that I acknowledge this forum in my thesis for your continued help!

Just to be complete:

Code:

/bin/tcsh
Linux login 2.6.32-131.17.1.el6.x86_64 #1 SMP Thu Sep 29 10:24:25 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 6.1 (Santiago)


GNU sed version 4.2.1
Copyright (C) 2009 Free Software Foundation, Inc.