extract substring using sed and regular expressions (regexp)

syg00 · 12-21-2009, 05:50 PM

As I said above, grep is the better option. Otherwise, see the sed manpage: use "sed -n ..." and print only the lines that match.

ghostdog74 · 12-21-2009, 06:29 PM

Quote:

Originally Posted by warrentaylor

I am having the same problem... sort of. I want to extract a combination of characters if it exists. If it doesn't exist, I want nothing. My problem is that if my pattern doesn't exist, I get the whole line returned.

if I have .....AA9999999999999999....., I want AA9999999999999999
if I have ............................, I want nothing.

where AA9999999999999999 is 2 capital alphas followed by 16 numerics.

I use 's/.*\(AA[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]\).*/\1/'

because \{16\} as a repeater doesn't work.

Show examples of your data and exactly what you want to get.

warrentaylor · 12-22-2009, 08:54 AM

For the purpose of the question, these are examples of both the data and the regex. I will go off and try 'print' and maybe grep, but I haven't yet found how to extract data using grep.

ghostdog74 · 12-22-2009, 09:09 AM

I mean show a better sample input file, and show clearly the output you want.

warrentaylor · 12-22-2009, 10:31 AM

for ANY string

.......AA9999999999999999......

where '.' represents any character, I want to extract only the characters that fit the pattern AA followed by 16 numerics. Any digits in these positions are a match, and the pattern could exist anywhere in the line. If this 'exact' pattern is not found, then output nothing.

the above suggested solution actually worked for me:

'grep -o --only-matching "ZR[0-9]\{16\}"'

and I have what I want. So, thanks much.

Sorry for the confusion on the example.
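
For anyone landing here later, a minimal sketch of that grep approach, assuming GNU grep (the -o flag is not in POSIX) and the AA prefix from the example. The function name extract_id is made up for illustration:

```shell
# extract_id is a hypothetical helper name, not something from grep itself.
# It prints every "AA followed by 16 digits" match found on stdin, one per
# line, and prints nothing at all for lines that contain no match.
extract_id() {
    grep -o 'AA[0-9]\{16\}'
}

# The first line contains a match, the second does not.
printf '%s\n' 'xxxxAA1234567890123456yyyy' 'no id on this line' | extract_id
# prints: AA1234567890123456
```

Note that grep exits with status 1 when no line matches, which matters if the script runs under "set -e".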

David the H. · 12-22-2009, 10:41 AM

Quote:

Originally Posted by warrentaylor

if I have .....AA9999999999999999....., I want AA9999999999999999
if I have ............................, I want nothing.

where AA9999999999999999 is 2 capital alphas followed by 16 numerics.

I use 's/.*\(AA[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]\).*/\1/'

because \{16\} as a repeater doesn't work.

Repeaters work in basic regexps too, but there the braces have to be escaped, \{16\}, as in the grep command above. With extended regexps, meaning "sed -r" or "grep -E"/egrep, you write {16} unescaped. On the plus side, this also means you don't have to escape the parentheses.

Also, it's recommended to use the POSIX character classes for the standard ranges of characters. Either of the following should work:

Code:

sed -rn 's/.*(AA[[:digit:]]{16}).*/\1/p'

egrep -o 'AA[[:digit:]]{16}'

Note that there are a couple of weaknesses in the above, though they may or may not be a concern for you. First, it will also match digit strings longer than 16, but only print the first 16 digits. Second, it will match any combination of numerals, meaning something like AA1234567890123456 will also match. I'm not sure what you'd need to do if you need to isolate only a single repeating digit.
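
To make the two weaknesses concrete, here is a rough sketch, not a definitive fix. It assumes GNU sed and GNU grep ("sed -E" is the more portable spelling of "sed -r"), and the function names loose, strict and same_digit are invented for this example. The strict variant rejects a 17th digit; the same_digit variant is one possible way to require a single repeated digit, using a basic-regexp backreference:

```shell
# Hypothetical helper names; only the regexps matter.

# loose: the sed command from above. A longer digit run still matches,
# and only the first 16 digits after AA are printed.
loose() {
    sed -En 's/.*(AA[[:digit:]]{16}).*/\1/p'
}

# strict: the 16 digits must be followed by a non-digit or by end of line,
# so AA followed by 17 or more digits prints nothing.
strict() {
    sed -En 's/.*(AA[[:digit:]]{16})([^[:digit:]].*)?$/\1/p'
}

# same_digit: \( \) captures the first digit and \1\{15\} demands the same
# digit 15 more times, so AA1234567890123456 is not matched.
same_digit() {
    grep -o 'AA\([0-9]\)\1\{15\}'
}

printf '%s\n' 'xxAA12345678901234567yy' | loose       # prints: AA1234567890123456
printf '%s\n' 'xxAA12345678901234567yy' | strict      # prints nothing
printf '%s\n' 'xxAA9999999999999999yy'  | same_digit  # prints: AA9999999999999999
```

Whether the strict behaviour is wanted depends on the data; if a longer digit run should still yield its first 16 digits, the original commands are fine as they are.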