sed and regexp matching (GNU sed version 4.2.1)

Ashkhan · 02-25-2012, 02:24 PM

I would like to extract a number from a string using sed and backreferencing.

Let's say:

Code:

i='something_1234.txt'
echo $i |sed 's/.*\([0-9]\+\).*/\1/'

The number can have a variable number of digits: 1, 12, 123, 1234, ...
Unfortunately, sed just ignores the + modifier. I also tried \{1,\} instead but it doesn't work either...

sycamorex · 02-25-2012, 02:48 PM

Does it have to be back-referencing? I think a quicker option would be:

Code:

sed 's/[^0-9]*//g'

danielbmartin · 02-25-2012, 03:01 PM

Quote:

Originally Posted by sycamorex

Code:

sed 's/[^0-9]*//g'

OP specifies "a number." Suppose his input line contains several numbers.

Code:

echo 'something_1q2r3s4.txt' |sed 's/[^0-9]*//g'

... produces ...

Code:

1234

Daniel B. Martin

sycamorex · 02-25-2012, 03:06 PM

Quote:

Originally Posted by danielbmartin

OP specifies "a number." Suppose his input line contains several numbers.

Code:

echo 'something_1q2r3s4.txt' |sed 's/[^0-9]*//g'

... produces ...

Code:

1234

Daniel B. Martin

Unless the OP defines his problem in a clear and definitive way, that's the best I/we can do. The way the OP formulated the problem suggests that it's a single "number" not containing non-numerical characters.

danielbmartin · 02-25-2012, 03:20 PM

Quote:

Originally Posted by sycamorex

The way the OP formulated the problem suggests that it's a single "number" not containing non-numerical characters.

You're right.

Reading his sed made me think his intended question was "Reading left-to-right, let me capture the first numeric string."

Daniel B. Martin

millgates · 02-25-2012, 05:17 PM

Quote:

Originally Posted by Ashkhan

Unfortunately, sed just ignores the + modifier. I also tried \{1,\} instead but it doesn't work either...

No, sed does not ignore the + modifier. The problem is in your regex logic:

Code:

.*\([0-9]\+\).*

You need to realize that the * in sed is "greedy": sed reads the pattern from left to right, and each * matches as many characters as possible while still allowing the rest of the regex to match the line. More specifically:

The first thing sed sees in your regex is the left .*. It tries to match as many characters as possible such that the rest of the regex can still match the rest of the line. Therefore, in "something_1234.txt" the left .* matches "something_123": that still leaves one digit for the [0-9]\+ expression, and the right .* does not need any characters at all to match. Only then does sed continue with [0-9]\+, which at this point can match only the last digit, because the first three have already been "eaten" by the first .*. The right .* then takes the remaining ".txt". Therefore your sed command will output

Code:

$ echo something_1234.txt|sed 's/.*\([0-9]\+\).*/\1/'
4
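
To see exactly where the line gets split, one can capture the left .* as well and print both groups. This is only an illustrative sketch: the bracketed output format and the variable name are made up here, and \{1,\} is used as the portable spelling of \+.

```shell
i='something_1234.txt'

# Group 1 is the greedy left .*, group 2 is the digit run [0-9]\{1,\}.
split=$(echo "$i" | sed 's/\(.*\)\([0-9]\{1,\}\).*/[\1][\2]/')
echo "$split"    # [something_123][4]
```

The first bracket shows that the left .* has already taken "123", which leaves only "4" for the capture group.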

To fix this, you must replace the first .* with something that will not be allowed to eat the digits:

Code:

sed 's/[^0-9]*\([0-9]\+\).*/\1/'

or, for the sake of whoever is going to maintain the code, using the -r option (extended regular expressions, which need fewer backslashes):

Code:

sed -r 's/[^0-9]*([0-9]+).*/\1/'
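
A small sketch of how the corrected pattern behaves on a few inputs. The helper name extract_first_number and the sample file names are hypothetical, and the -n option with the p flag is an addition here so that lines without any digit print nothing instead of passing through unchanged.

```shell
# extract_first_number is a hypothetical helper name, not part of sed.
extract_first_number() {
    # -n plus the p flag prints only lines where the substitution succeeded,
    # so input without any digit yields no output instead of the unchanged line.
    printf '%s\n' "$1" | sed -n 's/[^0-9]*\([0-9]\{1,\}\).*/\1/p'
}

a=$(extract_first_number 'something_1234.txt')
b=$(extract_first_number '1234_something_5678.txt')
c=$(extract_first_number 'no_digits_here.txt')
echo "a=$a b=$b c=$c"    # a=1234 b=1234 c=
```

Note that with several numbers on a line only the first run of digits is returned.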

If you're fine with just removing everything that's not a digit, I would go with the fine solution mentioned by sycamorex.

Ashkhan · 02-26-2012, 05:14 AM

Quote:

Originally Posted by sycamorex

Does it have to be back-referencing? I think a quicker option would be:

Code:

sed 's/[^0-9]*//g'

Thanks guys for your help.

That regexp suggested by sycamorex is perfectly fine. I tend to overdo my regexps because I don't use them very often.

And thanks for the explanation about greediness, millgates.

danielbmartin · 02-26-2012, 02:55 PM

Quote:

Originally Posted by sycamorex

Code:

sed 's/[^0-9]*//g'

If I understand this sed correctly, it discards all non-numerics. That, apparently, is what OP desires. I'll offer another way to accomplish the same transformation.

Code:

tr -dc '0-9'

This method is easier to read (imho).
d and c are options for tr (translate).
"d" says "discard".
"c" says "complement".
so tr -dc '0-9' says "discard all characters other than 0 through 9."

Now you might run this tr against a file and want to preserve the NewLine characters. In that case, use

Code:

tr -dc '\n0-9'
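
The difference between the two tr commands is easiest to see on multi-line input. A minimal sketch, with made-up sample lines and variable names:

```shell
# Two sample lines; the contents are made up for this illustration.
input=$(printf 'file_12.txt\nfile_345.txt')

# Without \n in the set, the line breaks are deleted too and the digits run together.
joined=$(printf '%s\n' "$input" | tr -dc '0-9')
echo "$joined"      # 12345

# Keeping \n in the set preserves one number per line.
per_line=$(printf '%s\n' "$input" | tr -dc '\n0-9')
echo "$per_line"
```

The second echo prints 12 and 345 on separate lines.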

A casual timing measurement with a large file shows the tr runs twice as fast as the sed.

Daniel B. Martin

David the H. · 02-27-2012, 09:12 AM

Of course, the solutions by sycamorex and danielbmartin will only work correctly if there's a single set of digits in the string, as they simply delete anything that isn't a number. A string like "1234_something_1234.txt" would end up as "12341234".

But assuming that's ok, then you don't even need to use an external tool. As long as the string is already in a variable, just use simple parameter substitution.

Code:

i='something_1234.txt'
echo "${i//[^0-9]}"

And if you need to be more careful about it:

Code:

i='something_1234.txt'
x=${i%.*}
x=${x##*_}
echo "$x"

These should run faster than any solution using external applications.
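
If the name does not always follow the something_NUMBER.ext layout, the first run of digits can also be picked out with parameter expansion alone. This is a rough sketch, assuming a POSIX-style shell; first_number is a hypothetical helper name and the sample strings are made up.

```shell
# The function body runs in a subshell ( ... ) so s and rest do not leak out.
first_number() (
    s=$1
    # ${s%%[0-9]*} is the text before the first digit; strip it from the front.
    rest=${s#"${s%%[0-9]*}"}
    # Cut everything from the first non-digit onward.
    printf '%s\n' "${rest%%[!0-9]*}"
)

n1=$(first_number 'something_1234.txt')
n2=$(first_number '1234_something_5678.txt')
n3=$(first_number 'no_digits_here')
echo "n1=$n1 n2=$n2 n3=$n3"    # n1=1234 n2=1234 n3=
```

Unlike the delete-everything-else approach, this returns only the first number when the string contains several.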

See here for plenty more string manipulations.