LinuxQuestions.org


Ashkhan 02-25-2012 02:24 PM

sed and regexp matching (GNU sed version 4.2.1)
 
I would like to extract a number from a string using sed and backreferencing.

Let's say:

Code:

i='something_1234.txt'
echo "$i" | sed 's/.*\([0-9]\+\).*/\1/'

There can be a variable number of digits: 1, 12, 123, 1234, ...
Unfortunately, sed just ignores the + modifier. I also tried \{1,\} instead, but that doesn't work either...

sycamorex 02-25-2012 02:48 PM

Does it have to be back-referencing? I think a quicker option would be:
Code:

sed 's/[^0-9]*//g'

danielbmartin 02-25-2012 03:01 PM

Quote:

Originally Posted by sycamorex (Post 4611918)
Code:

sed 's/[^0-9]*//g'

OP specifies "a number." Suppose his input line contains several numbers.

Code:

echo 'something_1q2r3s4.txt' |sed 's/[^0-9]*//g'
... produces ...
Code:

1234
Daniel B. Martin

sycamorex 02-25-2012 03:06 PM

Quote:

Originally Posted by danielbmartin (Post 4611926)
OP specifies "a number." Suppose his input line contains several numbers.

Code:

echo 'something_1q2r3s4.txt' |sed 's/[^0-9]*//g'
... produces ...
Code:

1234
Daniel B. Martin

Unless the OP defines his problem in a clear and definitive way, that's the best I/we can do. The way the OP formulated the problem suggests that it's a single "number" not containing non-numerical characters.

danielbmartin 02-25-2012 03:20 PM

Quote:

Originally Posted by sycamorex (Post 4611928)
The way the OP formulated the problem suggests that it's a single "number" not containing non-numerical characters.

You're right.

Reading his sed made me think his intended question was "Reading left-to-right, let me capture the first numeric string."

Daniel B. Martin

millgates 02-25-2012 05:17 PM

Quote:

Originally Posted by Ashkhan (Post 4611908)
Unfortunately, sed just ignores the + modifier. I also tried \{1,\} instead, but that doesn't work either...

No, sed does not ignore the + modifier. The problem is in your regex logic:

Code:

.*\([0-9]\+\).*
You need to realize that the * in sed is "greedy". This means that sed reads the pattern from left to right, and each * matches as many characters as possible while still allowing the whole regex to match the line. More specifically:

The first thing sed sees in your regex is the left .*. It will try to match as many characters as possible while still letting the rest of the regex match the rest of the line. Therefore, the left .* will match "something_123", because that leaves one digit to satisfy [0-9]\+, and the right .* can then take ".txt" (it does not even need any characters to match). Only then will sed continue with [0-9]\+, which at this point can only match the last digit, because the first three have already been "eaten" by the first .*. Therefore your sed command will output
Code:

$ echo something_1234.txt|sed 's/.*\([0-9]\+\).*/\1/'
4
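
One way to see the greediness at work is to capture all three parts of the pattern and print each in brackets. This is only an illustrative sketch: the show_groups name and the bracket layout are made up for this example, and \+ is a GNU sed extension.

```shell
# Hypothetical helper: print what each part of the OP's regex matched.
# Group 1 is the left .*, group 2 is [0-9]\+, group 3 is the right .*
show_groups() {
    printf '%s\n' "$1" | sed 's/\(.*\)\([0-9]\+\)\(.*\)/[\1] [\2] [\3]/'
}

show_groups 'something_1234.txt'
# prints: [something_123] [4] [.txt]
```

The first bracket shows that the left .* has swallowed "something_123", leaving only the final digit for the capture group.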

To fix this, you must replace the first .* with something that will not be allowed to eat the digits:

Code:

sed 's/[^0-9]*\([0-9]\+\).*/\1/'
Or, for the sake of whoever is going to maintain the code, use the -r option (extended regular expressions), which removes the need for most of the backslashes:

Code:

sed -r 's/[^0-9]*([0-9]+).*/\1/'
If you're fine with just removing everything that's not a digit, I would go with the fine solution mentioned by sycamorex.
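
The corrected pattern can be wrapped in a small function to check how it behaves on other inputs. This is a minimal sketch assuming GNU sed; first_number is a name invented for this example. The -n option and the p flag are an extra not discussed above: together they print a line only when the substitution succeeded, so input without any digits produces no output instead of being echoed back unchanged.

```shell
# Hypothetical helper: print the first run of digits in the argument.
first_number() {
    printf '%s\n' "$1" | sed -n 's/[^0-9]*\([0-9]\+\).*/\1/p'
}

first_number 'something_1234.txt'      # prints 1234
first_number 'something_1q2r3s4.txt'   # prints 1
first_number 'no_digits_here.txt'      # prints nothing
```

Unlike the delete-all-non-digits approach, this stops at the first numeric string, so 'something_1q2r3s4.txt' gives 1 rather than 1234.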

Ashkhan 02-26-2012 05:14 AM

Quote:

Originally Posted by sycamorex (Post 4611918)
Does it have to be back-referencing? I think a quicker option would be:
Code:

sed 's/[^0-9]*//g'

Thanks guys for your help.

That regexp suggested by sycamorex is perfectly fine. I tend to overdo my regexps because I don't use them very often. :)

And thanks for the explanation about greediness, millgates.

danielbmartin 02-26-2012 02:55 PM

Quote:

Originally Posted by sycamorex (Post 4611918)
Code:

sed 's/[^0-9]*//g'

If I understand this sed correctly, it discards all non-numerics. That, apparently, is what the OP desires. I'll offer another way to accomplish the same transformation.
Code:

tr -dc '0-9'
This method is easier to read (imho).
d and c are options to tr (translate).
"d" says "delete".
"c" says "complement".
So tr -dc '0-9' says "delete all characters other than 0 through 9."

Now you might run this tr against a file and want to preserve the newline characters. In that case, use
Code:

tr -dc '\n0-9'
A casual timing measurement with a large file showed tr running about twice as fast as sed.
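
The difference between the two tr invocations is easiest to see on multi-line input. A small sketch follows; the sample function and its two file names are invented for illustration.

```shell
# Hypothetical two-line sample input.
sample() {
    printf 'file_12.txt\nlog_345.txt\n'
}

# Without \n in the set, the newlines are deleted too,
# so the digits from both lines run together: 12345
sample | tr -dc '0-9'
echo    # tr removed the final newline as well, so add one back

# With \n kept, each line's digits stay on their own line: 12, then 345
sample | tr -dc '\n0-9'
```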

Daniel B. Martin

David the H. 02-27-2012 09:12 AM

Of course, the solutions by sycamorex and danielbmartin will only work correctly if there's a single set of digits in the string, as they simply delete anything that isn't a number. A string like "1234_something_1234.txt" would end up as "12341234".

But assuming that's OK, then you don't even need to use an external tool. As long as the string is already in a variable, just use simple parameter substitution (this form needs bash or a similar shell; it is not plain POSIX sh).

Code:

i='something_1234.txt'
echo "${i//[^0-9]}"

And if you need to be more careful about it:

Code:

i='something_1234.txt'
x=${i%.*}
x=${x##*_}
echo "$x"

These should run faster than any solution using external applications.
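
To make the difference between the two approaches concrete, here they are side by side on the problematic string mentioned earlier. This is a sketch that assumes bash; the all_digits variable name is introduced only for this example, and the second approach assumes the number sits between the last underscore and the last dot.

```shell
# Requires bash: the ${var//pattern} form is not POSIX sh.
i='1234_something_1234.txt'

# Approach 1: delete every non-digit character.
all_digits=${i//[^0-9]}

# Approach 2: trim by position instead.
x=${i%.*}     # drop the shortest trailing ".something" -> 1234_something_1234
x=${x##*_}    # drop the longest leading "something_"   -> 1234

echo "$all_digits"   # prints 12341234
echo "$x"            # prints 1234
```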

See here for plenty more string manipulations.


All times are GMT -5.