LinuxQuestions.org
Old 02-25-2012, 02:24 PM   #1
Ashkhan
Member
 
Registered: Oct 2003
Distribution: Debian, Ubuntu, RHEL, CentOS, MacOS
Posts: 39

Rep: Reputation: Disabled
Question sed and regexp matching (GNU sed version 4.2.1)


I would like to extract a number from a string using sed and backreferencing.

Let's say:

Code:
i='something_1234.txt'
echo $i |sed 's/.*\([0-9]\+\).*/\1/'
The number of digits can vary: 1, 12, 123, 1234, ...
Unfortunately, sed seems to just ignore the + modifier. I also tried \{1,\} instead, but that doesn't work either...
 
Old 02-25-2012, 02:48 PM   #2
sycamorex
LQ Veteran
 
Registered: Nov 2005
Location: London
Distribution: Slackware64-current
Posts: 5,836
Blog Entries: 1

Rep: Reputation: 1251
Does it have to be back-referencing? I think a quicker option would be:
Code:
sed 's/[^0-9]*//g'
 
2 members found this post helpful.
Old 02-25-2012, 03:01 PM   #3
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Rep: Reputation: 660
Quote:
Originally Posted by sycamorex View Post
Code:
sed 's/[^0-9]*//g'
OP specifies "a number." Suppose his input line contains several numbers.

Code:
echo 'something_1q2r3s4.txt' |sed 's/[^0-9]*//g'
... produces ...
Code:
1234
Daniel B. Martin
 
Old 02-25-2012, 03:06 PM   #4
sycamorex
LQ Veteran
 
Registered: Nov 2005
Location: London
Distribution: Slackware64-current
Posts: 5,836
Blog Entries: 1

Rep: Reputation: 1251
Quote:
Originally Posted by danielbmartin View Post
OP specifies "a number." Suppose his input line contains several numbers.

Code:
echo 'something_1q2r3s4.txt' |sed 's/[^0-9]*//g'
... produces ...
Code:
1234
Daniel B. Martin
Unless the OP defines his problem in a clear and definitive way, that's the best I/we can do. The way the OP formulated the problem suggests that it's a single "number" not containing non-numerical characters.
 
1 members found this post helpful.
Old 02-25-2012, 03:20 PM   #5
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Rep: Reputation: 660
Quote:
Originally Posted by sycamorex View Post
The way the OP formulated the problem suggests that it's a single "number" not containing non-numerical characters.
You're right.

Reading his sed made me think his intended question was "Reading left-to-right, let me capture the first numeric string."

Daniel B. Martin
 
Old 02-25-2012, 05:17 PM   #6
millgates
Member
 
Registered: Feb 2009
Location: 192.168.x.x
Distribution: Slackware
Posts: 852

Rep: Reputation: 389
Quote:
Originally Posted by Ashkhan View Post
Unfortunately, sed seems to just ignore the + modifier. I also tried \{1,\} instead, but that doesn't work either...
No, sed does not ignore the + modifier. The problem is in your regex logic:

Code:
.*\([0-9]\+\).*
You need to realize that the * in sed is "greedy": sed reads the pattern from left to right, and each * matches as many characters as possible while still allowing the rest of the regex to match the line. More specifically:

The first thing sed sees in your regex is the left .*. It tries to match as many characters as possible while leaving enough for the rest of the regex to match the rest of the line. Therefore the left .* matches "something_123", because that still leaves one digit for the [0-9]\+ expression (the right .* does not even need any characters to match). Only then does sed continue with [0-9]\+, which at this point can only match the last digit, because the first three were already "eaten" by the first .*. Therefore your sed command will output
Code:
$ echo something_1234.txt|sed 's/.*\([0-9]\+\).*/\1/'
4
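To make the greedy split visible, here is a small sketch that captures all three parts of the pattern instead of just the digits. The function name show_split is made up for illustration, and it uses -E (which current GNU sed accepts as a synonym for -r) so the groups need no backslashes:

```shell
#!/usr/bin/env bash
# show_split is a hypothetical helper, not something from the thread.
# It wraps each part of  .*[0-9]+.*  in its own group so we can see
# which characters each part ended up matching.
show_split() {
    printf '%s\n' "$1" |
        sed -E 's/(.*)([0-9]+)(.*)/left=[\1] digits=[\2] right=[\3]/'
}

show_split 'something_1234.txt'
# left=[something_123] digits=[4] right=[.txt]

show_split '1234_something_56.txt'
# left=[1234_something_5] digits=[6] right=[.txt]
```

In both cases the left .* swallows everything up to the last digit on the line, which is why only a single digit is left for the capture group.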
To fix this, you must replace the first .* with something that will not be allowed to eat the digits:

Code:
sed 's/[^0-9]*\([0-9]\+\).*/\1/'
or, for the sake of whoever is going to maintain the code, using the -r option:

Code:
sed -r 's/[^0-9]*([0-9]+).*/\1/'
If you're fine with just removing everything that's not a digit, I would go with the fine solution mentioned by sycamorex.
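As a rough sketch of how the corrected expression could be wrapped for reuse (first_number is a made-up name, and the -n ... p combination is my own addition so that input without any digits prints nothing instead of the unchanged line):

```shell
#!/usr/bin/env bash
# first_number is a hypothetical wrapper around the corrected regex.
# [^0-9]* cannot eat digits, so the group captures the whole first run.
# -n plus the p flag prints only lines where the substitution succeeded.
first_number() {
    printf '%s\n' "$1" | sed -n -E 's/[^0-9]*([0-9]+).*/\1/p'
}

first_number 'something_1234.txt'        # 1234
first_number '1234_something_5678.txt'   # 1234
first_number 'no_digits_here.txt'        # (prints nothing)
```

Unlike deleting every non-digit, this keeps only the first run of digits when the string contains several.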

Last edited by millgates; 02-25-2012 at 05:20 PM.
 
2 members found this post helpful.
Old 02-26-2012, 05:14 AM   #7
Ashkhan
Member
 
Registered: Oct 2003
Distribution: Debian, Ubuntu, RHEL, CentOS, MacOS
Posts: 39

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by sycamorex View Post
Does it have to be back-referencing? I think a quicker option would be:
Code:
sed 's/[^0-9]*//g'
Thanks guys for your help.

That regexp suggested by sycamorex is perfectly fine. I tend to overdo my regexps because I don't use them very often.

And thanks for the explanation about greediness, millgates.
 
1 members found this post helpful.
Old 02-26-2012, 02:55 PM   #8
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Rep: Reputation: 660
Quote:
Originally Posted by sycamorex View Post
Code:
sed 's/[^0-9]*//g'
If I understand this sed command correctly, it discards all non-numeric characters. That, apparently, is what OP desires. I'll offer another way to accomplish the same transformation.
Code:
tr -dc '0-9'
This method is easier to read (imho).
-d and -c are options to tr (translate).
"d" says "delete".
"c" says "complement".
So tr -dc '0-9' says "delete all characters other than 0 through 9."

Now you might run this tr against a file and want to preserve the newline characters. In that case, use
Code:
tr -dc '\n0-9'
A casual timing measurement with a large file shows that tr runs twice as fast as sed.
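A quick sketch of the two tr variants side by side, in case it helps to see the newline difference. The function names are invented for this example:

```shell
#!/usr/bin/env bash
# Hypothetical helpers wrapping the two tr invocations above.

# Deletes everything except digits, including the newlines, so
# multi-line input collapses into one unterminated string.
digits_only() {
    tr -dc '0-9'
}

# Keeps newlines as well, so each input line stays on its own line.
digits_per_line() {
    tr -dc '\n0-9'
}

printf 'something_1234.txt\n' | digits_only; echo
# 1234

printf 'file_12.txt\nfile_345.txt\n' | digits_per_line
# 12
# 345
```

The extra echo after digits_only is only there because tr has removed the trailing newline along with everything else.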

Daniel B. Martin

Last edited by danielbmartin; 02-26-2012 at 04:24 PM. Reason: Correct punctuation, cover NewLine case
 
Old 02-27-2012, 09:12 AM   #9
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852

Rep: Reputation: 2037
Of course, the solutions by sycamorex and danielbmartin will only work correctly if there's a single set of digits in the string, as they simply delete anything that isn't a number. A string like "1234_something_1234.txt" would end up as "12341234".

But assuming that's ok, then you don't even need to use an external tool. As long as the string is already in a variable, just use simple parameter expansion.

Code:
i='something_1234.txt'
echo "${i//[^0-9]}"
And if you need to be more careful about it:

Code:
i='something_1234.txt'
x=${i%.*}
x=${x##*_}
echo "$x"
These should run faster than any solution using external applications.
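For what it's worth, here is a sketch of how these expansions might be wrapped in functions, plus one way to pull out only the first run of digits without calling an external program. The function names are invented, and first_digit_run is my own variation on the same idea rather than something posted above; it assumes bash:

```shell
#!/usr/bin/env bash
# Hypothetical helpers built only on parameter expansion.

# Delete every character that is not a digit (all digit runs get glued together).
digits_only() {
    printf '%s\n' "${1//[^0-9]/}"
}

# Return just the first run of digits, or an empty line if there is none.
first_digit_run() {
    local s prefix rest
    s=$1
    prefix=${s%%[0-9]*}      # everything before the first digit
    rest=${s#"$prefix"}      # the string starting at the first digit
    printf '%s\n' "${rest%%[^0-9]*}"   # cut at the first non-digit
}

digits_only     '1234_something_5678.txt'   # 12345678
first_digit_run '1234_something_5678.txt'   # 1234
first_digit_run 'something_1234.txt'        # 1234
```

The second function avoids the "12341234" problem because it stops at the first non-digit after the first digit.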

See here for plenty more string manipulations.
 
3 members found this post helpful.
  


All times are GMT -5. The time now is 08:43 AM.
