LinuxQuestions.org
Old 05-07-2011, 06:11 PM   #1
PenguinJr
LQ Newbie
 
Registered: May 2011
Posts: 7

Rep: Reputation: 0
Question Extract a substring using regular expression with SED


Hello,

I've spent most of the evening browsing the web, trying many things I've found on various forums, but nothing seems to work.

Here is my problem: I have a test.txt file containing many lines like the following:

...
<insert_random_text>228.00 &euro;<insert_more_random_text>
<insert_random_text>17.50 &euro;<insert_more_random_text>
<insert_random_text>1238.13 &euro;<insert_more_random_text>
...

And I want to extract :

...
228.00
17.50
1238.13
...

There is always one occurrence of &euro; in each line. I want the numeric value that precedes this &euro; occurrence. The random text (before and after) may contain numbers too, so the &euro; may be important to parse, in order to correctly identify the number to return. The last character that precedes the number to extract is always a ">" (coming from an HTML tag).

Thanks for your help!
If you give a solution, could you please explain the syntax you use in detail?

Last edited by PenguinJr; 05-07-2011 at 06:27 PM.
 
Old 05-08-2011, 12:13 AM   #2
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852

Rep: Reputation: 2037
Not a difficult job. Just match and extract everything that comes between ">" and " &euro".

The only other consideration is working around the "/" characters that are common in html, which is easy to do simply by changing the separator character sed uses.

Code:
sed -rn '\|euro| s|.*>([0-9.]+) &euro.*|\1|p'

-r		:turn on extended regex.
-n		:don't print every line.

\|euro|		:match only lines containing "euro".  The address
		:pattern traditionally uses /string/, but you can
		:change it to a different character by preceding
		:it with a backslash.

s|x|y|		:the standard sed substitution pattern.  Again, it's 
		:traditionally s/x/y/, but any basic ascii character
		:can be used.

.*>		:a string of any kind of character, ending with ">".

(..)		:designates the part of the match to be captured.

[0-9.]+		:a string of digits and/or periods of any length
		:(but at least one).

 &euro.*	:followed by [space]&euro, and the rest of the line.

\1		:insert the captured part into the output string.

p		:print the results.
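
A quick way to try the command without a file, as a rough sketch: the sample lines are made up (including the `</td>` tags, which show why a separator other than "/" is convenient), the function name `extract_prices` is only for illustration, and `-E` is used in place of `-r` because recent GNU sed and BSD sed both accept it.

```shell
# Made-up sample input: the price always sits between ">" and " &euro".
sample='<td class="p">228.00 &euro;</td>
<p>no price on this line</p>
<td id="row99">17.50 &euro;</td>'

# Same expression as above, with -E instead of -r for extended regex.
extract_prices() {
    sed -En '\|euro| s|.*>([0-9.]+) &euro.*|\1|p'
}

# Prints 228.00 and 17.50, one per line; the middle line is skipped.
printf '%s\n' "$sample" | extract_prices
```

The greedy `.*>` backs up to the last ">" that is directly followed by digits and " &euro", so the digits in `row99` are ignored.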

Last edited by David the H.; 05-08-2011 at 12:18 AM. Reason: fixed an oops
 
Old 05-08-2011, 12:40 AM   #3
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,007

Rep: Reputation: 3192
Another alternative:
Code:
sed -rn '/euro/s/^[^0-9]*|[^0-9]*$//gp' file
Or maybe easier with awk:
Code:
awk -F"[ >]" '/euro/{print $2}' file
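
Both of these lean on the shape of the sample lines, so it may be worth checking them against a line where the surrounding text contains a digit or a space (the original question says it may contain numbers). A small sketch, assuming a hypothetical `<div id="x2">` line and illustrative function names:

```shell
# Two test lines: the literal sample from the question, and a made-up
# line whose leading tag contains a space and a digit.
plain='<insert_random_text>228.00 &euro;<insert_more_random_text>'
tagged='<div id="x2">228.00 &euro;</div>'

# The two commands from above, reading standard input instead of a file.
strip_with_sed() { sed -En '/euro/s/^[^0-9]*|[^0-9]*$//gp'; }
split_with_awk() { awk -F'[ >]' '/euro/{print $2}'; }

for line in "$plain" "$tagged"; do
    printf 'sed: %s\n' "$(printf '%s\n' "$line" | strip_with_sed)"
    printf 'awk: %s\n' "$(printf '%s\n' "$line" | split_with_awk)"
done
```

On the plain line both print 228.00. On the tagged line the sed version prints `2">228.00` (it stops stripping at the first digit it meets) and the awk version prints `id="x2"` (the space inside the tag shifts the fields), so these are best suited to input that really looks like the sample.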
 
Old 05-08-2011, 02:17 AM   #4
SigTerm
Member
 
Registered: Dec 2009
Distribution: Slackware 12.2
Posts: 379

Rep: Reputation: 234
Quote:
Originally Posted by PenguinJr View Post
I have a test.txt file containing many lines like the following ones :

...
<insert_random_text>228.00 &euro;<insert_more_random_text>
<insert_random_text>17.50 &euro;<insert_more_random_text>
<insert_random_text>1238.13 &euro;<insert_more_random_text>
...

And I want to extract :

...
228.00
17.50
1238.13
...
Code:
sed -r "s/<[^<>]+>([0-9]+(\.[0-9]+){0,1})[^<>]*<[^<>]+>/\1/"< input.txt
where input.txt is source file.

Quote:
Originally Posted by PenguinJr View Post
If you give a solution, could you please explain in detail the syntax that you use ?
Sed tutorial
 
Old 05-08-2011, 03:46 AM   #5
PenguinJr
LQ Newbie
 
Registered: May 2011
Posts: 7

Original Poster
Rep: Reputation: 0
Thank you very much for all these answers, and especially for the syntax details!
I don't have much time right now to check all this, but I'll do it thoroughly later and tell you what works best, and what I don't understand (if anything).
I'll also post my own earlier attempt (which didn't work...) so that you can tell me, if possible, what I did wrong.
Oh, and thank you too for the impressive speed of your answers!
Cya later!
 
Old 05-09-2011, 03:36 PM   #6
markush
Senior Member
 
Registered: Apr 2007
Location: Germany
Distribution: Slackware
Posts: 3,979

Rep: Reputation: Disabled
Hello everyone,

inspired by this thread (and since I am once again reading "Mastering Regular Expressions" by Jeffrey Friedl) I tried to solve the problem with a Perl one-liner. Here it is:
Code:
perl -n -e 'm/((?:\d*)(?:\.\d{0,2}))(?:\s\&euro)/ && {print "$1\n"}' file
This works very well with PenguinJr's example, but I have a question. I expected my code to work for any possible currency notation with up to two decimal places, such as 34.89, .78, 344.2 and 60 (9.123 should not match). But my code doesn't match a number that has no decimal point. My example:
Code:
<insert_random_text>228.00 &euro;<insert_more_random_text>
<insert_random_text>17 &euro;<insert_more_random_text>
<insert_random_text>1238.13 &euro;<insert_more_random_text>
<insert_random_text>1238.137 &euro;<insert_more_random_text>
<insert_random_text>1238.1 &euro;<insert_more_random_text>
<insert_random_text>.12 &euro;<insert_more_random_text>
yields the output
Code:
228.00
1238.13
1238.1
.12
but I expected the number 17 to be matched and extracted as well. My understanding is that the expression (?:\.\d{0,2}) should match a decimal point and zero to two digits. So why doesn't it work this way?

Thanks in advance (and thanks to PenguinJr for the challenging problem)

Markus
 
Old 05-09-2011, 10:27 PM   #7
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,007

Rep: Reputation: 3192
Well, I will let you solve it yourself, Markus, but the question I ask back is: on the line that has 17, where is the decimal point? Remember that your pattern only says how many digits there may be.
 
Old 05-10-2011, 01:23 AM   #8
markush
Senior Member
 
Registered: Apr 2007
Location: Germany
Distribution: Slackware
Posts: 3,979

Rep: Reputation: Disabled
Hello grail,

thanks for the answer. After sleeping on it I found the solution: since the pattern (?:\.\d{0,2}) means "at least a decimal point...", it did not work as I expected. Now my problem is that when I change it to (?:\.?\d{0,2}), it also matches lines where the number has more than 2 decimal places, and the result is (with my example from above):
Code:
228.00
17
1238.13
137
1238.1
.12
I think I'll have to puzzle on this for a while.

Markus
 
Old 05-11-2011, 11:34 AM   #9
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852

Rep: Reputation: 2037
Quote:
Originally Posted by markush View Post
Hello grail,
Now my problem is, when changing to (?:\.?\d{0,2}) it matches numbers with more than 2 decimal places and the result is (with my example from above)
Code:
228.00
17
1238.13
137
1238.1
.12
I think I'll have to puzzle on this for a while.

Markus
Let's start by stripping off the (apparently perl-specific) "(?:)" brackets so we can look at the regex itself more clearly.

(BTW, I'm not very familiar with perl. What are they even there for? Everything appears to function fine without them.)

Code:
(\d*\.?\d{0,2})\s\&euro
The way I read it, it says "any number of digits, followed by an optional decimal, followed by zero to two digits, followed by \s&euro".

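What I believe is happening is this: the match isn't anchored, so when the attempt starting at "1238" fails (a third decimal digit can't satisfy "\.?\d{0,2}" followed by whitespace), the engine simply retries further along the string. Starting at "137", both optional parts can match nothing, so the regex behaves as if it were just "\d*\s&euro". In the string "1238.137 &euro" that means only "137 &euro" matches.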

The only reliable way I can find to work around this is to ensure that there's some kind of anchoring match at the beginning of the number string. This seems to do the job:
Code:
perl -n -e 'm/(?:[^\d.])((?:\d*)(?:(\.\d{0,2})?))(?:\s\&euro)/ && {print "$1\n"}'

#or without the cruft; appears to give identical results.

perl -n -e 'm/[^\d.](\d*(\.\d{0,2})?)\s\&euro/ && {print "$1\n"}'
Which is similar to what I was doing in sed up above, only I just used ">" as the beginning match, since the OP said that's what it would always be.

Notice how you can also make the entire "\.\d{0,2}" string optional. Not that it makes any difference here.

There's one small side effect with the above though, in that it won't match if there are two periods or a number+period in front of the string. ">..12 &euro;<" and ">0.12.25 &euro;<" won't match, for example.

Perhaps something better could be done with a look-ahead match of some kind, but I don't know enough about those yet to figure it out myself.
 
Old 05-11-2011, 12:47 PM   #10
markush
Senior Member
 
Registered: Apr 2007
Location: Germany
Distribution: Slackware
Posts: 3,979

Rep: Reputation: Disabled
Hello David the H,

thanks for the response. The (?:...) construct is one of Perl's extended regex features: the parentheses group the pattern without capturing the matched string in a variable $1, $2, ... This mainly keeps the numbering of the groups you do want tidy; it also spares the engine from storing captures nobody uses, which is only noticeable on very large input files. As I wrote, I'm reading the book "Mastering Regular Expressions" (I read it for the first time last year), which is very interesting, and I took this example just for fun.

Quote:
Perhaps something better could be done with a look-ahead match of some kind, but I don't know enough about those yet to figure it out myself.
This is indeed what I'm looking for, but I haven't yet read the complete chapter in the book.

Anyway, my intention was to change the question to "how can I extract valid currency notations out of a text file?". So I did not use the "<" and ">" characters. The problem I have is the string "1238.137 &euro", since I wanted to match only values with up to two decimal places, whereas in your example the last digit "7" is cut off. But Perl can handle lookahead and lookbehind, and I'm trying to find out how I can use them for this problem.

I'll post the solution when it's ready, thanks again for your effort.

Markus
 
  



Tags
perl, regular expression, sed


