[SOLVED] Extract a substring using regular expression with SED

PenguinJr · 05-07-2011, 06:11 PM

Hello,

I've spent most of the evening browsing the web, trying many things I've found on various forums, but nothing seems to work.

Here is my problem: I have a test.txt file containing many lines like the following:

...
<insert_random_text>228.00 &euro;<insert_more_random_text>
<insert_random_text>17.50 &euro;<insert_more_random_text>
<insert_random_text>1238.13 &euro;<insert_more_random_text>
...

And I want to extract :

...
228.00
17.50
1238.13
...

There is always one occurrence of &euro; in each line. I want the numeric value that precedes this &euro; occurrence. The random text (before and after) may contain numbers too, so the &euro; may be important to parse in order to correctly identify the number to return. The last character that precedes the number to extract is always a ">" (coming from an HTML tag).

Thanks for your help !
If you give a solution, could you please explain in detail the syntax that you use ?

David the H. · 05-08-2011, 12:13 AM

Not a difficult job. Just match and extract everything that comes between ">" and " &euro".

The only other consideration is working around the "/" characters that are common in html, which is easy to do simply by changing the separator character sed uses.

Code:

sed -rn '\|euro| s|.*>([0-9.]+) &euro.*|\1|p'

-r		:turn on extended regex.
-n		:don't print every line.

\|euro|		:match only lines containing "euro".  The address
		:pattern traditionally uses /string/, but you can
		:change it to a different character by preceding
		:it with a backslash.

s|x|y|		:the standard sed substitution pattern.  Again, it's 
		:traditionally s/x/y/, but any basic ascii character
		:can be used.

.*>		:a string of any kind of character, ending with ">".

(..)		:designates the part of the match to be captured.

[0-9.]+		:a string of digits and/or periods of any length
		:(but at least one).

 &euro.*	:followed by [space]&euro, and the rest of the line.

\1		:insert the captured part into the output string.

p		:print the results.
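
A quick way to check the command is to feed it a few lines by hand. The sketch below assumes GNU sed (where -E is the same as -r) and uses made-up sample text; the stray digits in the tags are only there to confirm that the number right before " &euro" is the one that gets picked.

```shell
# Hypothetical sample input: invented tags with stray digits around the price.
sample='<td class="c2">228.00 &euro;</td><td>ref 42</td>
<p>no price on this line</p>
<span id="x9">17.50 &euro;<br/>
<b>1238.13 &euro;</b>'

extract_prices() {
    # Same expression as in the post, reading from stdin instead of a file.
    sed -En '\|euro| s|.*>([0-9.]+) &euro.*|\1|p'
}

prices=$(printf '%s\n' "$sample" | extract_prices)
printf '%s\n' "$prices"
```

The line without "euro" is skipped by the address, and the greedy ".*>" backs up until the ">" it ends on is directly followed by the number and " &euro".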

grail · 05-08-2011, 12:40 AM

Another alternative:

Code:

sed -rn '/euro/s/^[^0-9]*|[^0-9]*$//gp' file

Or maybe easier with awk:

Code:

awk -F"[ >]" '/euro/{print $2}' file
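
Both of these lean on the exact shape of the sample lines. Since the question says the surrounding text may contain numbers, it may be worth trying them on one line of that kind as well. A minimal sketch, assuming GNU sed and two made-up test lines (the function names are only for illustration):

```shell
# Hypothetical lines: the first follows the sample layout exactly,
# the second has a space and a digit inside the leading tag.
plain='<a>228.00 &euro;<b>'
tricky='<td id="c2">228.00 &euro;</td>'

# The two commands from the post, reading from stdin.
sed_trim()  { sed -En '/euro/s/^[^0-9]*|[^0-9]*$//gp'; }
awk_field() { awk -F'[ >]' '/euro/{print $2}'; }

for line in "$plain" "$tricky"; do
    printf 'sed: %s\n' "$(printf '%s\n' "$line" | sed_trim)"
    printf 'awk: %s\n' "$(printf '%s\n' "$line" | awk_field)"
done
```

On the plain line both print 228.00. On the second line the sed version stops trimming at the first digit it meets, and the awk version counts fields from the first space, so neither returns just the price.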

SigTerm · 05-08-2011, 02:17 AM

Quote:

Originally Posted by PenguinJr

I have a test.txt file containing many lines like the following ones :

...
<insert_random_text>228.00 &euro;<insert_more_random_text>
<insert_random_text>17.50 &euro;<insert_more_random_text>
<insert_random_text>1238.13 &euro;<insert_more_random_text>
...

And I want to extract :

...
228.00
17.50
1238.13
...

Code:

sed -r "s/<[^<>]+>([0-9]+(\.[0-9]+){0,1})[^<>]*<[^<>]+>/\1/"< input.txt

where input.txt is source file.
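
Since a syntax explanation was asked for, here is one reading of that expression, piece by piece. It is a sketch assuming GNU sed (-E is the same as -r) and one sample line taken literally from the question; the variable names are only for illustration.

```shell
# One sample line in the layout from the question, plus a line without a price.
line='<insert_random_text>228.00 &euro;<insert_more_random_text>'
other='plain text, nothing to extract'

# <[^<>]+>                 one complete tag: "<", non-bracket characters, ">"
# ([0-9]+(\.[0-9]+){0,1})  group 1: digits, optionally followed by ".digits"
# [^<>]*                   whatever follows the number up to the next tag (" &euro;")
# <[^<>]+>                 the next complete tag
# \1                       replace the whole match with group 1
strip_tags() {
    sed -E 's/<[^<>]+>([0-9]+(\.[0-9]+){0,1})[^<>]*<[^<>]+>/\1/'
}

result=$(printf '%s\n' "$line" | strip_tags)
untouched=$(printf '%s\n' "$other" | strip_tags)
printf '%s\n%s\n' "$result" "$untouched"
```

Because there is no -n and no p flag, lines that do not match are printed unchanged, and anything outside the two matched tags stays on the line.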

Quote:

Originally Posted by PenguinJr

If you give a solution, could you please explain in detail the syntax that you use ?

Sed tutorial

PenguinJr · 05-08-2011, 03:46 AM

Thank you very much for all these answers, and especially for the syntax details !
I don't have much time right now to check all this, but I'll do it thoroughly later and tell you what works the best, and what I don't understand (if any).
I'll also post my own earlier attempt (which didn't work...) so that you can tell me, if possible, what I did wrong.

Oh and thank you too for the impressive quickness of your answers !
Cya later !

markush · 05-09-2011, 03:36 PM

Hello everyone,

inspired by this thread (and since I am once again reading "Mastering Regular Expressions" by Jeffrey Friedl) I tried to solve the problem with a Perl one-liner. Here it is:

Code:

perl -n -e 'm/((?:\d*)(?:\.\d{0,2}))(?:\s\&euro)/ && {print "$1\n"}' file

This works very well with PenguinJr's example, but I have a question. I expected my code to work for any possible currency pattern: 34.89, .78, 344.2 and 60, everything up to two decimal places (9.123 should not match). But my code doesn't match a number on its own. My example:

Code:

<insert_random_text>228.00 &euro;<insert_more_random_text>
<insert_random_text>17 &euro;<insert_more_random_text>
<insert_random_text>1238.13 &euro;<insert_more_random_text>
<insert_random_text>1238.137 &euro;<insert_more_random_text>
<insert_random_text>1238.1 &euro;<insert_more_random_text>
<insert_random_text>.12 &euro;<insert_more_random_text>

yields the output

Code:

but I expected the number 17 to be also matched and extracted. What I mean is the expression (?:\.\d{0,2}) should match a decimal-point and 0 up to 2 digits. But why doesn't it work this way?

Thanks in advance (and thanks to PenguinJr for the challenging problem)

Markus

grail · 05-09-2011, 10:27 PM

Well, I will let you solve it, Markus, but the question I ask you back is: on the line that has 17, where is the decimal point? Remember, you have said how many digits.

markush · 05-10-2011, 01:23 AM

Hello grail,

thanks for the answer, after sleeping on it I found the solution. Since the pattern (?:\.\d{0,2}) means "at least a decimal point..." it did not work as I expected. Now my problem is, when changing to (?:\.?\d{0,2}) it matches numbers with more than 2 decimal places and the result is (with my example from above)

Code:

I think I'll have to puzzle on this for a while.

Markus

David the H. · 05-11-2011, 11:34 AM

Quote:

Originally Posted by markush

Hello grail,
Now my problem is, when changing to (?:\.?\d{0,2}) it matches numbers with more than 2 decimal places and the result is (with my example from above)

Code:

I think I'll have to puzzle on this for a while.

Markus

Let's start by stripping off the (apparently perl-specific) "(?:)" brackets so we can look at the regex itself more clearly.

(BTW, I'm not very familiar with perl. What are they even there for? Everything appears to function fine without them.)

Code:

(\d*\.?\d{0,2})\s\&euro

The way I read it, it says "any number of digits, followed by an optional decimal, followed by zero to two digits, followed by \s&euro".

What I believe is happening is, since anything with more than two decimal places invalidates the "\.?\d{0,2}" part, then the regex behaves as if it's actually "\d*\s&euro". And in the string "1238.137 &euro" that means only "137 &euro" matches.

The only reliable way I can find to work around this is to ensure that there's some kind of anchoring match at the beginning of the number string. This seems to do the job:

Code:

perl -n -e 'm/(?:[^\d.])((?:\d*)(?:(\.\d{0,2})?))(?:\s\&euro)/ && {print "$1\n"}'

#or without the cruft; appears to give identical results.

perl -n -e 'm/[^\d.](\d*(\.\d{0,2})?)\s\&euro/ && {print "$1\n"}'

Which is similar to what I was doing in sed up above, only I just used ">" as the beginning match, since the OP said that's what it would always be.

Notice how you can also make the entire "\.\d{0,2}" string optional. Not that it makes any difference here.

There's one small side effect with the above though, in that it won't match if there are two periods or a number+period in front of the string. ">..12 &euro;<" and ">0.12.25 &euro;<" won't match, for example.

Perhaps something better could be done with a look-ahead match of some kind, but I don't know enough about those yet to figure it out myself.

markush · 05-11-2011, 12:47 PM

Hello David the H,

thanks for the response. The (?:...) construct is one of Perl's extended regex features: the parentheses group the pattern without capturing the matched string in a variable $1, $2, ... This is actually only useful if one has a very large input file, since the number of stored variables then decreases significantly. As I wrote, I'm reading the book "Mastering Regular Expressions" (I read it for the first time last year), which is very interesting, and I took this example just for fun.

Quote:

Perhaps something better could be done with a look-ahead match of some kind, but I don't know enough about those yet to figure it out myself.

This is indeed what I'm looking for, but I haven't yet read the complete chapter in the book.

Anyway, my intention was to change the question into "how can I extract valid currency notations out of a text file?". So I did not use the "<" and ">" characters. The problem I have is the string "1238.137 &euro", since I wanted to match only values with up to two decimal places, whereas in your example the last digit "7" is cut off. But Perl can handle lookahead and lookbehind, and I'm trying to find out how I can use them for this problem.

I'll post the solution when it's ready, thanks again for your effort.

Markus