[SOLVED] Extract a substring using regular expression with SED
There is always one occurrence of € in each line. I want the numeric value that precedes this € occurrence. The random text (before and after) may contain numbers too, so the € may be important to parse, in order to correctly identify the number to return. The last character that precedes the number to extract is always a ">" (coming from an HTML tag).
Thanks for your help!
If you give a solution, could you please explain the syntax you use in detail?
Not a difficult job. Just match and extract everything that comes between ">" and " &euro".
The only other consideration is working around the "/" characters that are common in html, which is easy to do simply by changing the separator character sed uses.
Code:
sed -rn '\|euro| s|.*>([0-9.]+) &euro.*|\1|p'
-r : turn on extended regex.
-n : don't print every line.
\|euro| : match only lines containing "euro". The address pattern traditionally uses /string/, but you can change the delimiter to a different character by preceding it with a backslash.
s|x|y| : the standard sed substitution pattern. Again, it's traditionally s/x/y/, but any other single character (except backslash or newline) can be used as the delimiter.
.*> : a string of any kind of character, ending with ">".
(..) : designates the part of the match to be captured.
[0-9.]+ : a string of digits and/or periods of any length (but at least one).
[space]&euro.* : followed by [space]&euro, and the rest of the line.
\1 : insert the captured part into the output string.
p : print the results.
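To see the command above in action, here is a minimal sketch. The sample HTML lines and the function name extract_price are made up for illustration (the original input is not shown in the thread), and -E is used as the portable spelling of GNU sed's -r.

```shell
#!/bin/sh
# extract_price is a hypothetical wrapper around the sed command above.
# -E enables extended regex (same as -r in GNU sed); -n suppresses default output.
extract_price() {
    sed -En '\|euro| s|.*>([0-9.]+) &euro.*|\1|p'
}

# Hypothetical sample input: other numbers appear before and after the price,
# and one line has no price at all.
printf '%s\n' \
    '<td class="p1">item 42</td><td>228.00 &euro;</td>' \
    '<span>no price on this line</span>' \
    '<b>3 for 2</b> <i>17 &euro;</i> ends 2011' \
    | extract_price
```

Only the lines containing "euro" are processed, and because the number must sit directly between a ">" and " &euro", the other digits on each line are ignored.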
Last edited by David the H.; 05-08-2011 at 12:18 AM.
Reason: fixed an oops
Thank you very much for all these answers, and especially for the syntax details!
I don't have much time right now to check all this, but I'll do it thoroughly later and tell you what works best, and what I don't understand (if anything).
I'll also post my own earlier attempt (which didn't work...) so that you can tell me, if possible, what I did wrong.
Oh, and thank you too for the impressive speed of your answers!
Cya later!
Inspired by this thread (and since I am once again reading "Mastering Regular Expressions" by Jeffrey Friedl), I tried to solve the problem with a Perl one-liner. Here it is:
This works very well with PenguinJr's example, but I have a question. I expected my code to work for any valid currency pattern, such as 34.89, .78, 344.2 and 60, that is, anything with up to two decimal places (9.123 should not match). But my code doesn't match a bare integer. My example:
But I expected the number 17 to be matched and extracted as well. As I understand it, the expression (?:\.\d{0,2}) should match a decimal point followed by zero to two digits. Why doesn't it work this way?
Thanks in advance (and thanks to PenguinJr for the challenging problem)
Well, I will let you solve it, Markus, but the question I ask you back is: on the line that has 17, where is the decimal point? Remember, you have said how many digits there may be.
Hello grail,
thanks for the answer. After sleeping on it I found the solution: since the pattern (?:\.\d{0,2}) means "at least a decimal point...", it did not work as I expected.
Now my problem is that when I change it to (?:\.?\d{0,2}), it also matches numbers with more than 2 decimal places, and the result is (with my example from above):
Code:
228.00
17
1238.13
137
1238.1
.12
I think I'll have to puzzle on this for a while.
Markus
Let's start by stripping off the (apparently perl-specific) "(?:)" brackets so we can look at the regex itself more clearly.
(BTW, I'm not very familiar with perl. What are they even there for? Everything appears to function fine without them.)
Code:
(\d*\.?\d{0,2})\s\&euro
The way I read it, it says "any number of digits, followed by an optional decimal, followed by zero to two digits, followed by \s&euro".
What I believe is happening is, since anything with more than two decimal places invalidates the "\.?\d{0,2}" part, then the regex behaves as if it's actually "\d*\s&euro". And in the string "1238.137 &euro" that means only "137 &euro" matches.
The only reliable way I can find to work around this is to ensure that there's some kind of anchoring match at the beginning of the number string. This seems to do the job:
Code:
perl -n -e 'm/(?:[^\d.])((?:\d*)(?:(\.\d{0,2})?))(?:\s\&euro)/ && print "$1\n"'
#or without the cruft; appears to give identical results.
perl -n -e 'm/[^\d.](\d*(\.\d{0,2})?)\s\&euro/ && print "$1\n"'
Which is similar to what I was doing in sed up above, only I just used ">" as the beginning match, since the OP said that's what it would always be.
Notice how you can also make the entire "\.\d{0,2}" string optional. Not that it makes any difference here.
There's one small side effect with the above though, in that it won't match if there are two periods or a number+period in front of the string. ">..12 €<" and ">0.12.25 €<" won't match, for example.
Perhaps something better could be done with a look-ahead match of some kind, but I don't know enough about those yet to figure it out myself.
Thanks for the response. The (?:...) construct is one of Perl's extended regex features: the parentheses group the pattern without capturing the matched string in a variable $1, $2, ... This is mainly useful with very large input files, since the number of stored captures then decreases significantly. As I wrote, I'm reading the book "Mastering Regular Expressions" (I read it for the first time last year), which is very interesting, and I took this example just for fun.
Quote:
Perhaps something better could be done with a look-ahead match of some kind, but I don't know enough about those yet to figure it out myself.
This is indeed what I'm looking for, but I haven't yet read the complete chapter in the book.
Anyway, my intention was to change the question to "how can I extract valid currency notations from a text file?", so I did not use the "<" and ">" characters. The problem I have is the string "1238.137 &euro": I wanted to match only values with up to two decimal places, whereas in your example the last digit "7" is simply cut off. But Perl can handle lookahead and lookbehind, and I'm trying to find out how I can use them for this problem.
I'll post the solution when it's ready, thanks again for your effort.