Linux - Newbie This Linux forum is for members that are new to Linux.
01-07-2008, 02:13 PM
|
#1
|
LQ Newbie
Registered: Oct 2007
Posts: 26
Rep:
|
how to look for the shortest match using regex, basically the opposite of .*
hi,
i'm having a problem in the following situation:
suppose, i have a string "Scrapple from the apple."
then, if i use the regular expression "a.*e", it will match: "apple from the apple", because by definition using the .* will match the longest string that will match the regular expression.
and it won't match: "apple" OR "apple from the" even though these also start with "a" and end with an "e".
my problem is that instead of looking for the longest match, i want the shortest match. i've looked at the tutorials, but am still at a loss on how to do this. any help will be much appreciated.
|
|
|
01-07-2008, 02:45 PM
|
#2
|
LQ Veteran
Registered: Nov 2005
Location: Annapolis, MD
Distribution: Mint
Posts: 17,809
|
Here is just one crude method (using sed):
sed 's/a.\{0,3\}e/G/g' filename
This matches the pattern: "a" + at most 3 characters + "e", and replaces every occurrence with "G".
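To see what the length limit does, here is a small sketch that runs the command on the sample sentence from the first post. The variable names are only for illustration, and the second command uses a deliberately oversized limit of 20 to show how the approach can fail.

```shell
text='Scrapple from the apple.'

# At most 3 characters between "a" and "e": each "apple" is matched on its own.
bounded=$(printf '%s\n' "$text" | sed 's/a.\{0,3\}e/G/g')
printf '%s\n' "$bounded"    # prints: ScrG from the G.

# With a limit of 20 the match is still greedy inside that limit,
# so it runs through to the last "e" it can reach.
too_wide=$(printf '%s\n' "$text" | sed 's/a.\{0,20\}e/G/g')
printf '%s\n' "$too_wide"   # prints: ScrG.
```

The limit only helps when the longest wanted match is known in advance and is shorter than the gap to the next match.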
|
|
|
01-07-2008, 02:52 PM
|
#3
|
Senior Member
Registered: Jun 2003
Location: California
Distribution: Slackware
Posts: 1,181
Rep:
|
You want the "non-greedy" matching operators. In perl, for example, putting ? after a quantifier makes it non-greedy: .+? matches as few characters as it can and stops at the first place where the rest of the pattern fits (beware of .*? -- it will happily match 0 characters and end).
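sed's regular expressions have no non-greedy quantifiers, so the Perl-style .+? is not available there. When the match should stop at a single known character, a common substitute is a negated bracket expression. This is a different technique from Perl's ? modifier; the sketch below applies it to the sample sentence from the first post, with variable names chosen only for illustration.

```shell
text='Scrapple from the apple.'

# Greedy: ".*" runs to the last "e" on the line.
greedy=$(printf '%s\n' "$text" | sed 's/a.*e/X/')
printf '%s\n' "$greedy"     # prints: ScrX.

# "[^e]*" cannot step over an "e", so the match ends at the first one.
shortest=$(printf '%s\n' "$text" | sed 's/a[^e]*e/X/')
printf '%s\n' "$shortest"   # prints: ScrX from the apple.
```

The trick only works when the end of the match is one character; a multi-character terminator such as a closing tag needs another approach.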
|
|
|
01-07-2008, 03:12 PM
|
#4
|
LQ Newbie
Registered: Oct 2007
Posts: 26
Original Poster
Rep:
|
here is an example of what i'm trying to do.
i'm trying to delete everything between and including <tag1> and </tag1>. but anything that's outside of this should not be deleted.
i'm doing this with a sed script, and the regex is not working.
[HTML]
<html><body><tag1>This is inside tag1. This should be deleted.</tag1>This is the first statement outside of tag1. This should NOT be deleted.<tag1>This is once again inside tag1. This should be deleted as well.</tag1> This is the second statement outside tag1. This should NOT be deleted.</body></html>
[/HTML]
i've tried the following:
in this one the problem is that it deletes the first line outside <tag1> as well.
Code:
$cat test1 | sed 's/<tag1>.*<\/tag1>//'
<html><body>This is the second statement outside tag1. This should not be deleted.</body></html>
in this one the problem is that it does not delete anything:
Code:
$cat test1 | sed 's/<tag1>.+?<\/tag1>//'
<html><body><tag1>This is inside tag1. This should be deleted.</tag1>This is the first statement outside tag1. This should not be deleted.<tag1>This is once again inside tag1. This should be deleted as well.</tag1>This is the second statement outside tag1. This should not be deleted.</body></html>
i've tried several variants of the regex above... but am still at a loss on how to do this... any guidance will be helpful. thanks!
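A note on why the second attempt leaves the line untouched: in sed's default (basic) regular expressions, + and ? are ordinary characters, so .+? means "any one character followed by the literal text +?". The sketch below uses two made-up input strings to show this.

```shell
# The pattern only matches where a literal "+?" is present.
literal=$(printf '%s\n' 'ab+?c' | sed 's/.+?/X/')
printf '%s\n' "$literal"     # prints: aXc

# No literal "+?" in the input, so nothing is replaced.
unchanged=$(printf '%s\n' '<tag1>x</tag1>y' | sed 's/<tag1>.+?<\/tag1>//')
printf '%s\n' "$unchanged"   # prints: <tag1>x</tag1>y
```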
|
|
|
01-07-2008, 03:27 PM
|
#5
|
Senior Member
Registered: Jun 2003
Location: California
Distribution: Slackware
Posts: 1,181
Rep:
|
What about using "s/(^.*<tag1>).*(<\/tag1>.*$)/\1\2/"? That will save everything up to and including <tag1> from the start of the line, and then save everything after and including </tag1>, to the end of the line, chopping the middle. An inelegant solution, I know, but something that may work until something better comes along.
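For anyone trying this suggestion, here is a sketch of the same substitution in sed's extended syntax (the -E option), which may not be the tool the poster had in mind. The input string is a shortened, made-up stand-in for the sample line.

```shell
input='<tag1>A</tag1>x<tag1>B</tag1>y'

# The first ".*" is greedy, so group 1 extends to the LAST <tag1> on the line.
kept=$(printf '%s\n' "$input" | sed -E 's/(^.*<tag1>).*(<\/tag1>.*$)/\1\2/')
printf '%s\n' "$kept"   # prints: <tag1>A</tag1>x<tag1></tag1>y
```

Only the text inside the last pair is removed, and the tags themselves stay, so this does not yet do what the original poster asked for.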
|
|
|
01-07-2008, 03:32 PM
|
#6
|
LQ Veteran
Registered: Nov 2005
Location: Annapolis, MD
Distribution: Mint
Posts: 17,809
|
In your first example, the regex is "greedy", i.e. it goes all the way to the last instance of </tag1>.
In addition to my earlier crude solution (max # of characters), you could also do this:
sed -e 's/\/tag1/TAGONE/' -e 's/<tag1>.*<TAGONE>//'
(By replacing only the first instance of "/tag1" you create an unambiguous endpoint for the second SED command.)
My favorite SED tutorial:
http://www.grymoire.com/Unix/Sed.html
I don't know if ".+?" is legal--I have never seen it in Bash. "+" means one or more, and "?" means optional---I don't know what the combo would mean. Maybe try it on something simple.....
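The marker idea in this post can be stretched to cover every tag pair on the line. The following is a sketch, not the poster's command: it turns each closing tag into a single marker character and then uses a negated bracket expression so that no match can run past a marker. It assumes the character "~" never occurs in the input, and the sample line is a shortened, made-up version of the one in the thread.

```shell
input='<html><body><tag1>delete me</tag1>keep 1<tag1>delete me too</tag1>keep 2</body></html>'

# Step 1: replace every closing tag with the marker "~" (assumed absent from the input).
# Step 2: "[^~]*" stops at the nearest marker, so each pair is removed separately.
result=$(printf '%s\n' "$input" | sed -e 's/<\/tag1>/~/g' -e 's/<tag1>[^~]*~//g')
printf '%s\n' "$result"   # prints: <html><body>keep 1keep 2</body></html>
```

A closing tag with no matching opening tag would be left behind as a stray "~", so the sketch suits well-formed input only.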
|
|
|
01-07-2008, 03:42 PM
|
#7
|
Senior Member
Registered: Jun 2003
Location: California
Distribution: Slackware
Posts: 1,181
Rep:
|
Quote:
Originally Posted by pixellany
I don't know if ".+?" is legal--I have never seen it in Bash. "+" means one or more, and "?" means optional---I don't know what the combo would mean. Maybe try it on something simple.....
|
His question just mentioned regular expressions, context-unspecific. In perl (where I do 99% of my personal RegExp work) one can use the '?' operator to turn a greedy operator into a non-greedy one.
|
|
|
01-08-2008, 09:29 AM
|
#8
|
LQ Veteran
Registered: Nov 2005
Location: Annapolis, MD
Distribution: Mint
Posts: 17,809
|
This works:
Code:
sed 's/[0-9]\{2\},\+\?[0-9]\{2\}/DDD/g' filename
This finds all instances of 2 digits + an optional run of 1 or more commas + 2 more digits.
From all my reading, I had no idea that the \+\? construct would work.
It seems that this has a very different meaning from the perl one.
EDIT--PS: Works in grep, too...
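Note that \+ and \? are GNU extensions to basic regular expressions. Since "optionally one or more commas" is the same as "zero or more commas", a plain ,* should express the same pattern portably. The sketch below runs both forms on a made-up input; the result of the GNU form is left unchecked because it depends on the sed in use.

```shell
input='12,,34 5678'

# Portable form: ",*" means zero or more commas.
portable=$(printf '%s\n' "$input" | sed 's/[0-9]\{2\},*[0-9]\{2\}/DDD/g')
printf '%s\n' "$portable"   # prints: DDD DDD

# GNU form from the post above; expected to agree with the portable one on GNU sed.
gnu_form=$(printf '%s\n' "$input" | sed 's/[0-9]\{2\},\+\?[0-9]\{2\}/DDD/g')
printf '%s\n' "$gnu_form"
```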
Last edited by pixellany; 01-08-2008 at 09:35 AM.
|
|
|
01-08-2008, 10:21 AM
|
#9
|
Senior Member
Registered: Aug 2006
Posts: 2,697
|
Quote:
Originally Posted by new_2_unix
i'm doing this with a sed script, and the regex is not working.
|
Sometimes regex can be a pain, if you don't understand it enough. Until you get to know it better, here's one way without regex (at least not too much)
Code:
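awk 'BEGIN { FS = "</tag1>" }    # split each line at every closing tag
{
    for (i = 1; i <= NF; i++) {
        if ($(i) ~ /<tag1>/) {
            # the field contains an opening tag: keep only the text before it
            split($(i), a, /<tag1>/)
            print a[1]
        } else {
            # no opening tag in this field: keep it whole
            print $(i)
        }
    }
}
' "file"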
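The script above prints each kept fragment on its own line. If the output should stay on one line, a variant along the following lines may help. It is a sketch that keeps the same field-splitting idea but glues the fragments back together; the variable names and the shortened input are made up for illustration.

```shell
input='<body><tag1>drop</tag1>keep 1<tag1>drop too</tag1>keep 2</body>'

result=$(printf '%s\n' "$input" | awk 'BEGIN { FS = "</tag1>" }
{
    line = ""
    for (i = 1; i <= NF; i++) {
        piece = $i
        sub(/<tag1>.*/, "", piece)   # drop everything from the opening tag onward
        line = line piece            # rebuild the line without the tagged text
    }
    print line
}')
printf '%s\n' "$result"   # prints: <body>keep 1keep 2</body>
```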
|
|
|
All times are GMT -5. The time now is 05:53 AM.