LinuxQuestions.org
Old 03-25-2011, 05:20 PM   #1
TheCrow33
Member
 
Registered: Aug 2009
Posts: 81

Rep: Reputation: 8
Regular Expressions using sed


I've used regular expressions before to an extent, and currently I have a problem where I absolutely must use a regular expression to (based on the title of a web page) replace the text '{Article}' with the contents of an HTML comment.

So first I was experimenting with sed on the command line to get this working to my liking, but I hit a stopping point very early on. I've never worked with regular expression conditionals, so I tried to start off with just an if-then without an else. Sed keeps complaining: "sed: -e expression #1, char 84: unknown option to `s'". I can't even pinpoint character 84 because I'm not sure where exactly it starts counting. Anyway, here's the command I'm using, and I'd appreciate any help in pinpointing this error.

echo $T | sed 's/(?(\(.*\)\(<title>Replacer</title>\)\(.*\){Article}\(.*\))(.*<!--Not:\(.*\)-->))/\1\2\3\5\4/'

where the variable T holds the contents of a file (i.e. T=`cat test.html`), and test.html is the following:

Code:
<html>
<head>
<title>Replacer</title>
<!--Replacer:Hello Motto-->
<!--Not:Hello World-->
</head>
<body>
{Article}
</body>
</html>
 
Old 03-25-2011, 05:43 PM   #2
crts
Senior Member
 
Registered: Jan 2010
Posts: 1,604

Rep: Reputation: 446
Hi,

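try using another delimiter, e.g. "|". With "/" as the delimiter, the unescaped "/" in "</title>" ends the pattern early and the next "/" ends the replacement, so sed reads the "\1" that follows (character 84 of the expression) as an option to `s'.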
Code:
echo $T|sed 's|(?(\(.*\)\(<title>Replacer</title>\)\(.*\){Article}\(.*\))(.*<!--Not:\(.*\)-->))|\1\2\3\5\4|'
However, I am not quite sure what you expect the output to look like. Please post some sample data of the expected output since the command above does not throw an error but it also does not do any replacement/rearrangement.
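
A minimal sketch of the delimiter issue, using a made-up one-line input rather than the full test.html: with "/" as the delimiter every literal "/" in the pattern has to be escaped, while "|" lets it stand as written. The variable names and the replacement text are only for illustration.

```shell
# Hypothetical one-line input, for illustration only.
line='<title>Replacer</title>'

# "/" as delimiter: the "/" in </title> must be escaped as \/.
out_slash=$(printf '%s\n' "$line" | sed 's/<title>Replacer<\/title>/<title>Changed<\/title>/')

# "|" as delimiter: the "/" needs no escaping.
out_pipe=$(printf '%s\n' "$line" | sed 's|<title>Replacer</title>|<title>Changed</title>|')

printf '%s\n' "$out_slash" "$out_pipe"
```

Both commands produce the same line; only the amount of escaping differs.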
 
Old 03-25-2011, 07:05 PM   #3
TheCrow33
Member
 
Registered: Aug 2009
Posts: 81

Original Poster
Rep: Reputation: 8
Well that certainly stops the error, so thanks for that bit. I'm trying to make it take the file I attached in the last post and turn it into something like this:

Code:
<html>
<head>
<title>Replacer</title>
<!--Replacer:Hello Motto-->
<!--Not:Hello World-->
</head>
<body>
Hello World
</body>
</html>
But {Article} should only be replaced if the title is Replacer. I thought the "if" part of the expression, (?(\(.*\)\(<title>Replacer</title>\)\(.*\){Article}\(.*\)), would check both that Replacer is in the title and that {Article} appears somewhere else in the document (not in the title). To my understanding (apparently not the correct one, haha), if the condition in the "if" part is met, the regex moves on to the "then" clause, which I thought would search the document for the text inside it. So basically, once it has confirmed that the title is Replacer and that {Article} is present, I want it to find the comment that starts with <!--Not:, take all the text inside it ("Hello World"), and put that in place of {Article}.

Obviously my regex is a bit messed up.
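
A likely reason nothing is replaced: sed's regular expressions have no "(?(condition)then)" construct (that syntax comes from Perl-style regex engines), and in sed's default basic syntax an unescaped "(" or "?" is an ordinary character. A small sketch with a made-up input string shows sed treating "(?" as literal text:

```shell
# Hypothetical input containing a literal "(?"; not taken from test.html.
input='a(?b'

# In basic regular expressions "(" and "?" match themselves,
# so this replaces the two literal characters "(?" with "X".
result=$(printf '%s\n' "$input" | sed 's/(?/X/')

printf '%s\n' "$result"
```

This prints aXb. Since test.html contains no literal "(?(", the pattern from post #1 has nothing to match.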
 
Old 03-25-2011, 07:30 PM   #4
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian sid + kde 3.5 & 4.4
Posts: 6,823

Rep: Reputation: 1950
sed, the "stream editor", is designed for single-line edits and is not well suited to multi-line work or complex conditional logic like this. You really should use awk or perl, or even a dedicated HTML parsing program, for this. With sed, you'd probably have to build a complex expression using the N command, nested commands, and maybe even conditional loops and the hold buffer. And even that wouldn't be ideal, since HTML is a free-form format whose layout isn't tied to lines.

BTW, though, you can avoid most of those backslash escapes by enabling the -r (extended regular expressions) option.

Here are a few useful sed references:
http://www.grymoire.com/Unix/Sed.html
http://sed.sourceforge.net/sedfaq.html
http://sed.sourceforge.net/sed1line.txt
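
To illustrate the point about backslashes, here is a small sketch with a made-up input. It uses -E, which current GNU and BSD sed both accept for extended regular expressions; -r is the GNU spelling of the same option.

```shell
# Hypothetical input, for illustration only.
text='Hello World'

# Basic syntax: grouping parentheses need backslashes.
bre=$(printf '%s\n' "$text" | sed 's/\(Hello\) \(World\)/\2 \1/')

# Extended syntax: plain parentheses group.
ere=$(printf '%s\n' "$text" | sed -E 's/(Hello) (World)/\2 \1/')

printf '%s\n' "$bre" "$ere"
```

Both commands swap the two words; the extended form simply reads more easily.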
 
Old 03-25-2011, 07:49 PM   #5
kurumi
Member
 
Registered: Apr 2010
Posts: 223

Rep: Reputation: 45
Code:
$ awk '/title/&&/Replacer/{f=1}f&&/Article/{next}1' file
Ruby(1.9+)
Code:
$ ruby -ne 'f=1 if /title/&&/Replacer/; next if f&&/Article/;print' file
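
Note that the one-liners above skip the {Article} line rather than substituting it. A hedged sketch of the substitution itself, in awk, might look like the following. The function name replace_article and the variable names are made up for illustration, and it assumes (as in the sample) that the title and the comment each sit on one line, appear before {Article}, and that the comment text contains no "&" or "\".

```shell
# Print FILE with {Article} replaced by the text of the <!--Not:...--> comment,
# but only once <title>Replacer</title> has been seen.
replace_article() {
    awk '
        /<title>Replacer<\/title>/ { found = 1 }
        found && /<!--Not:.*-->/ {
            text = $0
            sub(/.*<!--Not:/, "", text)   # drop everything up to the marker
            sub(/-->.*/, "", text)        # drop the comment terminator
            have_text = 1
        }
        found && have_text { gsub(/[{]Article[}]/, text) }
        { print }
    ' "$1"
}

# Recreate the sample from post #1 in a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<html>
<head>
<title>Replacer</title>
<!--Replacer:Hello Motto-->
<!--Not:Hello World-->
</head>
<body>
{Article}
</body>
</html>
EOF

output=$(replace_article "$sample")
rm -f "$sample"
printf '%s\n' "$output"
```

If the title is anything other than Replacer, found is never set and the file is printed unchanged.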
 
Old 03-25-2011, 08:17 PM   #6
crts
Senior Member
 
Registered: Jan 2010
Posts: 1,604

Rep: Reputation: 446
As David said, editing HTML via sed is complicated.
Here are two sed commands that work with your sample data. You might have to add an appropriate [[:blank:]]* to the regex if, e.g., there were spaces, as in
Code:
<!--   Not:
In case {Article} is really just one line:
Code:
sed -r '/<title>Replacer<\/title>/ {:a N;/<!--Not:/ {h;b};/\n<!--[^\n]+$/ ba};/\{Article\}/ {x;s/.*<!--Not:(.*)-->$/\1/;t;x};' file
Or if there are several lines between the <body> tags:
Code:
sed -r '/<title>Replacer<\/title>/ {:a N;/<!--Not:/ {h;b};/\n<!--[^\n]+$/ ba};/<body>/ {:b N; /<\/body>/! bb; x;s/.*<!--Not:(.*)-->$/<body>\n\1\n<\/body>/;t;x};' file
If "<!--Not" is present, {Article} will be replaced; otherwise it will not.

If your data is not strictly arranged as in your sample then you should consider using an HTML parser.

[EDIT]
Note that the above sed statements operate on the file itself. Do not echo the variable into them through a pipe; neither will work if you echo the file as a single huge line.

Last edited by crts; 03-25-2011 at 08:23 PM.
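
The warning about piping is worth a quick check. In a POSIX-style shell such as bash, an unquoted $T is split on whitespace and echo rejoins the words with single spaces, so the newlines disappear; quoting the variable keeps them. The three-line value below is a made-up stand-in for the file contents.

```shell
# Hypothetical stand-in for T=`cat test.html`.
T=$(printf '%s\n' '<head>' '<title>Replacer</title>' '</head>')

# Unquoted: word splitting collapses the newlines, leaving one line.
unquoted_lines=$(echo $T | wc -l)

# Quoted: the newlines survive.
quoted_lines=$(printf '%s\n' "$T" | wc -l)

echo "unquoted: $unquoted_lines line(s), quoted: $quoted_lines line(s)"
```

Line-oriented sed scripts such as the two above need the quoted form, or simply the file name as an argument.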
 
  


All times are GMT -5. The time now is 09:27 PM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
identi.ca: @linuxquestions
Facebook: linuxquestions Google+: linuxquestions
Open Source Consulting | Domain Registration