LinuxQuestions.org


TheCrow33 03-25-2011 04:20 PM

Regular Expressions using sed
 
I've used regular expressions before to an extent, and currently I have a problem where I absolutely must use a regular expression to (based on the title of a web page) replace the text '{Article}' with the contents of an HTML comment.

So first I was experimenting with sed on the command line to get this working, but I hit a wall very early on. I've never worked with regular expression conditionals, so I tried to start with just an if-then without an else. Sed keeps complaining: "sed: -e expression #1, char 84: unknown option to `s'". I can't even pinpoint character 84 because I'm not sure where it starts counting. Anyway, here's the command I'm using, and I'd appreciate any help in pinpointing this error.

echo $T | sed 's/(?(\(.*\)\(<title>Replacer</title>\)\(.*\){Article}\(.*\))(.*<!--Not:\(.*\)-->))/\1\2\3\5\4/'

Here the variable T holds the contents of a file (i.e. T=`cat test.html`), and test.html is the following:

Code:

<html>
<head>
<title>Replacer</title>
<!--Replacer:Hello Motto-->
<!--Not:Hello World-->
</head>
<body>
{Article}
</body>
</html>


crts 03-25-2011 04:43 PM

Hi,

Try using another delimiter, e.g. "|":
Code:

echo $T|sed 's|(?(\(.*\)\(<title>Replacer</title>\)\(.*\){Article}\(.*\))(.*<!--Not:\(.*\)-->))|\1\2\3\5\4|'
However, I am not quite sure what you expect the output to look like. The command above no longer throws an error, but it does not do any replacement or rearrangement either, so please post a sample of the expected output.
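
To illustrate why the delimiter matters: with "/" as the delimiter, the unescaped "/" in </title> ends the pattern early, so sed ends up reading the tail of the expression as flags to `s', which is what the "unknown option" message is about. A minimal sketch with a shortened pattern (the one-line input and the word "Changed" are made up for illustration):

```shell
# Hypothetical one-line input, just for illustration.
line='<title>Replacer</title>'

# With "/" as the delimiter, the slash in </title> has to be escaped ...
with_slash=$(printf '%s\n' "$line" | sed 's/<title>Replacer<\/title>/<title>Changed<\/title>/')

# ... with "|" as the delimiter it can be written as-is.
with_pipe=$(printf '%s\n' "$line" | sed 's|<title>Replacer</title>|<title>Changed</title>|')

printf '%s\n%s\n' "$with_slash" "$with_pipe"
```

Both variants should print the same changed title line; the second is simply easier to read when the pattern is full of closing tags.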

TheCrow33 03-25-2011 06:05 PM

Well, that certainly stops the error, so thanks for that bit. I'm trying to make it take the file from my first post and turn it into something like this:

Code:

<html>
<head>
<title>Replacer</title>
<!--Replacer:Hello Motto-->
<!--Not:Hello World-->
</head>
<body>
Hello World
</body>
</html>

But it should only replace {Article} if the title is Replacer. I thought the if part of the statement, (?(\(.*\)\(<title>Replacer</title>\)\(.*\){Article}\(.*\)), would make sure that Replacer was found in the title and that {Article} was somewhere else in the document (not in the title). To my understanding (apparently not the correct understanding, haha), if the condition in the if part is met, the regex moves on to the then clause, which I thought would search the document for the text within. So basically, after it makes sure that Replacer is the title and {Article} is present, I want it to find the comment that starts with <!--Not:, take all the text inside it ("Hello World"), and put it in place of {Article}.

Obviously my regex is a bit messed up.

David the H. 03-25-2011 06:30 PM

sed, the "stream editor", is designed for single-line edits and is not well suited for multi-line work or complex conditional stuff like this. You really should use awk or perl, or even a dedicated HTML parsing program for this. With sed, you'd probably have to create a complex expression with the N command, nested commands, and maybe even conditional loops and use of the hold buffer. And even that wouldn't be ideal, since HTML is a loosely structured format that isn't organized by lines.

BTW, though, you can avoid having to use most of those backslash escapes by enabling the -r (extended regex) option.
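
A small sketch of the difference, using a made-up input line. It uses -E, which GNU sed accepts as a synonym for -r and which BSD sed also understands:

```shell
# Basic regex (the default): groups need backslashes.
bre=$(printf 'Hello World\n' | sed 's/\(Hello\) \(World\)/\2 \1/')

# Extended regex: plain parentheses are groups.
ere=$(printf 'Hello World\n' | sed -E 's/(Hello) (World)/\2 \1/')

printf '%s\n%s\n' "$bre" "$ere"
```

Both commands swap the two words. Keep in mind that in extended mode a literal "{" or "(" is the one that needs the backslash.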

Here are a few useful sed references:
http://www.grymoire.com/Unix/Sed.html
http://sed.sourceforge.net/sedfaq.html
http://sed.sourceforge.net/sed1line.txt

kurumi 03-25-2011 06:49 PM

Code:

$ awk '/title/&&/Replacer/{f=1}f&&/Article/{next}1' file
Ruby (1.9+):
Code:

$ ruby -ne 'f=1 if /title/&&/Replacer/; next if f&&/Article/;print' file
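
As written, these one-liners drop the {Article} line rather than replacing it. Below is a rough sketch of an awk variant that substitutes the text of the <!--Not:...--> comment instead. It assumes the comment sits on a single line and appears before {Article}, as in the sample; the temporary file and the variable names (found, have, text) are made up for illustration.

```shell
# Recreate the sample from the first post in a temporary file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<html>
<head>
<title>Replacer</title>
<!--Replacer:Hello Motto-->
<!--Not:Hello World-->
</head>
<body>
{Article}
</body>
</html>
EOF

result=$(awk '
    /<title>Replacer<\/title>/ { found = 1 }     # title matches: allow replacement
    found && /<!--Not:.*-->/ {                   # remember the text inside the comment
        text = $0
        sub(/^.*<!--Not:/, "", text)
        sub(/-->.*$/, "", text)
        have = 1
    }
    found && have && /[{]Article[}]/ {           # substitute the placeholder
        gsub(/[{]Article[}]/, text)              # caveat: "&" in text is special to gsub
    }
    { print }
' "$tmp")

printf '%s\n' "$result"
rm -f "$tmp"
```

If the title is not Replacer, or no <!--Not: comment has been seen, the input is printed unchanged.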

crts 03-25-2011 07:17 PM

As David said, editing HTML via sed is complicated.
Here are two sed commands that work with your sample data. You might have to add appropriate [[:blank:]]* to the regex if, e.g., there were spaces like in
Code:

<!--  Not:
In case {Article} is really just one line:
Code:

sed -r '/<title>Replacer<\/title>/ {:a N;/<!--Not:/ {h;b};/\n<!--[^\n]+$/ ba};/\{Article\}/ {x;s/.*<!--Not:(.*)-->$/\1/;t;x};' file
Or if there are several lines between the <body> tags:
Code:

sed -r '/<title>Replacer<\/title>/ {:a N;/<!--Not:/ {h;b};/\n<!--[^\n]+$/ ba};/<body>/ {:b N; /<\/body>/! bb; x;s/.*<!--Not:(.*)-->$/<body>\n\1\n<\/body>/;t;x};' file
If "<!--Not" is present, {Article} will be replaced; otherwise it is left alone.

If your data is not strictly arranged as in your sample then you should consider using an HTML parser.

[EDIT]
Note that the sed commands above operate on the file directly. Do not echo the variable into them through a pipe: neither will work if the file arrives as a single huge line.
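
To see where the single huge line comes from: an unquoted $T is word-split by the shell, so echo joins the pieces with spaces and the newlines are gone. A minimal sketch, assuming a POSIX sh or bash and using a made-up two-line stand-in for the file contents:

```shell
# Two-line stand-in for the contents of test.html (made up for illustration).
T='<title>Replacer</title>
{Article}'

# Unquoted: the shell splits $T into words and echo joins them with spaces.
unquoted=$(echo $T | grep -c '')

# Quoted: the newlines survive, so a line-based tool sees separate lines.
quoted=$(printf '%s\n' "$T" | grep -c '')

printf 'unquoted: %s line(s), quoted: %s line(s)\n' "$unquoted" "$quoted"
```

So if the data really has to come from a variable, quoting it ("$T") at least keeps the line structure intact.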


All times are GMT -5.