regex HTML Character Entities

craig467 · 05-19-2008, 08:21 AM

I am looking for a regex statement that will search the string:

Code:

&lt;div style=&quot;margin-left:50;&quot;&gt;&lt;blockquote&gt;This is the text that would be the quote that would go here.&lt;/blockquote&gt;&lt;/div&gt;

and convert the HTML character entities (&lt;, &gt;, and &quot;) into their respective HTML counterparts (<, >, ") for ONLY the div tags, leaving the other tags alone and keeping their HTML character entities intact.

Can someone help please, I have no idea how to tackle this.

Thanks.

pixellany · 05-19-2008, 08:38 AM

No time at the moment to test anything....

This general form in sed will be useful:
sed 's/&lt;\(div.*\)&gt;/<\1>/' oldfile > newfile

This uses a backreference to capture everything between &lt; and &gt;, and then inserts it between < and >. You may need to escape one or more of the special characters.
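To make the backreference idea concrete, here is a minimal sketch on a made-up one-line sample. The function names `greedy` and `bounded` and the sample string are assumptions for illustration, not part of the original post. It suggests that `.*` will run to the last &gt; on the line, while a bounded class such as `[^&]*` stops the capture at the first entity.

```shell
# Hypothetical sample: one encoded div tag followed by other encoded tags.
line='&lt;div class=note&gt;&lt;b&gt;hi&lt;/b&gt;'

# Greedy capture: .* extends to the LAST &gt; on the line,
# so the replacement brackets far more than the div tag.
greedy() {
    sed 's/&lt;\(div.*\)&gt;/<\1>/'
}

# Bounded capture: [^&]* cannot cross an entity,
# so the capture ends at the first &gt; after "div".
bounded() {
    sed 's/&lt;\(div[^&]*\)&gt;/<\1>/'
}

printf '%s\n' "$line" | greedy
printf '%s\n' "$line" | bounded
```

Note that the bounded form, as written, will not match a tag whose attributes contain &quot;, so it is only a starting point for the string in the original question.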

Good sed tutorial here: http://www.grymoire.com/Unix/Sed.html

craig467 · 05-20-2008, 09:04 AM

Thanks Pixellany, I am using CGI and before your post I tried
(&lt;)/*div[^(&gt;)]*(&gt;)
as the matching string, but it only got me the closing tag. I will try your suggestion and let you know how it turns out.
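A likely reason only the closing tag matched: inside square brackets, `[^(&gt;)]` is a set of single characters to exclude (parentheses, &, g, t and ;), not the sequence &gt;. The opening tag's attributes contain a "t", so the match stops there. The sketch below uses `grep -oE` to show this, with a shortened copy of the sample string; `first_attempt` and `entity_aware` are hypothetical names, and the second pattern is one possible alternative rather than a definitive fix.

```shell
sample='&lt;div style=&quot;margin-left:50;&quot;&gt;&lt;blockquote&gt;text&lt;/blockquote&gt;&lt;/div&gt;'

# Original attempt: the bracket expression excludes the characters
# ( & g t ; ) individually, so it cannot get past the "t" in "style".
first_attempt() {
    grep -oE '(&lt;)/*div[^(&gt;)]*(&gt;)'
}

# Possible alternative: between "div" and the closing &gt;, accept any
# character other than &, or a whole &quot; entity.
entity_aware() {
    grep -oE '&lt;/?div([^&]|&quot;)*&gt;'
}

printf '%s\n' "$sample" | first_attempt   # prints only the closing tag
printf '%s\n' "$sample" | entity_aware    # prints opening and closing tags
```

The alternative assumes &quot; is the only entity that appears inside a div tag, and it would also match a tag name that merely starts with "div".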

craig467 · 05-20-2008, 09:06 AM

Sorry, the smiley faces should be semicolons and parentheses. I did not know how to stop that.

ntubski · 05-20-2008, 10:38 AM

There's a "Disable smilies in text" option in the advanced view. :)

gnashley · 05-20-2008, 12:09 PM

Saw this (here, I believe) the other day:

Code:

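# This is a function to convert characters to their HTML entities
# or URI percent-encodings. It reads stdin and writes stdout.
# The & rule runs first so the entities added afterwards are not
# re-encoded, and & is escaped in replacements because sed treats
# a bare & there as "the matched text".

brackets () {
    sed -e '
        s/&/%26/g
        s/</\&lt;/g
        s/>/\&gt;/g
        s,/,%2f,g
        s/?/%3f/g
        s/:/%3a/g
        s/@/%40/g
        s/=/%3d/g
        s/+/%2b/g
        s/\$/%24/g
        s/,/%2c/g
        s/ /%20/g
        '
}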

osor · 05-20-2008, 06:59 PM

Quote:

Originally Posted by gnashley

Saw this (here, I believe) the other day

Unfortunately, this handles both HTML and URI escape sequences, and is not of much use to the OP because:

It makes global changes instead of changing only the div tags.
It does undesired URI encoding as well (e.g., all spaces in the original file become %20).
It does the reverse of the desired operation.
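
For comparison, here is a rough sketch of a filter that avoids those three problems: it decodes rather than encodes, touches only div tags, and does no URI handling. The function name `decode_div_tags` is hypothetical, and the script assumes `sed -E` is available (GNU and BSD sed both accept it), that tags do not span lines, that the input contains no raw < or >, and that &quot; is the only entity inside a div tag. Treat it as a starting point, not a tested general solution.

```shell
# Hypothetical helper: decode &lt; &quot; &gt; inside div tags only.
decode_div_tags() {
    sed -E '
        # 1. Turn the &lt; that starts a div tag (opening or closing) into <.
        s/&lt;(\/?div)/<\1/g
        # 2. Decode &quot; one at a time while still inside an unclosed div tag.
        :quote
        s/(<\/?div[^<>&]*)&quot;/\1"/
        t quote
        # 3. Turn the &gt; that ends each div tag into >.
        s/(<\/?div[^<>&]*)&gt;/\1>/g
    '
}

# Usage with a shortened copy of the string from the first post:
printf '%s\n' '&lt;div style=&quot;margin-left:50;&quot;&gt;&lt;blockquote&gt;Quote text.&lt;/blockquote&gt;&lt;/div&gt;' | decode_div_tags
```

Because `[^<>&]*` cannot cross an entity or a real bracket, steps 2 and 3 stay inside the tag opened in step 1, which is what leaves the blockquote entities untouched. As written, step 1 would also fire on a tag name that merely begins with "div".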