Regular expression to match unspecified number of characters until a '>'

cygnal · 07-07-2010, 11:44 AM

I'm attempting to search through a rather large assortment of html files created in Word using 'save as html'. Ugh. Specifically, what I'm trying to do is find and delete these tags (they're causing browsers to display black diamonds with white question marks):

<span style='mso-spacerun:yes'> </span>

Tags contain from 1 to 4 spaces between opening and closing. I get positive results from this:

grep \<span\ style=\'mso-spacerun:yes\'\> filename.html

but once I attempt to tell it to match any number of characters up until the next '>' symbol, it tells me I'm using an invalid regex:

grep \<span\ style=\'mso-spacerun:yes\'\>[^>]+\> filename.html

I've been nose-deep in regex tutorials for the past day or so, and I'm still not understanding why this doesn't work. If I put the pattern (without backslashes) into a separate file and use `grep -f patternfile filename.html`, I get no error but no matches either. So far as I can figure, the above regex boils down to:
Match the string "<span style='mso-spacerun:yes'>", followed by any number of characters that are not a ">", followed by a ">".

If someone could tell me where I'm going wrong with this, it'd be much appreciated.

David the H. · 07-07-2010, 12:17 PM

[ and ] (and spaces, for that matter) are also used by the shell itself, so they need to be protected; so does the > inside the brackets, which the shell otherwise reads as an output redirection. Try enclosing the expression in double quotes (can't use single quotes, since the string itself has them). Also, when dealing with complex strings like this, it's usually easier to run grep with extended regex enabled (-E), so that + works as a quantifier and you can avoid all the backslashing.

Code:

grep -E "<span style='mso-spacerun:yes'>[^>]+>" filename.html
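As a quick sanity check of the quoted form, here is a minimal sketch. It assumes a throwaway sample file with made-up content and plain ASCII spaces inside the tags, and simply counts the matching lines:

```shell
# Hypothetical sample file; the spaces inside the tags are ordinary ASCII spaces.
tmp=$(mktemp)
printf '%s\n' \
  "<p>one<span style='mso-spacerun:yes'> </span>two</p>" \
  "<p>three<span style='mso-spacerun:yes'>    </span>four</p>" \
  "<p>no spacer here</p>" > "$tmp"

# Count the lines containing the opening tag, at least one non-> character, and a closing >.
match_count=$(grep -cE "<span style='mso-spacerun:yes'>[^>]+>" "$tmp")
echo "$match_count"   # prints 2
rm -f "$tmp"
```

If the count comes back lower on a real file than on a sample like this, the difference is more likely in the file's bytes than in the pattern.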

But if you really want to remove them from the file itself, use sed instead which can edit the file in place (as well as make an optional backup).

Code:

sed -i.bkup -r "s|<span style='mso-spacerun:yes'>[^>]+>||g" filename.html

| is used instead of / since it's also a common html character.

Finally, be aware that none of these will work if the pattern spans multiple lines.
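The in-place edit and the multi-line caveat can both be seen in a small sketch. It assumes a scratch directory and a made-up sample.html, and uses -E (which GNU sed treats the same as -r) so the line also runs on BSD sed:

```shell
work=$(mktemp -d)
# Line 1 holds a complete tag; lines 2-3 hold a tag split across a line break.
printf '%s\n' \
  "a<span style='mso-spacerun:yes'>  </span>b" \
  "c<span style='mso-spacerun:yes'>" \
  " </span>d" > "$work/sample.html"

# Edit in place, keeping the original as sample.html.bkup.
sed -i.bkup -E "s|<span style='mso-spacerun:yes'>[^>]+>||g" "$work/sample.html"

first_line=$(sed -n '1p' "$work/sample.html")
second_line=$(sed -n '2p' "$work/sample.html")
backup_exists=no
[ -f "$work/sample.html.bkup" ] && backup_exists=yes

echo "$first_line"    # the complete tag is gone: ab
echo "$second_line"   # the split tag is left untouched
rm -rf "$work"
```

sed works one line at a time here, so the tag that continues on the next line never matches.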

cygnal · 07-07-2010, 03:34 PM

Thanks for the help, David.

I had tried the double quotes surrounding the expression, which got me no results at all, so I figured that it was looking for the double quotes themselves as part of the string. I've also made all my attempts with and without the -E switch and haven't noticed a difference. Backslash-escaping, as in the first pattern,

grep \<span\ style=\'mso-spacerun:yes\'\> filename.html

gives the matches that I know are in the file I'm working with, but escaping the square brackets

grep \<span\ style=\'mso-spacerun:yes\'\>\[^>\]+\> filename.html

still got me an 'invalid regular expression' error, which led me to believe I was giving wrong syntax when including the [^>]+> bit at the end.
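A likely explanation for that error, offered as a sketch rather than a confirmed diagnosis: the > inside [^>] is not escaped, so a POSIX-style shell treats it as a redirection and grep only receives the pattern up to "[^". The helper show_args below is hypothetical and merely stands in for grep so the arguments become visible:

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

# Hypothetical helper: prints each argument it receives in brackets, one per line.
show_args() { printf '[%s]\n' "$@"; }

# Same shape as the failing grep command; the unescaped > acts as a redirection operator.
show_args \<span\ style=\'mso-spacerun:yes\'\>[^>]+\> filename.html

# Nothing appears on the terminal: the output went into a file literally named "]+>".
first_arg=$(sed -n '1p' './]+>')
echo "$first_arg"   # the pattern as the command saw it, ending in an unclosed [^
```

A pattern that ends in an unclosed bracket is what grep reports as an invalid regular expression, and a stray file named "]+>" in the working directory would be a telltale sign.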

Sed is my plan once I concoct the regex I need. The example you gave didn't work, and I'm wondering if that's because the characters enclosed in the span tag aren't displaying properly (they show up as question marks within diamonds). They'd still count as characters that don't match ">", even if they don't display correctly, right?

David the H. · 07-07-2010, 05:39 PM

The line works perfectly for me using a test file when quoted as I showed. The outer double-quotes escape the inner single-quotes, so that's not a problem. They protect the contents from the shell long enough to pass them to grep, and are consumed in the process, so no, they shouldn't be interfering with the expression itself.

The only thing I can suggest is to look very closely at what you're trying to match. Is that space really a standard ASCII space, for example?

But thinking about it a bit more, those "black diamonds" are usually a sign that the encoding of the file is different from that of the display program. I'd lay good odds that they're in the Windows cp1252 encoding. If you run them through iconv or open-convert-resave to utf-8 in a decent editor like kwrite or gedit, you might not even have to worry about grep anymore.

(Of course you could just change the html encoding header instead, but utf-8 is the way to go these days.)

PS: don't forget that Windows also has a different line-ending code, which should be changed if you plan on using unix tools on them. Check out tofrodos or similar programs.
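If none of those programs is installed, tr can do the same job. This is a swapped-in alternative rather than one of the tools named above, sketched on a made-up two-line file:

```shell
crlf_file=$(mktemp)
printf 'first line\r\nsecond line\r\n' > "$crlf_file"

cr=$(printf '\r')   # a literal carriage return, for use in the grep pattern

# Count lines ending in a carriage return before and after stripping them.
before=$(grep -c "$cr\$" "$crlf_file")
tr -d '\r' < "$crlf_file" > "$crlf_file.unix"
after=$(grep -c "$cr\$" "$crlf_file.unix" || true)

echo "$before $after"   # prints 2 0
rm -f "$crlf_file" "$crlf_file.unix"
```

Note that tr removes every carriage return in the file, not only those at line ends, which is normally what is wanted for text exported from Windows.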

cygnal · 07-08-2010, 10:25 AM

Ah ok, I created a small file with several of the tags I'm looking to match, with varying amounts of whitespace contained within them. Your grep and sed commands work just as expected, which means the diamond-question-mark unknown characters are throwing things off.

Is there a way to account for such unknown characters using regular expressions?

I've tried changing the html encoding header to utf-8 from iso-8859-1, but that doesn't change anything. If I use iconv to convert them to utf-8 from iso-8859-1, iconv succeeds without complaint, but the display in the browser is the same.

David the H. · 07-09-2010, 08:39 AM

There's nothing in the regex I see that could be affected by the nature of the characters in question, so I doubt highly that they are "throwing it off". The problem is undoubtedly the encoding used, and perhaps the line terminators as well.

Could you please post some actual commands and output? First try "cat" and "cat -A" and give us an example of the html with the offending text. Then try copying some of the browser output and do the same thing.
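As a rough guide to reading the "cat -A" output, here is a sketch that assumes GNU cat and a made-up one-line input containing the byte 0xA0 (a non-breaking space in CP1252) followed by a DOS line ending:

```shell
# Octal \240 is byte 0xA0; \r\n is a DOS line terminator.
visible=$(printf 'a\240b\r\n' | LC_ALL=C cat -A)
echo "$visible"
# GNU cat shows high-bit bytes in M- notation, a carriage return as ^M, and the line end as $.
```

Lines ending in ^M$ point to DOS line terminators, and M- sequences point to bytes outside plain ASCII.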

Did you try converting from CP1252 as I suggested? I've repeatedly come across text files with undisplayable characters (usually print-format versions of quotation marks, hyphens, letters with diacritics, etc.) that were created on Windows computers, and these almost invariably turn out to be CP1252-encoded with DOS line terminators. Note that ISO-8859-1 and CP1252 are very similar but not identical, and the latter is often mislabeled as the former.

So please try the following:

Code:

iconv -f CP1252 -t UTF8 oldfile >newfile

sed -i 's/\r$//' newfile   #or use fromdos, flip or similar

If that doesn't work, try some other encodings.
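To see what the iconv step does at the byte level, here is a small sketch. The input is made up: the single byte 0xA0, which CP1252 defines as a non-breaking space, between two ASCII letters. Whether the real files contain that particular byte is an assumption:

```shell
# Convert three bytes (a, 0xA0, b) from CP1252 to UTF-8 and dump the result as hex.
utf8_hex=$(printf 'a\240b' | iconv -f CP1252 -t UTF-8 | od -An -tx1 | tr -d ' \n')
echo "$utf8_hex"   # prints 61c2a062
```

The lone byte a0 becomes the two-byte sequence c2 a0. A bare a0 is not valid UTF-8 on its own, which is the kind of byte a browser reading the page as UTF-8 would show as a replacement character.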

Actually, what I usually do is open the files up in kwrite and flip through its encoding-selection menu until I find the one that displays everything correctly; then I can simply "save-as" to UTF-8. Your browser should also have an encoding selector which you could use to determine the encoding that's actually being used.

David the H. · 07-09-2010, 08:59 AM

I ran a couple of tests with text files in various formats, adding a few non-ascii characters to the mix. And both grep and sed could handle the utf-8 files just fine, but failed or had unexpected effects when the files were in ISO-8859 or CP1252. The line endings didn't affect them much in this case, however.

So, yeah, it looks like you need to convert them to utf-8 before editing, whether or not you get the characters themselves to display properly.
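For anyone who wants to reproduce this kind of test, here is one possible sketch, assuming GNU grep and a made-up line with the byte 0xA0 inside the span. In a UTF-8 locale that byte is not a valid character and [^>] may decline to match it; forcing the C locale makes grep work byte by byte, which is a workaround and not a substitute for converting the file:

```shell
sample=$(mktemp)
# Hypothetical line: octal \240 (byte 0xA0) sits inside the span, as it might in a CP1252 file.
printf "<span style='mso-spacerun:yes'>\240</span>\n" > "$sample"

# With LC_ALL=C every byte counts as a character, so [^>]+ can match the 0xA0 byte.
c_locale_matches=$(LC_ALL=C grep -cE "<span style='mso-spacerun:yes'>[^>]+>" "$sample")
echo "$c_locale_matches"   # prints 1
rm -f "$sample"
```

Running the same grep without LC_ALL=C and comparing the counts shows whether the locale is what makes the pattern miss.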

cygnal · 07-15-2010, 12:21 PM

HAH. Ok, one of my core problems was that I hadn't noticed that I was using iconv wrong. I wasn't designating an output file, so all it was doing was spitting the UTF-8 encoded material to the terminal.

I took one of the files in question and converted it correctly to UTF-8, and it displays without problem in the browser. Thanks for all your help; apparently I have no need for the regex, and converting to UTF-8 will do just fine.
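For the rest of the files, a loop along these lines could apply the same conversion in bulk. It is only a sketch: the directories and sample pages are made up, CP1252 as the source encoding is an assumption carried over from the thread, and tr is used to drop the DOS carriage returns:

```shell
src_dir=$(mktemp -d)   # stands in for the directory of Word-exported files
out_dir=$(mktemp -d)   # converted copies go here; the originals are not modified
printf '<p>a\240b</p>\r\n' > "$src_dir/page1.html"
printf '<p>caf\351</p>\r\n' > "$src_dir/page2.html"

for f in "$src_dir"/*.html; do
  # Write to a separate file: redirecting onto the input would truncate it before iconv reads it.
  iconv -f CP1252 -t UTF-8 "$f" | tr -d '\r' > "$out_dir/$(basename "$f")"
done

converted=$(ls "$out_dir" | wc -l | tr -d ' ')
echo "$converted"   # prints 2
```

Keeping the output in a separate directory makes it easy to compare a few pages in the browser before replacing anything.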