Regular expression to match unspecified number of characters until a '>'
I'm attempting to search through a rather large assortment of html files created in Word using 'save as html'. Ugh. Specifically, what I'm trying to do is find and delete these tags (they're causing browsers to display black diamonds with white question marks):
<span style='mso-spacerun:yes'> </span>
Tags contain from 1 to 4 spaces between opening and closing. I get positive results from this:
I've been nose-deep in regex tutorials for the past day or so, and I'm still not understanding why this doesn't work. If I put the pattern (without backslashes) into a separate file and use `grep -f patternfile filename.html`, I get no error but no matches either. So far as I can figure, the above regex boils down to:
Match the string "<span style='mso-spacerun:yes'>", followed by any number of characters that are not a ">", followed by a ">".
If someone could tell me where I'm going wrong with this, it'd be much appreciated.
[ and ] (and spaces for that matter) are also used by the shell itself, so they need to be protected. Try enclosing the expression in double quotes (can't use single quotes, since the string itself has them). Also, when dealing with complex strings like this, it's usually easier to run grep with extended regex enabled (the -E switch) so you can avoid all the backslashing.
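As a rough sketch of the quoting advice above: the sample line and the variable names below are made up for illustration, and the pattern is simply the one described in the question (the span's opening tag, then one or more characters that are not ">", then a ">").

```shell
# Hypothetical sample line standing in for the Word-generated HTML.
sample=$(mktemp)
printf '%s\n' "<p>a<span style='mso-spacerun:yes'>  </span>b</p>" > "$sample"

# Double quotes protect [, ], > and the spaces from the shell, and the
# single quotes inside them are passed through untouched.
# With -E, + is a quantifier and needs no backslash.
pattern="<span style='mso-spacerun:yes'>[^>]+>"

# Count the lines that contain a match.
count=$(grep -cE "$pattern" "$sample")
echo "$count"

# Preview the deletion with sed before editing anything in place.
# The pattern contains no /, so / is safe as the s delimiter here.
cleaned=$(sed -E "s/$pattern//g" "$sample")
echo "$cleaned"

rm -f "$sample"
```

On plain-ASCII input like this the count is 1 and the span disappears from the line. Once the preview looks right, the same sed expression could be run with -i on the real files.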
I had tried the double quotes surrounding the expression, which got me no results at all, so I figured that it was looking for the double quotes themselves as part of the string. I've also made all my attempts with and without the -E switch and haven't noticed a difference. Backslash-escaping, as in the first pattern, still got me an 'invalid regular expression' error, which led me to believe I was giving wrong syntax when including the [^>]+> bit at the end.
Sed is my plan once I concoct the right regex that I need. The example you gave didn't work. I'm wondering if it's because the characters enclosed in the span tag aren't showing properly; they display as question marks within diamonds. But they're still counted as characters that don't match ">", even if they don't display correctly, right?
The line works perfectly for me using a test file when quoted as I showed. The outer double-quotes escape the inner single-quotes, so that's not a problem. They protect the contents from the shell long enough to pass them to grep, and are consumed in the process, so no, they shouldn't be interfering with the expression itself.
The only thing I can suggest is to look very closely at what you're trying to match. Is that space really a standard ascii space, for example?
But thinking about it a bit more, those "black diamonds" are usually a sign that the encoding of the file is different from that of the display program. I'd lay good odds that they're in the Windows cp1252 encoding. If you run them through iconv or open-convert-resave to utf-8 in a decent editor like kwrite or gedit, you might not even have to worry about grep anymore.
(Of course you could just change the html encoding header instead, but utf-8 is the way to go these days.)
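To illustrate why a mismatched encoding produces those diamonds, here is a small sketch. It assumes, purely for the sake of example, that the problem byte is 0xA0 (the non-breaking space in CP1252); the thread doesn't say which bytes are actually in the files.

```shell
# Assumption for illustration: the byte inside the span is 0xA0 (octal 240),
# the non-breaking space in both CP1252 and ISO-8859-1.

# The raw byte as stored in the file.
raw=$(printf '\240' | od -An -tx1 | tr -d ' \n')

# The same character after conversion to UTF-8.
converted=$(printf '\240' | iconv -f CP1252 -t UTF-8 | od -An -tx1 | tr -d ' \n')

echo "before: $raw"
echo "after:  $converted"
```

A lone a0 byte is not a valid UTF-8 sequence, so a program that reads the file as UTF-8 has nothing to display for it and falls back to the replacement character. After conversion the same character is stored as the two bytes c2 a0, which is valid UTF-8.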
PS: don't forget that Windows also has a different line-ending code, which should be changed if you plan on using unix tools on them. Check out tofrodos or similar programs.
Last edited by David the H.; 07-07-2010 at 05:42 PM.
Ah ok, I created a small file with several of the tags I'm looking to match, with varying amounts of whitespace contained within them. Your grep and sed commands work just as expected, which means the diamond-question-mark unknown characters are throwing things off.
Is there a way to account for such unknown characters using regular expressions?
I've tried changing the html encoding header to utf-8 from iso-8859-1, but that doesn't change anything. If I use iconv to convert them to utf-8 from iso-8859-1, iconv succeeds without complaint, but the display in the browser is the same.
There's nothing in the regex I see that could be affected by the nature of the characters in question, so I doubt highly that they are "throwing it off". The problem is undoubtedly the encoding used, and perhaps the line terminators as well.
Could you please post some actual commands and output? First try "cat" and "cat -A" and give us an example of the html with the offending text. Then try copying some of the browser output and do the same thing.
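For anyone unsure what to look for in that output, a minimal sketch with a made-up one-line input (this assumes GNU cat, which is where the -A option comes from):

```shell
# Hypothetical input: one line of text with a DOS line ending (CR LF).
shown=$(printf 'text\r\n' | cat -A)
echo "$shown"
```

The carriage return shows up as ^M and the end of the line as $, so a DOS-terminated line ends in ^M$ while a unix one ends in a bare $. Bytes outside ascii are printed in M- notation, which makes a non-standard "space" easy to spot.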
Did you try converting from CP1252 as I suggested? I've repeatedly come across text files with undisplayable characters (usually text with print-format versions of quotation marks, hyphens, letters with diacriticals, etc.) that were created on Windows computers, and these almost inevitably turn out to be CP1252 encoded with dos line terminators. Note that ISO-8859-1 and CP1252 are very similar, but not identical, and the latter is often mislabeled as being the former.
So please try the following:
Code:
iconv -f CP1252 -t UTF8 oldfile >newfile
sed -i 's/\r$//' newfile #strips only a trailing carriage return (GNU sed); or use fromdos, flip or similar
If that doesn't work, try some other encodings.
Actually, what I usually do is open the files up in kwrite, and flip through its encoding-selection menu until I find the one that displays everything correctly, then I can simply "save-as" to UTF-8. Your browser should also have an encoding selector which you could use to determine the encoding that's actually being used.
Last edited by David the H.; 07-09-2010 at 08:40 AM.
I ran a couple of tests with text files in various formats, adding a few non-ascii characters to the mix. And both grep and sed could handle the utf-8 files just fine, but failed or had unexpected effects when the files were in ISO-8859 or CP1252. The line endings didn't affect them much in this case, however.
So, yeah, it looks like you need to convert them to utf-8 before editing, whether or not you get the characters themselves to display properly.
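One way to tell whether a given file still needs converting is to ask iconv to read it as UTF-8 and see whether it complains. This is only a sketch: is_utf8 is a hypothetical helper name, and the two test files are made up for the example.

```shell
# is_utf8 is a hypothetical helper, not a standard command.
# iconv exits non-zero when the input is not valid in the -f encoding.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

good=$(mktemp)
bad=$(mktemp)
printf 'caf\303\251\n' > "$good"   # e-acute stored as UTF-8 (bytes c3 a9)
printf 'caf\351\n' > "$bad"        # the same letter as one CP1252 byte (e9)

if is_utf8 "$good"; then good_result=valid; else good_result=invalid; fi
if is_utf8 "$bad"; then bad_result=valid; else bad_result=invalid; fi

echo "good file: $good_result"
echo "bad file:  $bad_result"

rm -f "$good" "$bad"
```

Keep in mind that a pure-ascii file also passes this check, and passing only means the bytes are well-formed UTF-8, not that the text was meant to be read that way.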
HAH. Ok, one of my core problems was that I hadn't noticed that I was using iconv wrong. I wasn't designating an output file, so all it was doing was spitting the UTF-8 encoded material to the terminal.
I took one of the files in question and converted it correctly to UTF-8, and it displays without problem in the browser. Thanks for all your help. Apparently I have no need for the regex, and converting to UTF-8 will do just fine.
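For a whole directory of files, the conversion could be looped along these lines. This is a sketch only: the directory and file name are invented for the example, and it assumes every file really is CP1252, as suggested earlier in the thread.

```shell
# Hypothetical working directory holding one CP1252-encoded file.
workdir=$(mktemp -d)
printf 'a\240b\n' > "$workdir/page.html"

for f in "$workdir"/*.html; do
    # iconv writes to stdout, so the output must be redirected somewhere.
    # Write to a temporary file first: redirecting onto the input file
    # itself would truncate it before iconv gets to read it.
    if iconv -f CP1252 -t UTF-8 "$f" > "$f.tmp"; then
        mv "$f.tmp" "$f"
    else
        echo "conversion failed, left untouched: $f" >&2
        rm -f "$f.tmp"
    fi
done

# Show the converted bytes.
result=$(od -An -tx1 "$workdir/page.html" | tr -d ' \n')
echo "$result"

rm -rf "$workdir"
```

Running the loop a second time on already-converted files would double-encode them, so it is worth keeping a backup copy of the originals before trying this on real data.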