LinuxQuestions.org
07-07-2010, 11:44 AM   #1
cygnal
LQ Newbie
 
Registered: May 2007
Distribution: Slackware/Debian
Posts: 26

Rep: Reputation: 15
Regular expression to match unspecified number of characters until a '>'


I'm attempting to search through a rather large assortment of html files created in Word using 'save as html'. Ugh. Specifically, what I'm trying to do is find and delete these tags (they're causing browsers to display black diamonds with white question marks):

<span style='mso-spacerun:yes'> </span>

Tags contain from 1 to 4 spaces between opening and closing. I get positive results from this:

grep \<span\ style=\'mso-spacerun:yes\'\> filename.html

but once I attempt to tell it to match any number of characters up until the next '>' symbol, it tells me I'm using an invalid regex:

grep \<span\ style=\'mso-spacerun:yes\'\>[^>]+\> filename.html

I've been nose-deep in regex tutorials for the past day or so, and I'm still not understanding why this doesn't work. If I put the pattern (without backslashes) into a separate file and use `grep -f patternfile filename.html`, I get no error but no matches either. So far as I can figure, the above regex boils down to:
Match the string "<span style='mso-spacerun:yes'>", followed by any number of characters that are not a ">", followed by a ">".

If someone could tell me where I'm going wrong with this, it'd be much appreciated.
 
07-07-2010, 12:17 PM   #2
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian sid + kde 3.5 & 4.4
Posts: 6,823

Rep: Reputation: 1946
[ and ] (and spaces, for that matter) are also special to the shell, so they need to be protected. More importantly, the unescaped > inside the brackets is taken as an output redirection, so grep only receives the pattern up to "[^", which is why it complains about an invalid regex. Try enclosing the expression in double quotes (single quotes won't work, since the string itself contains them). Also, when dealing with complex strings like this, it's usually easier to run grep with extended regex enabled (-E), so that + works as a quantifier and you can avoid all the backslashing.
Code:
grep -E "<span style='mso-spacerun:yes'>[^>]+>" filename.html
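As an aside, a rough way to see why the unquoted version fails is to swap grep for printf and look at the words the shell actually passes along. This is only a sketch, assuming bash or another POSIX shell; the scratch directory and variable names are made up for illustration, and the directory is there because the stray > creates a file.

```shell
# Work in a throwaway directory: the unquoted '>' below creates a file.
scratch=$(mktemp -d)
cd "$scratch"

# Same words as the failing grep command, but handed to printf instead.
# The unescaped '>' inside [^>] is read as an output redirection, so the
# "pattern" is cut off after '[^' and the output lands in a file that is
# literally named ']+>'.
printf '%s\n' \<span\ style=\'mso-spacerun:yes\'\>[^>]+\> filename.html

# Read back the arguments that the command really received, one per line.
args=$(cat ']+>')
printf '%s\n' "$args"
```

The first argument ends in an unclosed "[^", which grep would reject as a regular expression; with the whole pattern inside double quotes the shell leaves both the brackets and the > alone.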
But if you really want to remove them from the file itself, use sed instead which can edit the file in place (as well as make an optional backup).
Code:
sed -i.bkup -r "s|<span style='mso-spacerun:yes'>[^>]+>||g" filename.html
| is used as the delimiter instead of / because / is such a common character in html.

Finally, be aware that none of these will work if the pattern spans multiple lines.
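If the tags do turn out to span lines, one possible workaround (a sketch, assuming GNU sed and files small enough to read into memory) is to pull the whole file into the pattern space before substituting. The sample file and variable names here are made up for illustration.

```shell
# Hypothetical sample: the span's closing tag sits on the next line.
tmp=$(mktemp)
printf '%s\n' "<p>one<span style='mso-spacerun:yes'>  " "</span>two</p>" > "$tmp"

# ':a' / '$!{N;ba' / '}' keeps appending lines until the last one, so the
# substitution sees the file as a single string and [^>]+ can cross the
# embedded newline.
joined=$(sed -E -e ':a' -e '$!{N;ba' -e '}' \
    -e "s|<span style='mso-spacerun:yes'>[^>]+>||g" "$tmp")
printf '%s\n' "$joined"
rm -f "$tmp"
```

Note that the removed span takes its line break with it, so the two halves of the paragraph end up on one line.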

Last edited by David the H.; 07-07-2010 at 12:23 PM. Reason: added explanation and fixed formatting error
 
07-07-2010, 03:34 PM   #3
cygnal
LQ Newbie
 
Registered: May 2007
Distribution: Slackware/Debian
Posts: 26

Original Poster
Rep: Reputation: 15
Thanks for the help, David.

I had tried the double quotes surrounding the expression, which got me no results at all, so I figured that it was looking for the double quotes themselves as part of the string. I've also made all my attempts with and without the -E switch and haven't noticed a difference. Backslash-escaping, as in the first pattern,

grep \<span\ style=\'mso-spacerun:yes\'\> filename.html

gives the matches that I know are in the file I'm working with, but escaping the square brackets

grep \<span\ style=\'mso-spacerun:yes\'\>\[^>\]+\> filename.html

still got me an 'invalid regular expression' error, which led me to believe I was giving wrong syntax when including the [^>]+> bit at the end.

Sed is my plan once I concoct the regex I need. The example you gave didn't work, though. I'm wondering if it's because the characters enclosed in the span tag aren't showing properly (they display as question marks within diamonds). But they're still counted as characters that don't match ">", even if they don't display correctly, right?
 
07-07-2010, 05:39 PM   #4
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian sid + kde 3.5 & 4.4
Posts: 6,823

Rep: Reputation: 1946
The line works perfectly for me using a test file when quoted as I showed. The outer double-quotes escape the inner single-quotes, so that's not a problem. They protect the contents from the shell long enough to pass them to grep, and are consumed in the process, so no, they shouldn't be interfering with the expression itself.

The only thing I can suggest is to look very closely at what you're trying to match. Is that space really a standard ASCII space, for example?
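One way to check is to dump the raw bytes. The sketch below uses a made-up sample line in which the "space" is really byte 0xA0 (a non-breaking space in ISO-8859-1 and CP1252); whether the real files contain that byte is only a guess.

```shell
# Hypothetical sample: the byte inside the span is 0xA0, not an ASCII
# space (0x20). \240 is the octal form of 0xA0.
sample=$(mktemp)
printf "<span style='mso-spacerun:yes'>\240</span>\n" > "$sample"

# Dump the bytes in hex and join them; an ASCII space would appear as 20.
hex=$(od -An -tx1 "$sample" | tr -d ' \n')
printf '%s\n' "$hex"

# Count the lines holding any byte outside printable ASCII (space to ~).
non_ascii_lines=$(LC_ALL=C grep -c '[^ -~]' "$sample")
printf '%s\n' "$non_ascii_lines"
```

In the hex dump, look at what sits between 3e (">") and 3c ("<"). Anything other than 20 means the tag does not contain a plain space.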

But thinking about it a bit more, those "black diamonds" are usually a sign that the encoding of the file is different from that of the display program. I'd lay good odds that they're in the Windows cp1252 encoding. If you run them through iconv or open-convert-resave to utf-8 in a decent editor like kwrite or gedit, you might not even have to worry about grep anymore.

(Of course you could just change the html encoding header instead, but utf-8 is the way to go these days.)

PS: don't forget that Windows also has a different line-ending code, which should be changed if you plan on using unix tools on them. Check out tofrodos or similar programs.

Last edited by David the H.; 07-07-2010 at 05:42 PM.
 
07-08-2010, 10:25 AM   #5
cygnal
LQ Newbie
 
Registered: May 2007
Distribution: Slackware/Debian
Posts: 26

Original Poster
Rep: Reputation: 15
Ah ok, I created a small file with several of the tags I'm looking to match, with varying amounts of whitespace contained within them. Your grep and sed commands work just as expected, which means the diamond-question-mark unknown characters are throwing things off.

Is there a way to account for such unknown characters using regular expressions?

I've tried changing the html encoding header to utf-8 from iso-8859-1, but that doesn't change anything. If I use iconv to convert them to utf-8 from iso-8859-1, iconv succeeds without complaint, but the display in the browser is the same.
 
07-09-2010, 08:39 AM   #6
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian sid + kde 3.5 & 4.4
Posts: 6,823

Rep: Reputation: 1946
There's nothing in the regex I see that could be affected by the nature of the characters in question, so I doubt highly that they are "throwing it off". The problem is undoubtedly the encoding used, and perhaps the line terminators as well.

Could you please post some actual commands and output? First try "cat" and "cat -A" and give us an example of the html with the offending text. Then try copying some of the browser output and do the same thing.

Did you try converting from CP1252 as I suggested? I've repeatedly come across text files with undisplayable characters (usually text with print-format versions of quotation marks, hyphens, letters with diacritics, etc.) that were created on Windows computers, and these almost inevitably turn out to be CP1252-encoded with DOS line terminators. Note that ISO-8859-1 and CP1252 are very similar, but not identical, and the latter is often mislabeled as the former.

So please try the following:
Code:
iconv -f CP1252 -t UTF-8 oldfile >newfile

sed -i 's/\r$//' newfile   #strips a trailing CR only (GNU sed); or use fromdos, flip or similar
If that doesn't work, try some other encodings.

Actually, what I usually do is open the files up in kwrite, and flip through its encoding-selection menu until I find the one that displays everything correctly, then I can simply "save-as" to UTF-8. Your browser should also have an encoding selector which you could use to determine the encoding that's actually being used.
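For a pile of files, a command-line pre-check may save some clicking. The sketch below relies on iconv exiting with a non-zero status when it meets an invalid input sequence; the is_utf8 helper name and both sample files are made up for illustration.

```shell
# is_utf8 FILE: succeed only if FILE decodes cleanly as UTF-8.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Hypothetical test files: curly quotes stored as CP1252 bytes (0x93/0x94)
# in one, and as UTF-8 sequences (e2 80 9c / e2 80 9d) in the other.
legacy=$(mktemp)
modern=$(mktemp)
printf '\223quoted\224\n' > "$legacy"
printf '\342\200\234quoted\342\200\235\n' > "$modern"

for f in "$legacy" "$modern"; do
    if is_utf8 "$f"; then
        echo "$f: valid UTF-8, leave as is"
    else
        echo "$f: not UTF-8, try: iconv -f CP1252 -t UTF-8"
    fi
done
```

Keep in mind that this only tells you a file is not UTF-8. It cannot tell you which legacy encoding it is, so the eyeball test in an editor or browser is still needed for that part.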

Last edited by David the H.; 07-09-2010 at 08:40 AM.
 
07-09-2010, 08:59 AM   #7
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian sid + kde 3.5 & 4.4
Posts: 6,823

Rep: Reputation: 1946
I ran a couple of tests with text files in various formats, adding a few non-ASCII characters to the mix. Both grep and sed handled the UTF-8 files just fine, but failed or had unexpected effects when the files were in ISO-8859 or CP1252. The line endings didn't affect them much in this case, however.

So, yeah, it looks like you need to convert them to utf-8 before editing, whether or not you get the characters themselves to display properly.
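A plausible explanation (I haven't confirmed it against the original files) is that in a UTF-8 locale a lone byte such as 0xA0 is not a valid character, so [^>] refuses to match it. If converting first isn't an option, forcing the C locale for the one command might be enough, since every byte then counts as a character. A sketch with a made-up sample line:

```shell
# Hypothetical Latin-1/CP1252 line with a 0xA0 byte inside the span.
line=$(mktemp)
printf "a<span style='mso-spacerun:yes'>\240</span>b\n" > "$line"

# In the C locale sed works byte by byte, so [^>]+ can match the 0xA0
# byte whatever the file's real encoding is.
cleaned=$(LC_ALL=C sed -E "s|<span style='mso-spacerun:yes'>[^>]+>||g" "$line")
printf '%s\n' "$cleaned"
```

This only removes the tags. Any other non-ASCII text in the file stays in its old encoding, so converting to UTF-8 is still the cleaner fix.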
 
07-15-2010, 12:21 PM   #8
cygnal
LQ Newbie
 
Registered: May 2007
Distribution: Slackware/Debian
Posts: 26

Original Poster
Rep: Reputation: 15
HAH. Ok, one of my core problems was that I hadn't noticed that I was using iconv wrong. I wasn't designating an output file, so all it was doing was spitting the UTF-8 encoded material to the terminal.

I took one of the files in question and converted it correctly to UTF-8, and it displays without problem in the browser. Thanks for all your help, apparently I have no need for the regex and converting to UTF-8 will do just fine.
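For anyone who lands here with the same mistake, this is roughly the difference, sketched with a made-up CP1252 sample (the file and variable names are only for illustration). iconv writes to standard output unless told otherwise, so the redirect is what puts the converted text on disk.

```shell
# Hypothetical CP1252 input: curly quotes stored as bytes 0x93 and 0x94.
oldfile=$(mktemp)
newfile=$(mktemp)
printf '\223ok\224\n' > "$oldfile"

# Without the redirect the converted text would only scroll past on the
# terminal; with it, the UTF-8 version ends up in a new file.
iconv -f CP1252 -t UTF-8 "$oldfile" > "$newfile"

# The input is untouched, and the output holds the UTF-8 byte sequences.
old_hex=$(od -An -tx1 "$oldfile" | tr -d ' \n')
new_hex=$(od -An -tx1 "$newfile" | tr -d ' \n')
printf '%s\n%s\n' "$old_hex" "$new_hex"
```

Don't redirect back onto the input file itself; the shell truncates it before iconv gets to read it.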
 
  


All times are GMT -5. The time now is 03:57 PM.
