[SOLVED] Help with sed

jjonas · 01-01-2016, 12:29 PM

Hi,

I have a big HTML file where I'd need to replace this kind of stuff:

Code:

<p><span class="font12" style="italic">Word.</span></p>
<p><span class="font10" style="italic">Two words.</span></p>
<p><span class="font8">Several words, some of which are italicised,</span><span class="font8" style="italic">but not all of them.</span></p>
<p><span class="font12">These words shouldn't be italicised.</span></p>

with this kind of stuff:

Code:

<p><i>Word.</i></p>
<p><i>Two words.</i></p>
<p>Several words, some of which are italicised, <i>but not all of them.</i></p>
<p>These words shouldn't be italicised.</p>

With the sed commands

sed 's|<span class="font[0-9]*" style="italic">|<i>|g'
sed 's|<i>\([^ ][^ ]*\)</span>|<i>\1</i>|g'
sed 's|<span class="font[0-9]*">||g'

..I get:

Code:

<p><i>Word.</i></p>
<p><i>Two words.</span></p>
<p>Several words, some of which are italicised,</span><i>but not all of them.</span></p>
<p>These words shouldn't be italicised.</span></p>

..which is almost what I want. But I can't simply replace all of the </span>'s with nothing, because that will lose the endpoints of the italicisation.

If I change the second command into this:

sed 's|<i>\([^ ][^ ]*\) \([^ ][^ ]*\)</span>|<i>\1 \2</i>|g'

..I can get the second line of the original file right instead of the first, and I guess I could have up to nine separate commands in a sed script file to cover italicised sentences with 1-9 words (nine being the maximum number of remembered patterns in sed), but is there a more elegant way, so that a single command would have sed look for each occurrence of <i>content</span> and replace it with <i>content</i>..?

syg00 · 01-01-2016, 07:07 PM

Trying to mangle html with sed gets ugly - as you've found. You have to be very specific, and know all the variations in layout.
Instead of looking for non blank - [^ ], look for [^<]+

Code:

sed -E 's|(<i>[^<]+)</span>|\1</i>|g' html.input

(untested)
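
For reference, a sketch of how the whole chain might look on the sample lines, using [^<][^<]* so that it also works in plain (basic) regex mode. The function name fix_italics is made up, and the first and third patterns are assumptions about how the span tags are matched; only the middle one is the [^<] idea from above.

```shell
# Hypothetical helper; the first and third patterns are assumptions,
# the middle one is the "[^<] instead of [^ ]" substitution.
fix_italics() {
  sed -e 's|<span class="font[0-9]*" style="italic">|<i>|g' \
      -e 's|\(<i>[^<][^<]*\)</span>|\1</i>|g' \
      -e 's|<span class="font[0-9]*">\([^<]*\)</span>|\1|g'
}

# Two of the sample lines, shortened.
printf '%s\n' \
  '<p><span class="font10" style="italic">Two words.</span></p>' \
  '<p><span class="font8">Some,</span><span class="font8" style="italic">not all.</span></p>' \
  | fix_italics
```

The third expression removes a plain span together with its closing tag, so no stray </span> is left behind. It only handles spans whose content has no nested tags.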

jjonas · 01-02-2016, 03:50 AM

Ok, I tested searching for [^<][^<]* with a real HTML file and it seems to do the trick!

Just for background information, I'm using this on an HTML file which is the product of an OCR (optical character recognition) program which reads scanned book pages (images) and transforms them into text files. I will have to proof-read them anyway, and if there's some rubbish HTML left, it's probably a lot less than all the image-to-text transformation mistakes. So the sed operation to destroy rubbish HTML doesn't have to be perfect.

There's one thing I'd like to ask about the logic with which sed replaces patterns, in order to plan the script file better. If I have the following command:

s|\(<i>[^<][^<]*\)</span>|\1</i>|g

..does sed proceed in the following sequence:

1) it looks for a <
2) if it finds one, then it checks if it's followed by an i, and then a >
3) if not, it goes back to the beginning. if yes, then it checks for at least one character of anything except a <
4) ..and keeps checking until it finds a < (guaranteed, because every line ends with </p>)
5) if it does, it now remembers the pattern that started with the first < – the one from the <i> expression – (including the <) and goes all the way up to (but excluding) the next <
6) then it checks whether the next character is a < (from the </span> expression), and whether it's followed by /, s, p, a, n, >.
7) if any of these checks fail, it goes back to the beginning; if yes, it has now found everything it was asked to look for.
8) it now replaces the whole pattern it found with [pattern1]</i>
9) then it goes back to the beginning and starts to look for the next occurrence of \(<i>[^<][^<]*\)</span> on every row of the file.

The thing I'm trying to gauge with this question is what the risk of accidentally replacing "real text" with this is. If sed proceeds in the way I imagine (the sequence above), it doesn't seem like there's risk of losing crucial information (i.e. actual text, not HTML), because sed will stop at the first < after it has found the first character of the searched-for pattern (which incidentally happens to be a < as well). So even if the actual text does have a < in it (very unlikely, as in an HTML file it would be written &lt;) the outcome in terms of real text won't be affected, because it's highly unlikely that the real text would have /span> follow the already highly unlikely non-HTML < in it, the < that stops the potentially dangerous "look for anything but X" part of the command.
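
One way to check this reasoning is to feed sed a couple of one-line inputs. A small sketch; close_italics is just a made-up wrapper name. Strictly speaking, POSIX only requires that the match be the leftmost one and, among those, the longest, and says nothing about the internal order of the checks. For this pattern the outcome should be the same as in the sequence described above, because [^<] can never step over a <.

```shell
# [^<] cannot match a '<', so a single match can never run
# from one tag into the next.
close_italics() {
  sed 's|\(<i>[^<][^<]*\)</span>|\1</i>|g'
}

# Two italic runs on one line: with the g flag each one is matched separately.
printf '%s\n' '<p><i>one</span> and <i>two</span></p>' | close_italics

# A stray '<' inside the text: the pattern does not match at all,
# so the line passes through unchanged.
printf '%s\n' '<p><i>a < b</span></p>' | close_italics
```

In the second case nothing is lost: the line is simply left as it was, including its </span>, which would then have to be caught in proof-reading.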

ondoho · 01-02-2016, 08:13 AM

Quote:

Originally Posted by syg00

Trying to mangle html with sed gets ugly - as you've found.

i found this to be very true.
fwiw, i've been using xmllint (part of libxml*) very successfully, after messing around with sed for a long time, but my use case is a little different.
maybe there's a dedicated tool for rewriting html documents?

maybe the "replace" command, wrapped in a shell script, can help? it is not versatile, but much simpler to use.

please provide us an example html document, i can help test it.

jjonas · 01-02-2016, 08:48 AM

I've attached two files:

1) an HTML file, which is the HTML output of ABBYY Finereader (an OCR program). The text is that of a Finnish socialist periodical from 1906 (Sosialistinen aikakauslehti, issue 8/1906), so I don't expect most people to make sense of it, but unfortunately I don't have anything scanned in English (as I work only on Finnish texts). I haven't used sed on it, but IIRC I've "Find & Replaced" HTML umlauts with ä and ö, and &raquo; with »;

2) the sed script file which I've tested and run on the mentioned HTML file (as well as others).

At the moment I'm proof-reading the outcome of the sed'ed version of the attached HTML file, and as far as I can tell, all italics are preserved, and almost all of the rubbish HTML is gone. In practice that means that unless someone points out a major flaw in the script file concept, I think I'll stick with this solution because it appears to do almost everything I want now.

ondoho · 01-03-2016, 04:59 AM

the first mention of "SISÄLTÖ" is not italic in the "sedded" file.
some bold styles are not bold anymore.

i'm not sure what you are trying to achieve, so i'm just telling you.
what do you want in the end?
there's still a LOT of manual correction to do...

apart from that, if you're comfortable doing this with sed, why use replace.

ps: terveisiä turkuun!

jjonas · 01-03-2016, 05:41 AM

Quote:

Originally Posted by ondoho

the first mention of "SISÄLTÖ" is not italic in the "sedded" file.
some bold styles are not bold anymore.

Thanks for noticing the SISÄLTÖ! That was simply my confused thinking on the second line of the sed script file: it wiped out bold italics, when it should have kept the italics and wiped out the bold. I've corrected this now. As to the bold styles, they are wiped out by design. Bold styles are mostly used in headlines, and without going into the details, it's easier for my purposes that they're wiped out.

Quote:

Originally Posted by ondoho

i'm not sure what you are trying to achieve, so i'm just telling you. what do you want in the end? there's still a LOT of manual correction to do...

I'm publishing old socialist texts of all varieties on sosialismi.net as pdf (text-searchable instead of just images) and on Marxists Internet Archive as HTML. Without intending to start any off-topic follow-ups, here's a link to a sample pdf with a layout that copies that of the original:
http://sosialismi.net/wp-content/upl...06_numero3.pdf

It's made with Scribus. Styles of imported HTML files are preserved (most importantly the italics, which some authors use a lot), so the sed'ed and proof-read "simple HTML" file can be used as the basis of both pdf and HTML publications. Pdf is good for download to be read and annotated on a tablet, while HTML is kb instead of Mb in terms of size and good for a quick search online.

Quote:

Originally Posted by ondoho

ps: terveisiä turkuun!

mimmottos kummottos!

I'll wait a few more days to see if someone can still answer my question on the logic of how sed replaces stuff (3rd message), but then I'll mark this one as solved. Thanks for all the answers!

pan64 · 01-03-2016, 06:35 AM

I do not really understand your problem:

Code:

sed -r
s                 # this is the substitute command
!                 # search beginning
<span             # first keyword
[^<>]+            # something, excluding < and >
style="italic">   # second keyword
([^<>]+)          # the text we are looking for, grouped
</span>           # last keyword
!                 # search end
<i>\1</i>         # replacement text
!g                # global replacement

i did not check it and probably you need to fix some typos, but actually this logic should work.
But as it was mentioned sed is not the proper way to do that.
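
Joined into one line, the pieces above might look like the sketch below. The wrapper name italic_spans is made up, and -E is used here in place of -r on the assumption that it is the more portable spelling of the same option.

```shell
# The annotated pieces joined into a single expression.
# Assumption: -E (extended regex) is accepted wherever -r would be.
italic_spans() {
  sed -E 's!<span[^<>]+style="italic">([^<>]+)</span>!<i>\1</i>!g'
}

printf '%s\n' \
  '<p><span class="font8">Plain,</span><span class="font8" style="italic">italic.</span></p>' \
  | italic_spans
```

Only spans that carry style="italic" are rewritten. The plain spans are left alone and would need a substitution of their own.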