LinuxQuestions.org
Linux - Software This forum is for Software issues.
Having a problem installing a new program? Want to know which application is best for the job? Post your question in this forum.

Old 01-01-2016, 12:29 PM   #1
jjonas
Member
 
Registered: Jul 2005
Location: Finland
Distribution: Arch Linux
Posts: 80

Rep: Reputation: 15
Help with sed


Hi,

I have a big HTML file where I'd need to replace this kind of stuff:

Code:
<p><span class="font12" style="italic">Word.</span></p>
<p><span class="font10" style="italic">Two words.</span></p>
<p><span class="font8">Several words, some of which are italicised,</span><span class="font8" style="italic">but not all of them.</span></p>
<p><span class="font12">These words shouldn't be italicised.</span></p>
with this kind of stuff:

Code:
<p><i>Word.</i></p>
<p><i>Two words.</i></p>
<p>Several words, some of which are italicised, <i>but not all of them.</i></p>
<p>These words shouldn't be italicised.</p>
With the sed commands
sed 's|<span class="font[0-9][0-9]*" style="italic">|<i>|g'
sed 's|<i>\([^ ][^ ]*\)</span>|<i>\1</i>|g'
sed 's|<span class="font[0-9][0-9]*">||g'
..I get:

Code:
<p><i>Word.</i></p>
<p><i>Two words.</span></p>
<p>Several words, some of which are italicised,</span><i>but not all of them.</span></p>
<p>These words shouldn't be italicised.</span></p>
..which is almost what I want. But I can't simply replace all of the </span>'s with nothing, because that will lose the endpoints of the italicisation.

If I change the second command into this:
sed 's|<i>\([^ ][^ ]*\) \([^ ][^ ]*\)</span>|<i>\1 \2</i>|g'
..I can get the second line of the original file right instead of the first, and I guess I could have up to nine separate commands in a sed script file to cover italicised sentences with 1-9 words (nine being the sed remembered patterns maximum), but is there a more elegant way so that I could use a single command to have sed look for each occurrence of <i>content</span>, and replace that with <i>content</i>..?
 
Old 01-01-2016, 07:07 PM   #2
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,127

Rep: Reputation: 4120
Trying to mangle html with sed gets ugly - as you've found. You have to be very specific, and know all the variations in layout.
Instead of looking for non blank - [^ ], look for [^<]+
Code:
sed -E 's|(<i>[^<]+)</span>|\1</i>|g' html.input
(untested)
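
For reference, here is a minimal sketch of the whole clean-up as one sed invocation, using the portable [^<][^<]* form so that it does not depend on extended regular expressions. The function name strip_spans is made up for illustration. The last expression, which drops the leftover </span> tags, is an assumption and not something posted in this thread; it is only safe because the italic end tags have already been turned into </i> by that point.

```shell
#!/bin/sh
# strip_spans: hypothetical helper, reads HTML on stdin, writes cleaned HTML to stdout.
strip_spans() {
  sed \
    -e 's|<span class="font[0-9][0-9]*" style="italic">|<i>|g' \
    -e 's|\(<i>[^<][^<]*\)</span>|\1</i>|g' \
    -e 's|<span class="font[0-9][0-9]*">||g' \
    -e 's|</span>||g'   # assumption: every remaining </span> is rubbish
}

# The four sample lines from the first post.
strip_spans <<'EOF'
<p><span class="font12" style="italic">Word.</span></p>
<p><span class="font10" style="italic">Two words.</span></p>
<p><span class="font8">Several words, some of which are italicised,</span><span class="font8" style="italic">but not all of them.</span></p>
<p><span class="font12">These words shouldn't be italicised.</span></p>
EOF
```

The order of the expressions matters: the italic spans have to be closed with </i> before the catch-all </span> removal runs. Note that sed does not add the space before <i> on the third line, because the input has none there.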

Last edited by syg00; 01-01-2016 at 07:10 PM. Reason: syntax - like I said, untested
 
Old 01-02-2016, 03:50 AM   #3
jjonas
Member
 
Registered: Jul 2005
Location: Finland
Distribution: Arch Linux
Posts: 80

Original Poster
Rep: Reputation: 15
Ok, I tested searching for [^<][^<]* with a real HTML file and it seems to do the trick!

Just for background information, I'm using this on an HTML file which is the product of an OCR (optical character recognition) program which reads scanned book pages (images) and transforms them into text files. I will have to proof-read them anyway, and if there's some rubbish HTML left, it's probably a lot less than all the image-to-text transformation mistakes. So the sed operation to destroy rubbish HTML doesn't have to be perfect.

There's one thing I'd like to ask about the logic with which sed replaces patterns, in order to plan the script file better. If I have the following command:

s|<i>\([^<][^<]*\)</span>|<i>\1</i>|g

..does sed proceed in the following sequence:

1) it looks for a <
2) if it finds one, then it checks whether it's followed by an i, and then a >
3) if not, it goes back to the beginning. if yes, then it checks for at least one character of anything except a <
4) ..and keeps checking until it finds a < (guaranteed, because every line ends with </p>.)
5) if it does, it now remembers the pattern that started with the first < – the one from the <i> expression – (including the <) and goes all the way up to (but excluding) the next <
6) then it checks whether the next character is a < (from the </span> expression), and whether it's followed by /, s, p, a, n, >.
7) if any of these checks fail, it goes back to the beginning; if yes, it has now found everything it was asked to look for.
8) it now replaces the whole pattern it found with <i>[pattern1]</i>
9) then it goes back to the beginning and starts to look for the next occurrence of <i>\([^<][^<]*\)</span> on every row of the file.

The thing I'm trying to gauge with this question is what the risk of accidentally replacing "real text" with this is. If sed proceeds in the way I imagine (the sequence above), it doesn't seem like there's risk of losing crucial information (i.e. actual text, not HTML), because sed will stop at the first < after it has found the first character of the searched-for pattern (which incidentally happens to be a < as well). So even if the actual text does have a < in it (very unlikely, as in an HTML file it would be written &lt;) the outcome in terms of real text won't be affected, because it's highly unlikely that the real text would have /span> follow the already highly unlikely non-HTML < in it, the < that stops the potentially dangerous "look for anything but X" part of the command.
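
A quick way to check this reasoning is to feed the command a few edge cases. The sketch below is only an illustration under assumed inputs; the function name close_italics and the three test lines are made up. In terms of results the sequence above holds: POSIX regex matching takes the leftmost match on the line, [^<][^<]* can never run past a <, and the g flag repeats the search after each replacement.

```shell
#!/bin/sh
# close_italics: hypothetical wrapper around the command from this post.
close_italics() {
  sed 's|<i>\([^<][^<]*\)</span>|<i>\1</i>|g'
}

# 1) two italic runs on one line: both are closed (g flag)
# 2) a stray "<" in the text: the match fails, the line is left untouched
# 3) an empty italic run: at least one character is required, so no match
printf '%s\n' \
  '<i>one</span> and <i>two</span>' \
  '<i>a < b</span>' \
  '<i></span>' | close_italics
```

Only the first line changes, to <i>one</i> and <i>two</i>. The other two come out exactly as they went in, so a stray < costs a missed replacement rather than lost text.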
 
Old 01-02-2016, 08:13 AM   #4
ondoho
LQ Addict
 
Registered: Dec 2013
Posts: 19,872
Blog Entries: 12

Rep: Reputation: 6053
Quote:
Originally Posted by syg00
Trying to mangle html with sed gets ugly - as you've found.
i found this to be very true.
fwiw, i've been using xmllint (part of libxml*) very successfully, after messing around with sed for a long time, but my use case is a little different.
maybe there's a dedicated tool for rewriting html documents?

maybe the "replace" command, wrapped in a shell script, can help? it is not versatile, but much simpler to use.

please provide us an example html document, i can help test it.
 
Old 01-02-2016, 08:48 AM   #5
jjonas
Member
 
Registered: Jul 2005
Location: Finland
Distribution: Arch Linux
Posts: 80

Original Poster
Rep: Reputation: 15
I've attached two files:

1) an HTML file, which is the HTML output of ABBYY Finereader (an OCR program). The text is that of a Finnish socialist periodical from 1906 (Sosialistinen aikakauslehti, issue 8/1906), so I don't expect most people to make sense of it, but unfortunately I don't have anything scanned in English (as I work only on Finnish texts). I haven't used sed on it, but IIRC I've "Find & Replaced" HTML umlauts with ä and ö, and &raquo; with »;

2) the sed script file which I've tested and run on the mentioned HTML file (as well as others).

At the moment I'm proof-reading the outcome of the sed'ed version of the attached HTML file, and as far as I can tell, all italics are preserved, and almost all of the rubbish HTML is gone. In practice that means that unless someone points out a major flaw in the script file concept, I think I'll stick with this solution because it appears to do almost everything I want now.
Attached Files
File Type: txt sos_aikakauslehti_8nro.htm.txt (108.7 KB, 8 views)
File Type: txt sed-komento.txt (789 Bytes, 8 views)
 
Old 01-03-2016, 04:59 AM   #6
ondoho
LQ Addict
 
Registered: Dec 2013
Posts: 19,872
Blog Entries: 12

Rep: Reputation: 6053
the first mention of "SISÄLTÖ" is not italic in the "sedded" file.
some bold styles are not bold anymore.

i'm not sure what you are trying to achieve, so i'm just telling you.
what do you want in the end?
there's still a LOT of manual correction to do...

apart from that, if you're comfortable doing this with sed, why use replace?

ps: greetings to Turku!
 
Old 01-03-2016, 05:41 AM   #7
jjonas
Member
 
Registered: Jul 2005
Location: Finland
Distribution: Arch Linux
Posts: 80

Original Poster
Rep: Reputation: 15
Quote:
Originally Posted by ondoho
the first mention of "SISÄLTÖ" is not italic in the "sedded" file.
some bold styles are not bold anymore.
Thanks for noticing the SISÄLTÖ! That was simply my confused thinking on the second line of the sed script file: it wiped out bold italics, when it should have kept the italics and wiped out the bold. I've corrected this now. As to the bold styles, they are wiped out by design. Bold styles are mostly used in headlines, and without going into the details, it's easier for my purposes that they're wiped out.

Quote:
Originally Posted by ondoho
i'm not sure what you are trying to achieve, so i'm just telling you. what do you want in the end? there's still a LOT of manual correction to do...
I'm publishing old socialist texts of all varieties on sosialismi.net as pdf (text-searchable instead of just images) and on Marxists Internet Archive as HTML. Without intending to start any off-topic follow-ups, here's a link to a sample pdf with a layout that copies that of the original:
http://sosialismi.net/wp-content/upl...06_numero3.pdf

It's made with Scribus. Styles of imported HTML files are preserved (most importantly the italics, which some authors use a lot), so the sed'ed and proof-read "simple HTML" file can be used as the basis of both pdf and HTML publications. Pdf is good for download to be read and annotated on a tablet, while HTML is kb instead of Mb in terms of size and good for a quick search online.

Quote:
Originally Posted by ondoho
ps: greetings to Turku!
mimmottos kummottos!

I'll wait a few more days in case someone can still answer my question on the logic of how sed replaces stuff (3rd message), and then I'll mark this one as solved. Thanks for all the answers!
 
Old 01-03-2016, 06:35 AM   #8
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,848

Rep: Reputation: 7309
I do not really understand your problem:
Code:
sed -r
s                 # this is the substitute command
!                 # search beginning
<span             # first keyword
[^<>]+            # something, excluding < and >
style="italic">   # second keyword
([^<>]+)          # the text we are looking for, grouped
</span>           # last keyword
!                 # search end
<i>\1</i>         # replacement text
!g                # global replacement
i did not check it and you will probably need to fix some typos, but the logic should work.
But as was mentioned, sed is not really the proper tool for this.
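
Assembled into a single command, the breakdown above might look like the sketch below. It is an assumption about how the pieces fit together, not a tested version from the poster: -E is used in place of -r because both GNU and BSD sed accept it, | replaces ! as the delimiter to match the rest of the thread, and the function name italic_spans_to_i is made up for illustration.

```shell
#!/bin/sh
# italic_spans_to_i: hypothetical helper. It replaces a whole italic span,
# opening tag and closing tag together, in one substitution.
italic_spans_to_i() {
  sed -E 's|<span[^<>]+style="italic">([^<>]+)</span>|<i>\1</i>|g'
}

printf '%s\n' \
  '<p><span class="font12" style="italic">Word.</span></p>' \
  '<p><span class="font8">Plain,</span><span class="font8" style="italic">italic.</span></p>' \
  | italic_spans_to_i
```

Because the opening and closing tags are matched as a pair, there is no intermediate state with <i>...</span> in it, and non-italic spans are left alone for a later expression to strip.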
 
  


All times are GMT -5.