[SOLVED] Insert a comment in html file based on its contents

diaco · 06-10-2010, 07:23 AM

I have multiple HTML files in a folder. There is an <h2> tag like this:

Code:

<h2>some text</h2>

in each file.
I want to write a shell script/batch file to add this comment to the <head> section of each file:

Code:

<!-- TITLE= "same text from h2 tag" -->

Note that in some files <h2>some text</h2> contains one or more line breaks, so I couldn't capture the tag content using a simple grep or...
For example:
<h2>first part of text
second line of text</h2>

The line break shouldn't appear in the TITLE comment.
The script has to capture the tag content and skip line breaks.
Can anybody help me?

vonbiber · 06-10-2010, 10:03 AM

I would go this way:
1. write a sed script:

<code>
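#!/bin/sed -f

# Read the whole file into the pattern space
:loop
N
$!b loop

# Mark the first <h2>...</h2> pair with the cent and pound characters
s?<h2>?¢?
s?</h2>?£?

# Copy the marked text in front of </head> and restore the <h2> tags below
s?^\(.*\)\(</[hH][eE][aA][dD]>[^¢]*\)¢\([^£]*\)£?\1¢\3£\2<h2>\3</h2>?

# Replace up to three line breaks inside the copied text with spaces
s?\(¢\)\([^\n£]*\)\n?\1\2 ?
s?\(¢\)\([^\n£]*\)\n?\1\2 ?
s?\(¢\)\([^\n£]*\)\n?\1\2 ?

# Turn the remaining markers into the comment requested in the question
s?¢?<!-- TITLE= "?
s?£?" -->?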
</code>

The first 3 commands read the whole input file into the pattern space
as a single string, so that '\n' (the newline character) can be
treated as an ordinary character.
I use the 'cent' and 'British pound' characters as delimiters
(these are unlikely to be found in an HTML file) to mark
what's between the '<h2>' and '</h2>' tags.

Then I copy that content to just before the </head> tag,
surrounded by 'cent' and 'pound', and replace the original 'cent' and 'pound'
below (the ones that appear after </head>) with <h2> and </h2> again.

The next 3 lines each replace one newline character in the copied text with a space, so up to 3 line breaks are handled.
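
If a heading may span more than three line breaks, the repeated substitution could be written as a loop instead. The following is only a sketch, not part of the script above: the function name join_marked_lines is hypothetical, '@' and '%' are ASCII stand-ins for the 'cent' and 'pound' markers, and GNU sed is assumed (for '\n' inside a bracket expression).

```shell
# Join every line break that falls between the start marker '@' and the
# end marker '%'. '@' and '%' are stand-ins (an assumption) for the cent
# and pound characters; GNU sed is assumed for '\n' in a bracket expression.
join_marked_lines() {
    # First loop: append all input lines to the pattern space.
    # Second loop: replace one newline after '@' (and before '%') with a
    # space, and repeat for as long as a substitution succeeds.
    sed -e ':join' -e '$!{' -e 'N' -e 'b join' -e '}' \
        -e ':again' -e 's/\(@[^%\n]*\)\n/\1 /' -e 't again'
}

# Example: three heading lines are joined, the line after '%' is untouched.
printf '@first part\nsecond line\nthird%%\nrest\n' | join_marked_lines
```

The 't again' command branches back only when the preceding substitution changed something, so the loop stops as soon as no newline is left between the two markers.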

The last 2 lines replace 'cent' and 'pound' by the opening '<!-- TITLE= "' and
the closing '" -->', respectively.

All you need to do is save the sed script, e.g. as foo.sed, make it executable (chmod +x foo.sed),
then run

<code>
./foo.sed your_html_file > output_html_file
</code>
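
Since the question is about a whole folder of files, the same sed script could be applied to each file in a loop. This is a minimal sketch under some assumptions: the function name apply_sed_to_html is hypothetical, the files are assumed to end in .html, and each file is overwritten in place, so it is worth working on a copy of the folder first.

```shell
# Apply a sed script to every .html file in a directory, in place.
# Usage: apply_sed_to_html SCRIPT DIR
# (The function name and the .html suffix are assumptions for illustration.)
apply_sed_to_html() {
    script=$1
    dir=$2
    for f in "$dir"/*.html; do
        [ -f "$f" ] || continue       # the glob matched nothing
        tmp=$(mktemp) || return 1
        # Overwrite the original only if sed finished without an error
        if sed -f "$script" "$f" > "$tmp"; then
            cat "$tmp" > "$f"         # keeps the original file's permissions
        else
            echo "sed failed on $f" >&2
        fi
        rm -f "$tmp"
    done
}
```

For example, apply_sed_to_html ./foo.sed /path/to/folder would run foo.sed over every .html file in that folder.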

Hope this'll help

theNbomr · 06-10-2010, 03:25 PM

This problem resolves into a problem of parsing HTML, which is a non-trivial exercise if it is to be done well. If there is much uncertainty at all about the formatting of your HTML, it is probably worthwhile to use something like Perl and one of the existing HTML parser modules.
--- rod.

diaco · 06-12-2010, 02:07 AM

This one works well. Thank you!