LinuxQuestions.org

diaco 06-10-2010 07:23 AM

Insert a comment in html file based on its contents
 
I have multiple HTML files in a folder. Each file contains an <h2> tag like this:
Code:

<h2>some text</h2>
I want to write a shell script/batch file that adds this comment to the <head> section of each file:
Code:

<!-- TITLE= "same text from h2 tag" -->
Note that in some files <h2>some text</h2> contains one or more line breaks, so I couldn't capture the tag content with a simple grep or similar.
For example:
<h2>first part of text
second line of text</h2>

The line break shouldn't appear in <!-- TITLE= "same text from h2 tag" -->.
The script has to capture the tag content and skip the line breaks.
Can anybody help me?

vonbiber 06-10-2010 10:03 AM

I would go this way:
1. write a sed script:

<code>
#!/bin/sed -f

:loop
N
$!b loop

s?<h2>?¢?
s?</h2>?£?

s?^\(.*\)\(</[hH][eE][aA][dD]>[^¢]*\)¢\([^£]*\)£?\1¢\3£\2<h2>\3</h2>?

s?\(¢\)\([^\n£]*\)\n?\1\2 ?
s?\(¢\)\([^\n£]*\)\n?\1\2 ?
s?\(¢\)\([^\n£]*\)\n?\1\2 ?

s?¢?<!-- TITLE= "?
s?£?" -->?
</code>

The first 3 lines read the whole input file into sed's pattern
space, so that '\n' (the newline character) can be treated as an
ordinary character.
I use the 'cent' and 'British pound' characters as delimiters
(these are unlikely to be found in an HTML file) to mark
what's between the '<h2>' and '</h2>' tags.
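
To see what the pattern space looks like at this point, the first five commands can be run on their own. This is only an illustrative sketch: the function name mark_h2 and the sample input are made up for the demonstration, and GNU sed is assumed.

```shell
#!/bin/sh
# mark_h2: hypothetical demo helper, not part of the script above.
# Reads HTML on stdin, gathers all lines into the pattern space and
# replaces the first <h2> and </h2> with the cent and pound markers.
mark_h2() {
    sed '
        :loop
        N
        $!b loop
        s?<h2>?¢?
        s?</h2>?£?
    '
}

# Sample input with a heading that spans two lines.
printf '<p>x</p>\n<h2>first part of text\nsecond line of text</h2>\n' | mark_h2
```

The line break inside the heading is still present at this stage; only the tags have been swapped for the markers.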

Then I place a copy of the contents just before the </head> tag,
surrounded by 'cent' and 'pound', and replace the 'cent' and 'pound'
further down (the ones that appear after </head>) with <h2> and </h2> again.

The next 3 lines each replace one newline character inside the marked text with a space, so up to three line breaks in the heading are handled.
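
Each of those three substitutions removes one line break, so a heading spread over more than four lines would keep some of them. As a sketch (not part of the script above), the same substitution can be repeated with sed's t command until no line break is left between the markers. This assumes GNU sed, which accepts \n inside a bracket expression; the function name join_marked is made up for the demonstration.

```shell
#!/bin/sh
# join_marked: hypothetical demo helper. Reads text on stdin in which the
# heading is already wrapped in the cent/pound markers and prints it with
# every line break between the markers replaced by a space.
join_marked() {
    sed '
        # Gather the whole input into the pattern space. With this form
        # of the loop a one-line input is still processed.
        :loop
        $!{
            N
            b loop
        }
        # Replace one line break after the cent marker, then try again
        # for as long as a substitution succeeds.
        :join
        s?\(¢[^\n£]*\)\n?\1 ?
        t join
    '
}

# A heading with four line breaks, one more than three fixed lines handle.
printf '¢one\ntwo\nthree\nfour\nfive£\n' | join_marked
```

Line breaks after the pound marker are left alone, because the bracket expression stops at that marker.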

The last 2 lines replace 'cent' and 'pound' with '<!-- TITLE= "' and
'" -->', respectively.

All you need to do is save the sed script, e.g. as foo.sed,
make it executable (chmod +x foo.sed), then run:

<code>
./foo.sed your_html_file > output_html_file
</code>
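
Since the question is about several files in one folder, the command above can be wrapped in a loop. The following is a sketch under assumptions: the function name apply_sed_to_dir is invented, the files are assumed to end in .html, and keeping a .bak copy of each original is my own precaution rather than something asked for in the thread.

```shell
#!/bin/sh
# apply_sed_to_dir: hypothetical helper, not from the thread.
# $1 = path to the sed script (e.g. foo.sed)
# $2 = folder containing the HTML files
# Each file is rewritten in place through a temporary file; the original
# is kept as <name>.bak so a bad run can be undone.
apply_sed_to_dir() {
    script=$1
    dir=$2
    for f in "$dir"/*.html; do
        [ -f "$f" ] || continue      # glob matched nothing
        cp "$f" "$f.bak" || return 1
        if sed -f "$script" "$f.bak" > "$f.tmp"; then
            mv "$f.tmp" "$f"
        else
            rm -f "$f.tmp"
            echo "sed failed on $f" >&2
        fi
    done
}

# Usage (both paths are placeholders):
# apply_sed_to_dir ./foo.sed /path/to/html/folder
```

Calling sed -f on the script file means it does not need to be executable, and the #! line at its top is simply read as a comment.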

Hope this'll help

theNbomr 06-10-2010 03:25 PM

This problem comes down to parsing HTML, which is a non-trivial exercise if it is to be done well. If there is much uncertainty at all about the formatting of your HTML, it is probably worthwhile to use something like Perl and one of the existing HTML parser modules.
--- rod.

diaco 06-12-2010 02:07 AM

Quote:

Originally Posted by vonbiber (Post 3999066)

This one works well. Thank you!

