LinuxQuestions.org
06-10-2010, 07:23 AM   #1
diaco
LQ Newbie
 
Registered: Oct 2009
Location: In a one-bed hotel room
Distribution: Ubuntu Karmic
Posts: 7

Insert a comment in html file based on its contents


I have multiple HTML files in a folder. There is an <h2> tag like this:
Code:
<h2>some text</h2>
in each file.
I want to write a shell script/batch file that adds this comment to the <head> section of each file:
Code:
<!-- TITLE= "same text from h2 tag" -->
Note that in some files <h2>some text</h2> contains one or more line breaks, so I couldn't capture the tag content using a simple grep or...
For example:
<h2>first part of text
second line of text</h2>

The line break shouldn't appear in <!-- TITLE= "same text from h2 tag" -->.
The script has to capture the tag content and skip line breaks.
Can anybody help me?

Last edited by diaco; 06-10-2010 at 07:47 AM.
 
06-10-2010, 10:03 AM   #2
vonbiber
Member
 
Registered: Apr 2009
Distribution: slackware 14.1 64-bit, slackware 14.2 64-bit, SystemRescueCD
Posts: 533

I would go this way:
1. write a sed script:

<code>
#!/bin/sed -f

:loop
N
$!b loop

s?<h2>?¢?
s?</h2>?£?

s?^\(.*\)\(</[hH][eE][aA][dD]>[^¢]*\)¢\([^£]*\)£?\1¢\3£\2<h2>\3</h2>?

:join
s?\(¢[^£]*\)\n?\1 ?
t join

s?¢?<!-- TITLE= "?
s?£?" -->?
</code>

The first three lines (:loop, N, $!b loop) read the whole input file
into the pattern space, so that '\n' (the newline character) can be
treated as an ordinary character.
I use the cent (¢) and British pound (£) characters as delimiters
(they are unlikely to appear in an HTML file) to mark what is
between the '<h2>' and '</h2>' tags. Since these s commands have no
g flag, only the first <h2>...</h2> pair is marked.

The long substitution then copies the marked text, still surrounded
by ¢ and £, to just before the </head> tag, and turns the ¢ and £
further down (the ones after </head>) back into <h2> and </h2>.

The next three lines (:join, s, t join) replace each newline inside the
copied text with a space, repeating until none is left.

The last two lines replace ¢ and £ with '<!-- TITLE= "' and
'" -->', respectively.
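The idea of treating the newline as an ordinary character can be tried in isolation before touching any file. The snippet below is only a sketch and uses a different, simpler technique than the sed script: tr flattens the file into one line and sed prints the heading text. The function name h2_text is made up for illustration, and the sketch assumes exactly one <h2>...</h2> pair per file with no other tags nested inside it.

```shell
#!/bin/sh
# Hypothetical helper: print the text of the <h2> element in file "$1"
# on a single line.
# Assumption: one <h2>...</h2> pair per file, no nested tags inside it.
h2_text() {
    # tr turns every newline into a space, so sed sees the file as one line;
    # sed -n ... p prints only the captured heading text.
    tr '\n' ' ' < "$1" | sed -n 's?.*<h2>\([^<]*\)</h2>.*?\1?p'
}
```

Running h2_text on one of the HTML files shows what would end up between the quotes of the TITLE comment.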

All you need to do is save the sed script, e.g. as foo.sed,
make it executable, then run it:

<code>
chmod +x foo.sed
./foo.sed your_html_file > output_html_file
</code>
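Since the question is about a whole folder of files, the same command can be wrapped in a loop. This is a minimal sketch, not part of the original answer: the function name apply_sed_to_dir and the directory names in the usage line are hypothetical, and the results go to a separate directory so the original files stay untouched.

```shell
#!/bin/sh
# Hypothetical helper: run a sed script over every .html file in a directory.
#   $1  path to the sed script (e.g. foo.sed)
#   $2  directory holding the original .html files
#   $3  directory that receives the rewritten copies
apply_sed_to_dir() {
    script=$1
    src=$2
    dest=$3

    mkdir -p "$dest" || return 1
    for f in "$src"/*.html; do
        # If nothing matches, the glob stays literal; skip it.
        [ -e "$f" ] || continue
        sed -f "$script" "$f" > "$dest/$(basename "$f")" || return 1
    done
}
```

For example: apply_sed_to_dir ./foo.sed ./pages ./out. Once the output has been checked, the files in the output directory can replace the originals.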

Hope this'll help
 
06-10-2010, 03:25 PM   #3
theNbomr
LQ 5k Club
 
Registered: Aug 2005
Distribution: OpenSuse, Fedora, Redhat, Debian
Posts: 5,399
Blog Entries: 2

This resolves into a problem of parsing HTML, which is a non-trivial exercise if it is to be done well. If there is much uncertainty at all about the formatting of your HTML, it is probably worthwhile to use something like Perl and one of its existing HTML parser modules.
--- rod.
 
06-12-2010, 02:07 AM   #4
diaco
LQ Newbie
 
Registered: Oct 2009
Location: In a one-bed hotel room
Distribution: Ubuntu Karmic
Posts: 7

Original Poster
Quote:
Originally Posted by vonbiber
This one works well. Thank you!
 
  

