LinuxQuestions.org
Old 04-14-2010, 12:06 PM   #1
melee
Member
 
Registered: Sep 2004
Location: Austin, TX
Distribution: Ubuntu, CentOS
Posts: 86

Rep: Reputation: 15
bash script to dynamically edit an html file


Hey all, I'm having a bit of a problem with a script I'm trying to write. I'll try to give as many details as possible without overwhelming anyone with huge code blocks....

Essentially, what I want the script to do is edit an HTML file based on the contents of that same file. I'll give an example. FYI, for any of you who are familiar with the HTML pages that Nessus creates, this should look familiar.

The file in question has lines like this:

Code:
	 <td class=default width="60%"><a href="#172_27_1_107">172.27.1.107</a></td>
	<td class=default width="40%">Security warning(s) found</td></tr>

and then way at the bottom of the file, it has lines like this:

Code:
<br>
172.27.1.107 resolves as generic.hostname.com.<br>

<br>

What I need to do is strip the ip address and hostname from the second stanza, and then edit the first stanza so it looks like this:

Code:
	 <td class=default width="30%"><a href="#172_27_1_107">172.27.1.107</a></td>
	<td class=default width="30%"><a href="#172_27_1_107">generic.hostname.com</a></td>
	<td class=default width="40%">Security warning(s) found</td></tr>
I've been (no pun intended) bashing my head against this for several days now and I haven't had any real luck. I can get the basics, sed replaces and sed appends, but the compare and replace is killing me.

Until now, I had been working on the premise of stripping the ip and hostname from stanza 2 and putting them in a separate file (let's call it hostnames.txt), and then running some sort of nested loop that would compare the ip in the first line of hostnames.txt with each line of the nessus.html file. If it found a match, it would attempt a sed replace and then an append based on what those lines look like. I assumed that no other lines would match, and therefore no replace or append would take place until it found the appropriate line. Unfortunately, the nested while loops didn't work, so I've tried rewriting the script multiple times in different ways but nothing is working for me.
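
A minimal sketch of the first half of that plan, assuming every lookup line has exactly the form shown above (ip, then "resolves as", then the hostname followed by ".<br>"). The function name extract_hostnames is made up for illustration; nessus.html and hostnames.txt are the names used in this post.

```shell
# extract_hostnames FILE: print one "ip hostname" pair per matching line, e.g.
#   172.27.1.107 resolves as generic.hostname.com.<br>
# becomes
#   172.27.1.107 generic.hostname.com
extract_hostnames() {
    sed -n 's/^\([0-9][0-9.]*\) resolves as \(.*\)\.<br>$/\1 \2/p' "$1"
}

# Example usage (file names as used in this thread):
#   extract_hostnames nessus.html > hostnames.txt
```

Lines that do not match the pattern are simply skipped, so the function can be pointed at the whole report.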

I'm relatively new to any bash script longer than 15 lines or so, so I would appreciate any "pointing in the right direction" that anyone can offer.

Thanks!

Mike
 
Old 04-14-2010, 12:23 PM   #2
Sergei Steshenko
Senior Member
 
Registered: May 2005
Posts: 4,481

Rep: Reputation: 453
And why are you trying to do this in 'bash' in the first place?

The book method is:
  1. parse (i.e. convert into a data structure);
  2. modify the data structure;
  3. reconstitute from the modified data structure.

For example, Perl has had HTML parser modules for years, so using an HTML parser in Perl you can do the job.

Or any other language with a decent HTML parser.
 
Old 04-14-2010, 01:01 PM   #3
melee
Member
 
Registered: Sep 2004
Location: Austin, TX
Distribution: Ubuntu, CentOS
Posts: 86

Original Poster
Rep: Reputation: 15
Hey Sergei, thanks for the quick reply.

I was doing this in bash for a couple of reasons. 1. As little as I know bash, I know Perl even less. I can usually "Forrest Gump" my way through a bash script, but I know absolutely zero Perl, and for the purposes of this project, I don't think I have time to learn it. And 2. None of the other guys who maintain our systems are Perl-savvy either, so if the script needed maintenance, it could conceivably be a hassle.

That said, when someone mentions an 'html parser', I tend to think of a piece of code that is 'html-aware'. i.e. it knows what opening and closing tags are, it knows how to pull hyperlinks out if asked, etc. I was treating this as just text parsing. Some of the text happens to be <'s and >'s, but it's all just text, right?

Am I thinking about this incorrectly?
 
Old 04-14-2010, 01:39 PM   #4
Sergei Steshenko
Senior Member
 
Registered: May 2005
Posts: 4,481

Rep: Reputation: 453
Quote:
Originally Posted by melee
...
That said, when someone mentions an 'html parser', I tend to think of a piece of code that is 'html-aware'. i.e. it knows what opening and closing tags are, it knows how to pull hyperlinks out if asked, etc. I was treating this as just text parsing. Some of the text happens to be <'s and >'s, but it's all just text, right?

Am I thinking about this incorrectly?
Yes, this is what an HTML parser is. I.e. it recognizes HTML constructs as they are defined in the standard.

Last edited by Sergei Steshenko; 04-14-2010 at 01:40 PM.
 
Old 04-14-2010, 02:09 PM   #5
melee
Member
 
Registered: Sep 2004
Location: Austin, TX
Distribution: Ubuntu, CentOS
Posts: 86

Original Poster
Rep: Reputation: 15
So, since I'm not necessarily concerned with HTML constructs, I should be able to do this with a few well placed greps and seds, right?

I can do the individual replaces or appends by hand without any problem. For example,

Code:
>:~/Desktop/nessusscript$ cat example.html 
	 <td class=default width="60%"><a href="#172_27_1_107">172.27.1.107</a></td>
	<td class=default width="40%">Security warning(s) found</td></tr>
>:~/Desktop/nessusscript$ sed -i 's/<td class=default width="60%"><a href="#172_27_1_107">172.27.1.107<\/a><\/td>/<td class=default width="30%"><a href="#172_27_1_107">172.27.1.107<\/a><\/td>/' example.html
>:~/Desktop/nessusscript$ cat example.html 
	 <td class=default width="30%"><a href="#172_27_1_107">172.27.1.107</a></td>
	<td class=default width="40%">Security warning(s) found</td></tr>
>:~/Desktop/nessusscript$ sed -i '/<td class=default width="30%"><a href="#172_27_1_107">172.27.1.107<\/a><\/td>/a\
\t<td class=default width="30%"><a href="#172_27_1_107">generic.hostname.com</a></td>' example.html
>:~/Desktop/nessusscript$ cat example.html 
	 <td class=default width="30%"><a href="#172_27_1_107">172.27.1.107</a></td>
	<td class=default width="30%"><a href="#172_27_1_107">generic.hostname.com</a></td>
	<td class=default width="40%">Security warning(s) found</td></tr>
>:~/Desktop/nessusscript$

In reality, I only need to search the file for two types of lines. One to strip out the hostnames and ip's, and one to search for lines to replace/append. The first search is done and tested. It's the second search that's causing me problems when I try to loop it.

So to my understanding, while this is an html file that I'm parsing, it really has very little to do with html itself and more to do with parsing strings. And bash should be more than capable, yes?
 
Old 04-14-2010, 03:55 PM   #6
Sergei Steshenko
Senior Member
 
Registered: May 2005
Posts: 4,481

Rep: Reputation: 453
Quote:
Originally Posted by melee
So, since I'm not necessarily concerned with HTML constructs, I should be able to do this with a few well placed greps and seds, right?
...
Wrong. HTML is not a line-oriented format. I.e. what is on one line one day can be spread over several lines another day, but the meaning, and thus the way the original HTML page is rendered, will stay the same.
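
A small illustration of that point, using the cell from this thread. The variable and function names are made up for the example; the second string has a line break inside the <a> tag, which a browser treats the same as a space.

```shell
# Same markup, once on one line and once with a line break inside the <a> tag.
one_line='<td class=default width="60%"><a href="#172_27_1_107">172.27.1.107</a></td>'
two_lines='<td class=default width="60%"><a
href="#172_27_1_107">172.27.1.107</a></td>'

# count_matches TEXT: number of lines in TEXT matched by a line-based pattern.
# "|| true" keeps the exit status at 0 when grep finds nothing.
count_matches() {
    printf '%s\n' "$1" | grep -c '<a href="#172_27_1_107">172\.27\.1\.107</a></td>' || true
}

count_matches "$one_line"     # prints 1
count_matches "$two_lines"    # prints 0: same HTML, but the pattern no longer matches
```

So a grep/sed approach only holds for as long as the generator keeps writing each cell on a single line.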
 
Old 04-14-2010, 06:30 PM   #7
melee
Member
 
Registered: Sep 2004
Location: Austin, TX
Distribution: Ubuntu, CentOS
Posts: 86

Original Poster
Rep: Reputation: 15
Ok, noted.

So eventually, I'll rewrite this script in a language that makes more sense (probably Python, as that's the direction my shop is taking).

But for now... Can anyone assist me in doing this in bash? Let's assume for the sake of this argument that the html won't change from what I've posted in this thread.

Anyone?
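
One way the loop could go under exactly that assumption (every ip cell sits on a single line and looks like the sample). This is only a sketch: add_hostname_cells is a hypothetical name, the hosts file is assumed to hold one "ip hostname" pair per line separated by a space, and it relies on GNU sed for -i and for \n in the replacement.

```shell
# add_hostname_cells HOSTS HTML
# For each "ip hostname" pair in HOSTS, change the 60% cell for that ip in HTML
# to 30% and add a second 30% cell carrying the hostname. Edits HTML in place.
add_hostname_cells() {
    hosts=$1
    html=$2
    while read -r ip name; do
        [ -n "$name" ] || continue                        # skip blank or incomplete lines
        ip_re=$(printf '%s\n' "$ip" | sed 's/\./\\./g')   # make the dots literal in the regex
        # \1 = everything before width=, \2 = the <a href=...> part, \3 = the closing tags
        sed -i "s|^\(.*\)width=\"60%\"\(><a href=\"#[^\"]*\">\)$ip_re\(</a></td>\)|\1width=\"30%\"\2$ip\3\n\1width=\"30%\"\2$name\3|" "$html"
    done < "$hosts"
}

# Example usage (file names as used in this thread):
#   add_hostname_cells hostnames.txt nessus.html
```

A single loop over the hosts file is enough because sed already walks every line of the HTML file; no nested loop is needed. Running it a second time changes nothing, since the 60% cells it looks for are gone after the first pass.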
 
Old 04-14-2010, 06:33 PM   #8
custangro
Senior Member
 
Registered: Nov 2006
Location: California
Distribution: Fedora , CentOS , Solaris 10, RHEL
Posts: 1,933
Blog Entries: 1

Rep: Reputation: 188
Quote:
Originally Posted by Sergei Steshenko
Wrong. HTML is not a line-oriented format. I.e. what is on one line one day can be spread over several lines another day, but the meaning, and thus the way the original HTML page is rendered, will stay the same.
Although Perl/PHP is preferred, it's not impossible to make HTML pages "dynamic" with any shell...

I have many pages written dynamically in ksh for our sites...

-C
 
Old 04-14-2010, 06:34 PM   #9
custangro
Senior Member
 
Registered: Nov 2006
Location: California
Distribution: Fedora , CentOS , Solaris 10, RHEL
Posts: 1,933
Blog Entries: 1

Rep: Reputation: 188
What have you written so far? Can you post your code?

-C
 
Old 04-14-2010, 08:47 PM   #10
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,412

Rep: Reputation: 1874
Whilst I agree with Sergei that bash may not be the best use here, I did notice the following (correct me if wrong):
Code:
<td class=default width="60%"><a href="#172_27_1_107">172.27.1.107</a></td>
Once this is found within the file it simply needs to become:
Code:
<td class=default width="30%"><a href="#172_27_1_107">172.27.1.107</a></td>
<td class=default width="30%"><a href="#172_27_1_107">generic.hostname.com</a></td>
i.e. the changes are the width going from 60% to 30% and the added hostname cell (marked in red in the original post).

If so, awk or sed could probably do this for you.
 
Old 04-15-2010, 07:32 AM   #11
melee
Member
 
Registered: Sep 2004
Location: Austin, TX
Distribution: Ubuntu, CentOS
Posts: 86

Original Poster
Rep: Reputation: 15
I agree wholeheartedly, grail. Unfortunately, that's exactly the problem I'm having. I need the script to look through each line of File 1 until it finds a line like:

Code:
<td class=default width="60%"><a href="#172_27_1_107">172.27.1.107</a></td>
and then strip out the ip address, compare it to the hostnames.txt file and pull out the hostname that corresponds to that ip (they'd be on the same line). Then the script would need to do an append of the second line depending on what hostname it found.

Should I be looking at awk for that functionality?

Thanks.
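
A sketch of those two lookups as small helpers, assuming hostnames.txt holds one "ip hostname" pair per line and the cell looks like the sample above. Both function names are hypothetical.

```shell
# ip_from_cell: read HTML lines on stdin, print the ip found inside an <a> cell.
# Lines without such a cell produce no output.
ip_from_cell() {
    sed -n 's/.*<a href="#[^"]*">\([0-9.]*\)<\/a><\/td>.*/\1/p'
}

# lookup_hostname IP HOSTS: print the hostname listed next to IP in HOSTS.
# Prints nothing if the ip is not listed.
lookup_hostname() {
    awk -v ip="$1" '$1 == ip { print $2; exit }' "$2"
}

# Example usage, with $line holding one line of the report:
#   ip=$(printf '%s\n' "$line" | ip_from_cell)
#   name=$(lookup_hostname "$ip" hostnames.txt)
```

The awk comparison is an exact string match on the first field, so 172.27.1.10 will not accidentally match 172.27.1.107 the way a plain grep for the ip could.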
 
Old 04-15-2010, 08:04 AM   #12
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,412

Rep: Reputation: 1874
Do we have to do the 60 to 30 change? I struggled with that.

Also, please supply one or two lines from the hostnames.txt file for comparison.
 
Old 04-15-2010, 08:49 AM   #13
melee
Member
 
Registered: Sep 2004
Location: Austin, TX
Distribution: Ubuntu, CentOS
Posts: 86

Original Poster
Rep: Reputation: 15
Sure, hostnames.txt would look like this:
Code:
172.27.1.107 generic.hostname.com
172.27.1.108 generic2.hostname.com

And the change from 60 to 30 isn't an issue for the purposes of this thread. I can do that with sed pretty quickly. For the sake of simplicity, we can say that I'd like the end result to go from this:

Code:
<td class=default width="60%"><a href="#172_27_1_107">172.27.1.107</a></td>
to this:

Code:
<td class=default width="60%"><a href="#172_27_1_107">172.27.1.107</a></td>
<td class=default width="60%"><a href="#172_27_1_107">generic.hostname.com</a></td>
 
Old 04-15-2010, 08:55 AM   #14
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,412

Rep: Reputation: 1874
Okay ... so see what ya think (found out a way for 60 to 30 too):
Code:
awk 'BEGIN{FS="[\\||>*<*]"}ARGV[1] == FILENAME{_[$1]=$2}ARGV[2] == FILENAME{if($5 in _){match($2,/[0-9]+/,pc);gsub(pc[0],pc[0]/2);print $0"\n"gensub($5,_[$5],2)}else print}' host html
The only difference here is that my host file uses a pipe "|" as the separator, which just made it a little clearer.
If you stay with a space, just change:
Code:
FS="[\\||>*<*]" to FS="[ |>*<*]"
Edit: Sorry, I just found that this does not work. Adding the space to FS also splits the second file on its spaces, so the field numbers change.
You will need a different delimiter in the host file.

Edit 2: it can work with a space delimiter if you use the bigger field numbers ($4 and $8 instead of $2 and $5), but this could get screwy:
Code:
awk 'BEGIN{FS="[ |>*<*]"}ARGV[1] == FILENAME{_[$1]=$2}ARGV[2] == FILENAME{if($8 in _){match($4,/[0-9]+/,pc);gsub(pc[0],pc[0]/2);print $0"\n"gensub($8,_[$8],2)}else print}' host html

Last edited by grail; 04-15-2010 at 09:04 AM.
 
Old 04-15-2010, 11:02 AM   #15
melee
Member
 
Registered: Sep 2004
Location: Austin, TX
Distribution: Ubuntu, CentOS
Posts: 86

Original Poster
Rep: Reputation: 15
Wow. Thanks grail. I haven't tried this out yet, as I may need some help figuring out where my filenames go. Do I just replace "host" and "html" at the end of the script with my hostnames.txt and nessus.html file, respectively? If so, does this script just output to stdout? That's fine if it does, I can always redirect, I just want to understand what this script is doing.

Google tells me what gsub and gensub are, but what is "pc" and "match"? Are those just variables? Or I guess in this case, maybe arrays?

Again, I'm just trying to understand what it is that the script does. I'd hate to walk away from this with a cool awk one-liner, but no knowledge of how it works.

Thanks.
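
On those questions: in the one-liner, host and html are just the two file names passed to awk, and awk writes its result to stdout, so redirecting works. match() is a built-in function; the three-argument form is a gawk extension that stores the matched text in the array given as the third argument, so pc[0] is the number that was found (60). gensub() is gawk-only as well, and _ is simply the name of the array that maps ip to hostname.

Below is a longer sketch of the same idea that sticks to portable awk. It is not grail's exact method: add_hostnames is a hypothetical name, the hosts file is assumed to be space-separated as in hostnames.txt, and the 60% to 30% change is hard-coded rather than computed.

```shell
# add_hostnames HOSTS HTML: print HTML with a hostname cell added after each ip cell.
# HOSTS holds "ip hostname" pairs, one per line. Nothing is edited in place.
add_hostnames() {
    awk '
        # First file: remember ip -> hostname.
        NR == FNR { host[$1] = $2; next }

        # Second file: does the line contain >a.b.c.d</a></td> ?
        match($0, />[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+<\/a><\/td>/) {
            ip = substr($0, RSTART + 1, RLENGTH - 10)      # drop ">" and "</a></td>"
            if (ip in host) {
                before = substr($0, 1, RSTART)             # up to and including ">"
                after  = substr($0, RSTART + RLENGTH - 9)  # "</a></td>" and the rest
                sub(/width="60%"/, "width=\"30%\"", before)
                print before ip after                      # original cell, now 30% wide
                print before host[ip] after                # new cell with the hostname
                next
            }
        }

        { print }   # every other line passes through unchanged
    ' "$1" "$2"
}

# Example usage (the output file name is made up):
#   add_hostnames hostnames.txt nessus.html > nessus_with_hosts.html
```

Splitting the line with match() and substr() instead of a field separator means leading tabs or spaces in the report do not shift any field numbers.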
 
  

