[SOLVED] bash script to dynamically edit an html file

melee · 04-14-2010, 12:06 PM

Hey all, I'm having a bit of a problem with a script I'm trying to write. I'll try to give as many details as possible without overwhelming anyone with huge code blocks....

Essentially what I want the script to do is edit an html file based off of the contents of that same file. I'll give an example. FYI, for any of you that are familiar with the html pages that nessus creates, this should look familiar.

The file in question has lines like this:

Code:

	 <td class=default width="60%"><a href="#172_27_1_107">172.27.1.107</a></td>
	<td class=default width="40%">Security warning(s) found</td></tr>

and then way at the bottom of the file, it has lines like this:

Code:

<br>
172.27.1.107 resolves as generic.hostname.com.<br>

<br>

What I need to do is strip the ip address and hostname from the second stanza, and then edit the first stanza so it looks like this:

Code:

	 <td class=default width="30%"><a href="#172_27_1_107">172.27.1.107</a></td>
	<td class=default width="30%"><a href="#172_27_1_107">generic.hostname.com</a></td>
	<td class=default width="40%">Security warning(s) found</td></tr>

I've been (no pun intended) bashing my head against this for several days now and I haven't had any real luck. I can get the basics, sed replaces and sed appends, but the compare and replace is killing me.

Until now, I had been working on the premise of stripping the ip and hostname from stanza 2 and putting them in a separate file (let's call it hostnames.txt), and then running some sort of nested loop that would compare the ip in the first line of hostnames.txt with each line of the nessus.html file. If it found a match, it would attempt a sed replace and then an append based on what those lines look like. I assumed that no other lines would match, and therefore no replace or append would take place until it found the appropriate line. Unfortunately, the nested while loops didn't work, so I've tried rewriting the script multiple times in different ways but nothing is working for me.
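
A minimal sketch of that extraction step, assuming every lookup line has exactly the form "IP resolves as HOSTNAME.<br>" shown above. The function name extract_hostnames is made up for illustration:

```shell
# Hypothetical helper: print "IP HOSTNAME" for each "resolves as" line.
extract_hostnames() {
    # -n: print nothing by default; the p flag prints only substituted lines.
    # \1 is the dotted address, \2 the hostname without its trailing dot.
    sed -n -E 's/^([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+) resolves as (.*)\.<br>$/\1 \2/p' "$1"
}

# Demo on a throwaway copy of the stanza from this post.
report=$(mktemp)
printf '%s\n' '<br>' '172.27.1.107 resolves as generic.hostname.com.<br>' '' '<br>' > "$report"
extract_hostnames "$report"    # redirect to hostnames.txt in real use
rm -f "$report"
```

Lines with trailing whitespace or Windows line endings would not match this pattern and would need to be cleaned up first.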

I'm relatively new to any bash script longer than 15 lines or so, so I would appreciate any "pointing in the right direction" that anyone can offer.

Thanks!

Mike

Sergei Steshenko · 04-14-2010, 12:23 PM

And why are you trying to do this in 'bash' in the first place ?

The book method is:

parse (i.e. convert into a data structure);
modify the data structure;
reconstitute from the modified data structure.

For example, Perl has had HTML parser modules for years, so using an HTML parser in Perl you can do the job.

Or any other language with a decent HTML parser.

melee · 04-14-2010, 01:01 PM

Hey Sergei, thanks for the quick reply.

I was doing this in bash for a couple of reasons. 1. As little as I know bash, I know perl even less. I can usually "Forrest Gump" my way through a bash script, but I know absolutely zero perl, and for the purposes of this project, I don't think I have time to learn it. And 2. none of the other guys who maintain our systems are perl-savvy either, so if the script needed maintenance, it could conceivably be a hassle.

That said, when someone mentions an 'html parser', I tend to think of a piece of code that is 'html-aware'. i.e. it knows what opening and closing tags are, it knows how to pull hyperlinks out if asked, etc. I was treating this as just text parsing. Some of the text happens to be <'s and >'s, but it's all just text, right?

Am I thinking about this incorrectly?

Sergei Steshenko · 04-14-2010, 01:39 PM

Quote:

Originally Posted by melee

...
That said, when someone mentions an 'html parser', I tend to think of a piece of code that is 'html-aware'. i.e. it knows what opening and closing tags are, it knows how to pull hyperlinks out if asked, etc. I was treating this as just text parsing. Some of the text happens to be <'s and >'s, but it's all just text, right?

Am I thinking about this incorrectly?

Yes, this is what an HTML parser is. I.e. it recognizes HTML constructs as they are defined in the standard.

melee · 04-14-2010, 02:09 PM

So, since I'm not necessarily concerned with HTML constructs, I should be able to do this with a few well placed greps and seds, right?

I can do the individual replaces or appends by hand without any problem. For example,

Code:

>:~/Desktop/nessusscript$ cat example.html 
	 <td class=default width="60%"><a href="#172_27_1_107">172.27.1.107</a></td>
	<td class=default width="40%">Security warning(s) found</td></tr>
>:~/Desktop/nessusscript$ sed -i 's/<td class=default width="60%"><a href="#172_27_1_107">172.27.1.107<\/a><\/td>/<td class=default width="30%"><a href="#172_27_1_107">172.27.1.107<\/a><\/td>/' example.html
>:~/Desktop/nessusscript$ cat example.html 
	 <td class=default width="30%"><a href="#172_27_1_107">172.27.1.107</a></td>
	<td class=default width="40%">Security warning(s) found</td></tr>
>:~/Desktop/nessusscript$ sed -i '/<td class=default width="30%"><a href="#172_27_1_107">172.27.1.107<\/a><\/td>/a\
\t<td class=default width="30%"><a href="#172_27_1_107">generic.hostname.com<\/a><\/td>' example.html
>:~/Desktop/nessusscript$ cat example.html 
	 <td class=default width="30%"><a href="#172_27_1_107">172.27.1.107</a></td>
	<td class=default width="30%"><a href="#172_27_1_107">generic.hostname.com</a></td>
	<td class=default width="40%">Security warning(s) found</td></tr>
>:~/Desktop/nessusscript$

In reality, I only need to search the file for two types of lines. One to strip out the hostnames and ip's, and one to search for lines to replace/append. The first search is done and tested. It's the second search that's causing me problems when I try to loop it.
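
The by-hand replace above names one address explicitly. A sketch of the same replace for any address, assuming the IP cells always look like the one posted here (narrow_ip_cells is a made-up name):

```shell
# Hypothetical helper: narrow every IP cell from 60% to 30%, whatever the address.
narrow_ip_cells() {
    # "|" is the s delimiter, so the "/" in </a></td> needs no escaping.
    # \1 puts back everything captured after the width attribute.
    sed -E 's|width="60%"(><a href="#[0-9_]+">[0-9.]+</a></td>)|width="30%"\1|' "$1"
}

# Demo on a throwaway file with the two lines from example.html.
example=$(mktemp)
printf '%s\n' \
    '<td class=default width="60%"><a href="#172_27_1_107">172.27.1.107</a></td>' \
    '<td class=default width="40%">Security warning(s) found</td></tr>' > "$example"
narrow_ip_cells "$example"
rm -f "$example"
```

Adding -i to the sed call would edit the file in place, as in the transcript above.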

So to my understanding, while this is an html file that I'm parsing, it really has very little to do with html itself and more to do with parsing strings. And bash should be more than capable, yes?

Sergei Steshenko · 04-14-2010, 03:55 PM

Quote:

Originally Posted by melee

So, since I'm not necessarily concerned with HTML constructs, I should be able to do this with a few well placed greps and seds, right?
...

Wrong. HTML is not a line-oriented format. I.e. what is on one line one day can be spread over several lines another day, but the meaning, and thus the way the HTML page is rendered, will stay the same.
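
A small sketch of that point, using the table cell from this thread. The variable names are made up, and the second layout is a hypothetical reformatting of the same cell:

```shell
# Two snippets that a browser renders identically; only the line breaks differ.
one_line='<td class=default width="60%"><a href="#172_27_1_107">172.27.1.107</a></td>'
split_lines='<td class=default width="60%">
<a href="#172_27_1_107">172.27.1.107</a>
</td>'

# A line-based pattern that expects the whole cell on a single line.
pattern='<td class=default width="60%"><a href="#[0-9_]*">[0-9.]*</a></td>'

# grep -c counts matching lines. grep exits non-zero when nothing matches,
# hence the "|| true".
count_one=$(printf '%s\n' "$one_line" | grep -c "$pattern" || true)
count_split=$(printf '%s\n' "$split_lines" | grep -c "$pattern" || true)
echo "single-line layout: $count_one matching line(s)"
echo "multi-line layout:  $count_split matching line(s)"
```

The same cell is found in the first layout and missed in the second, which is the risk with any grep or sed approach.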

melee · 04-14-2010, 06:30 PM

Ok, noted.

So eventually, I'll rewrite this script into a language that makes more sense (probably python as that's the direction my shop is taking).

But for now... Can anyone assist me in doing this in bash? Let's assume for the sake of this argument that the html won't change from what I've posted in this thread.

Anyone?

custangro · 04-14-2010, 06:33 PM

Quote:

Originally Posted by Sergei Steshenko

Wrong. HTML is not a line-oriented format. I.e. what is on one line one day can be spread over several lines another day, but the meaning, and thus the way the HTML page is rendered, will stay the same.

Although perl/php is preferred, it's not impossible to make HTML pages "dynamic" with any shell...

I have many pages written dynamically in ksh for our sites...

-C

custangro · 04-14-2010, 06:34 PM

What have you written so far? Can you post your code?

-C

grail · 04-14-2010, 08:47 PM

Whilst I agree with Sergei that bash may not be the best tool here, I did notice the following (correct me if I'm wrong):

Code:

<td class=default width="60%"><a href="#172_27_1_107">172.27.1.107</a></td>

Once this is found within the file it simply needs to become:

Code:

<td class=default width="30%"><a href="#172_27_1_107">172.27.1.107</a></td>
<td class=default width="30%"><a href="#172_27_1_107">generic.hostname.com</a></td>

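i.e. the changes are the width on the first line (60% to 30%) and the added second line, which carries the hostname instead of the IP.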

If so, awk or sed could probably do this for you.

melee · 04-15-2010, 07:32 AM

I agree wholeheartedly grail. Unfortunately, that's exactly the problem I'm having. I need the script to look through each line of File 1 until it finds a line like:

Code:

<td class=default width="60%"><a href="#172_27_1_107">172.27.1.107</a></td>

and then strip out the ip address, compare it to the hostnames.txt file and pull out the hostname that corresponds to that ip (they'd be on the same line). Then the script would need to do an append of the second line depending on what hostname it found.
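
One way to sketch that in plain bash, assuming bash 4 or later (for associative arrays), a hostnames.txt with one "IP HOSTNAME" pair per line, and IP cells laid out exactly as posted in this thread. The function name add_hostname_cells is made up:

```shell
#!/usr/bin/env bash
# Sketch only: reads the hosts file once, then makes a single pass over the HTML.
add_hostname_cells() {
    local hosts_file=$1 html_file=$2
    local ip host line
    declare -A host_for    # associative array (bash 4+): IP -> hostname

    # Pass 1: load the "IP HOSTNAME" pairs.
    while read -r ip host; do
        if [[ -n $ip ]]; then
            host_for[$ip]=$host
        fi
    done < "$hosts_file"

    # Pass 2: copy every line through; after a cell whose link text is a
    # known IP, print a copy of that cell with the hostname as link text.
    local cell_re='^(.*<a href="#[0-9_]+">)([0-9.]+)(</a></td>)$'
    while IFS= read -r line || [[ -n $line ]]; do
        printf '%s\n' "$line"
        if [[ $line =~ $cell_re ]]; then
            ip=${BASH_REMATCH[2]}
            if [[ -n ${host_for[$ip]:-} ]]; then
                printf '%s%s%s\n' "${BASH_REMATCH[1]}" "${host_for[$ip]}" "${BASH_REMATCH[3]}"
            fi
        fi
    done < "$html_file"
}

# Usage: add_hostname_cells hostnames.txt nessus.html > nessus.new.html
```

Loading the pairs into an array first avoids the nested loops: each HTML line needs only one lookup instead of a second pass over hostnames.txt.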

Should I be looking at awk for that functionality?

Thanks.

grail · 04-15-2010, 08:04 AM

Do we have to do the 60 to 30 change? I struggled with that.

Also, please supply one or two lines from the hostnames.txt file for comparison.

melee · 04-15-2010, 08:49 AM

Sure, hostnames.txt would look like this:

Code:

172.27.1.107 generic.hostname.com
172.27.1.108 generic2.hostname.com

And the change from 60 to 30 isn't an issue for the purposes of this thread. I can do that with sed pretty quickly. For the sake of simplicity, we can say that I'd like the end result to go from this:

Code:

<td class=default width="60%"><a href="#172_27_1_107">172.27.1.107</a></td>

to this:

Code:

<td class=default width="60%"><a href="#172_27_1_107">172.27.1.107</a></td>
<td class=default width="60%"><a href="#172_27_1_107">generic.hostname.com</a></td>

grail · 04-15-2010, 08:55 AM

Okay ... so see what ya think (found out a way for 60 to 30 too):

Code:

awk 'BEGIN{FS="[\\||>*<*]"}ARGV[1] == FILENAME{_[$1]=$2}ARGV[2] == FILENAME{if($5 in _){match($2,/[0-9]+/,pc);gsub(pc[0],pc[0]/2)}print $0"\n"gensub($5,_[$5],2)}' host html

Only diff here is my host file is separated by a pipe "|", just made it a little clearer.
If you stay with space just change:

Code:

FS="[\\||>*<*]" to FS="[ |>*<*]"

Edit: Sorry, just found that switching to a space does not work, because the lines of the second file also contain spaces.
You will need a different delimiter in the host file.

Edit 2: It can work with a space if you use higher field numbers, but this could get screwy:

Code:

awk 'BEGIN{FS="[ |>*<*]"}ARGV[1] == FILENAME{_[$1]=$2}ARGV[2] == FILENAME{if($8 in _){match($4,/[0-9]+/,pc);gsub(pc[0],pc[0]/2)}print $0"\n"gensub($8,_[$8],2)}' host html
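
For readability, here is a longer sketch of the same idea that sticks to POSIX awk features (the one-liners above rely on gawk's gensub and three-argument match). It assumes a space-separated host file with "IP HOSTNAME" on each line and IP cells laid out as posted in this thread; the wrapper name is made up:

```shell
# Hypothetical wrapper: $1 is the host file, $2 the HTML report.
add_hostname_cells_awk() {
    awk '
    # First file: NR == FNR is only true while awk reads the first argument.
    NR == FNR { host[$1] = $2; next }

    # Second file: look for a dotted IP used as link text.
    {
        line = $0
        if (match(line, />[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+<\/a>/)) {
            # The match covers ">", the IP, and "</a>" (1 + length + 4 chars).
            ip = substr(line, RSTART + 1, RLENGTH - 5)
            if (ip in host) {
                before = substr(line, 1, RSTART)             # through the ">"
                after  = substr(line, RSTART + RLENGTH - 4)  # from "</a>" on
                sub(/width="60%"/, "width=\"30%\"", before)
                print before ip after          # original cell, now 30% wide
                print before host[ip] after    # new cell with the hostname
                next
            }
        }
        print line    # every other line passes through unchanged
    }' "$1" "$2"
}
```

Called as add_hostname_cells_awk hostnames.txt nessus.html > nessus.new.html. awk writes to stdout and does not modify either input, so the result should go to a new file rather than back onto the report. One caveat: with an empty host file the NR == FNR test would also be true for the report.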

melee · 04-15-2010, 11:02 AM

Wow. Thanks grail. I haven't tried this out yet, as I may need some help figuring out where my filenames go.

Do I just replace "host" and "html" at the end of the script with my hostnames.txt and nessus.html file, respectively? If so, does this script just output to stdout? That's fine if it does, I can always redirect, I just want to understand what this script is doing.

Google tells me what gsub and gensub are, but what is "pc" and "match"? Are those just variables? Or I guess in this case, maybe arrays?
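
For reference: match is a built-in awk function that searches a string for a regular expression, and in gawk an optional third argument names an array that receives the matched text, so pc is just an array name chosen in the one-liner. As far as I can tell, pc[0] ends up holding the "60" from the width attribute. A small sketch of the same step using the portable RSTART and RLENGTH variables instead of the gawk array (the function name is made up):

```shell
# Hypothetical demo: find the first run of digits in a string and halve it.
first_number_halved() {
    printf '%s\n' "$1" | awk '{
        if (match($0, /[0-9]+/)) {
            n = substr($0, RSTART, RLENGTH)   # what gawk would store in pc[0]
            print n, n / 2
        }
    }'
}

first_number_halved 'td class=default width="60%"'
```

match sets RSTART to the position of the match and RLENGTH to its length, so substr can cut the matched text back out.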

Again, I'm just trying to understand what it is that the script does. I'd hate to walk away from this with a cool awk one-liner, but no knowledge of how it works.

Thanks.