bash script to dynamically edit an html file
Hey all, I'm having a bit of a problem with a script I'm trying to write. I'll try to give as many details as possible without overwhelming anyone with huge code blocks....
Essentially what I want the script to do is edit an html file based off of the contents of that same file. I'll give an example. FYI, for any of you that are familiar with the html pages that nessus creates, this should look familiar. The file in question has lines like this: Code:
<td class=default width="60%"><a href="#172_27_1_107">172.27.1.107</a></td> and then way at the bottom of the file, it has lines like this: Code:
<br> What I need to do is strip the ip address and hostname from the second stanza, and then edit the first stanza so it looks like this: Code:
<td class=default width="30%"><a href="#172_27_1_107">172.27.1.107</a></td>
Until now, I had been working on the premise of stripping the IP and hostname from stanza 2 and putting them in a separate file (let's call it hostnames.txt), and then running some sort of nested loop that would compare the IP in the first line of hostnames.txt with each line of the nessus.html file. If it found a match, it would attempt a sed replace and then an append based on what those lines look like. I assumed that no other lines would match, and therefore no replace or append would take place until it found the appropriate line.
Unfortunately, the nested while loops didn't work, so I've tried rewriting the script multiple times in different ways, but nothing is working for me. I'm relatively new to any bash script longer than 15 lines or so, so I would appreciate any "pointing in the right direction" that anyone can offer. Thanks! Mike |
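For reference, the approach described above can be done with a single loop over hostnames.txt, letting sed scan the whole report on each pass, so no inner loop over the HTML is needed. This is only a sketch under assumptions: the function name add_hostnames is made up, and the shape of the appended line (a second cell with the hostname as the link text) is a guess, since the target markup isn't fully shown in the post.

```bash
#!/usr/bin/env bash
# Sketch: for every "IP hostname" pair, find the table cell that links to
# that IP, change its width from 60% to 30%, and append a copy of the cell
# whose link text is the hostname. The appended markup is an ASSUMPTION.

add_hostnames() {
    local hosts=$1 html=$2 work ip name ip_re
    work=$(mktemp) || return 1
    cp "$html" "$work"

    while read -r ip name; do
        # Skip blank or incomplete lines in the hostnames file.
        [ -n "$ip" ] && [ -n "$name" ] || continue

        # Escape the dots so sed matches them literally.
        ip_re=$(printf '%s' "$ip" | sed 's/\./\\./g')

        # On the matching line: fix the width, print it, then turn the
        # pattern space into the hostname cell (printed automatically).
        sed "/>${ip_re}<\/a><\/td>/{
s/width=\"60%\"/width=\"30%\"/
p
s/>${ip_re}</>${name}</
}" "$work" > "$work.new" && mv "$work.new" "$work"
    done < "$hosts"

    cat "$work"
    rm -f "$work"
}

# Usage (the result goes to stdout):
#   add_hostnames hostnames.txt nessus.html > nessus.new.html
```

Each pass rewrites a working copy, so the original report is never modified. A hostname containing `/` or `&` would need escaping before being dropped into the sed replacement.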
Quote:
The book method is:
For example, Perl has had HTML parser modules for years, so using an HTML parser in Perl you can do the job. Or any other language with a decent HTML parser. |
Hey Sergei, thanks for the quick reply.
I was doing this in bash for a couple of reasons. 1. As little as I know bash, I know Perl even less. I can usually "Forrest Gump" my way through a bash script, but I know absolutely zero Perl, and for the purposes of this project, I don't think I have time to learn it. And 2. None of the other guys who maintain our systems are Perl-savvy either, so if the script needed maintenance, it could conceivably be a hassle. That said, when someone mentions an 'html parser', I tend to think of a piece of code that is 'html-aware', i.e. it knows what opening and closing tags are, it knows how to pull hyperlinks out if asked, etc. I was treating this as just text parsing. Some of the text happens to be <'s and >'s, but it's all just text, right? Am I thinking about this incorrectly? |
Quote:
|
So, since I'm not necessarily concerned with HTML constructs, I should be able to do this with a few well placed greps and seds, right?
I can do the individual replaces or appends by hand without any problem. For example, Code:
>:~/Desktop/nessusscript$ cat example.html
In reality, I only need to search the file for two types of lines: one to strip out the hostnames and IPs, and one to find the lines to replace/append. The first search is done and tested. It's the second search that's causing me problems when I try to loop it. So to my understanding, while this is an HTML file that I'm parsing, it really has very little to do with HTML itself and more to do with parsing strings. And bash should be more than capable, yes? |
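As a sketch of the "it's all just text" view, the second search can be one sed pattern keyed on the cell markup quoted earlier in the thread. The function name list_table_ips is hypothetical, and the pattern assumes the report's lines look exactly like the posted sample.

```bash
#!/usr/bin/env bash
# Sketch: print the IP shown in each summary-table link, i.e. the lines
# that would later need the replace/append. Assumes the markup matches
# the sample line from the thread exactly.

list_table_ips() {
    sed -n 's/.*<td class=default width="60%"><a href="#[0-9_]*">\([0-9.]*\)<\/a><\/td>.*/\1/p' "$1"
}

# Usage:
#   list_table_ips nessus.html
```

Because `-n` suppresses normal output and only substituted lines are printed, every other line in the report is ignored.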
Quote:
|
Ok, noted.
So eventually, I'll rewrite this script into a language that makes more sense (probably python as that's the direction my shop is taking). But for now... Can anyone assist me in doing this in bash? Let's assume for the sake of this argument that the html won't change from what I've posted in this thread. Anyone? |
Quote:
I have many pages written dynamically in ksh for our sites... -C |
Quote:
-C |
Whilst I agree with Sergei that bash may not be the best use here, I did notice the following (correct me if wrong):
Code:
<td class=default width="60%"><a href="#172_27_1_107">172.27.1.107</a></td> Code:
<td class=default width="30%"><a href="#172_27_1_107">172.27.1.107</a></td> If so, awk or sed could probably do this for you. |
I agree wholeheartedly grail. Unfortunately, that's exactly the problem I'm having. I need the script to look through each line of File 1 until it finds a line like:
Code:
<td class=default width="60%"><a href="#172_27_1_107">172.27.1.107</a></td> Should I be looking at awk for that functionality? Thanks. |
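awk can do this kind of lookup in one pass: read hostnames.txt into an array first, then check each line of the report against it. The following is a minimal sketch, not a tested solution for real Nessus output. The names merge_hostnames and host are illustrative, the appended line (same cell, hostname as link text) is an assumption, and the width is left untouched.

```bash
#!/usr/bin/env bash
# Sketch: two-file awk lookup. The first file maps IP -> hostname, the
# second is the HTML report. The appended cell markup is an ASSUMPTION.

merge_hostnames() {
    awk '
        # First file (hostnames.txt): remember the hostname for each IP.
        NR == FNR { host[$1] = $2; next }

        # Second file (the report): only the summary-table link lines.
        /<td class=default width="60%"><a href="#/ {
            ip = $0
            sub(/<\/a><\/td>.*/, "", ip)   # drop everything from </a></td> on
            sub(/.*>/, "", ip)             # drop everything up to the last >
            if (ip in host) {
                print                      # original line, unchanged
                n = index($0, ">" ip "<")  # literal search, no regex
                print substr($0, 1, n) host[ip] substr($0, n + 1 + length(ip))
                next
            }
        }

        # Everything else passes through untouched.
        { print }
    ' "$1" "$2"
}

# Usage (the result goes to stdout):
#   merge_hostnames hostnames.txt nessus.html > nessus.new.html
```

The `NR == FNR` test is only true while awk is reading the first file, which is what lets one program treat the two inputs differently. Using index() and substr() instead of sub() for the hostname swap avoids the dots in the IP being treated as regex wildcards.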
Do we have to do the 60 to 30 change? I struggled with that.
Also, please supply one or two lines from the hostnames.txt file for comparison. |
Sure, hostnames.txt would look like this:
Code:
172.27.1.107 generic.hostname.com
And the change from 60 to 30 isn't an issue for the purposes of this thread. I can do that with sed pretty quickly. To keep things simple, we can say that I'd like the end result to go from this: Code:
<td class=default width="60%"><a href="#172_27_1_107">172.27.1.107</a></td> Code:
<td class=default width="60%"><a href="#172_27_1_107">172.27.1.107</a></td> |
Okay ... so see what ya think (found out a way for 60 to 30 too):
Code:
awk 'BEGIN{FS="[\\||>*<*]"}ARGV[1] == FILENAME{_[$1]=$2}ARGV[2] == FILENAME{if($5 in _){match($2,/[0-9]+/,pc);gsub(pc[0],pc[0]/2)}print $0"\n"gensub($5,_[$5],2)}' host html
If you stay with space just change: Code:
FS="[\\||>*<*]" to FS="[ |>*<*]"
You will need a delimiter in the host file. Edit 2: this can work with bigger numbers, but it could get screwy :( Code:
awk 'BEGIN{FS="[ |>*<*]"}ARGV[1] == FILENAME{_[$1]=$2}ARGV[2] == FILENAME{if($8 in _){match($4,/[0-9]+/,pc);gsub(pc[0],pc[0]/2)}print $0"\n"gensub($8,_[$8],2)}' host html |
Wow. Thanks grail. I haven't tried this out yet, as I may need some help figuring out where my filenames go. :) Do I just replace "host" and "html" at the end of the script with my hostnames.txt and nessus.html file, respectively? If so, does this script just output to stdout? That's fine if it does, I can always redirect, I just want to understand what this script is doing.
Google tells me what gsub and gensub are, but what is "pc" and "match"? Are those just variables? Or I guess in this case, maybe arrays? Again, I'm just trying to understand what it is that the script does. I'd hate to walk away from this with a cool awk one-liner, but no knowledge of how it works. :) Thanks. |
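A few notes on the one-liner, as far as it can be read from the code itself. `host` and `html` are just the two input file names, so hostnames.txt and nessus.html would go there, and since the script only uses print, the result goes to stdout. `match` is a built-in awk function, not a variable. With a third argument (a gawk extension) it fills that array with the matched text, so `pc` is an array and `pc[0]` holds the first run of digits, here the 60. `gsub(pc[0],pc[0]/2)` then replaces every occurrence of that text in the line with half its value, which is probably why it "could get screwy" if the same digits also appear in an IP. `gensub` is also gawk-only. It returns a new string instead of editing the line, and the `2` means "replace only the second match". `_` is simply an array name, keyed by IP.

The sketch below shows the same "find the first number and halve it" step using only POSIX awk, where match() sets RSTART and RLENGTH instead of filling an array. The function name halve_first_number is made up for the example.

```bash
#!/usr/bin/env bash
# Sketch: POSIX equivalent of match($0, /[0-9]+/, pc) followed by halving
# pc[0], but touching only the FIRST number on the line.

halve_first_number() {
    awk '{
        # match() returns the position of the first match (0 if none) and
        # sets RSTART (start position) and RLENGTH (length of the match).
        if (match($0, /[0-9]+/)) {
            pc = substr($0, RSTART, RLENGTH)   # the matched digits
            $0 = substr($0, 1, RSTART - 1) (pc / 2) substr($0, RSTART + RLENGTH)
        }
        print
    }'
}

# Usage (reads stdin, writes stdout):
#   halve_first_number < nessus.html
```

Rebuilding the line with substr() changes only the matched span, so a 60 elsewhere on the line is left alone.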