force grep to keep its place in file for next iteration?

jeffreybluml · 05-12-2005, 07:51 AM

Hmmm, not sure how to put this...

I have a large HTML file with multiple phone number results in it. For each phone number listing, the name (e.g. Taco Bell) is separated from the corresponding phone number by many lines of HTML code. So, I'm grepping through it and trying to grab each name, then grab the phone number, then the next name, then phone, etc...

I'm doing this in an until loop, and currently grep grabs the exact same entry each iteration.

How can I make grep resume looking where it left off, or ignore the previously found entries?

Also, can I get grep to ignore duplicate entries? I swear I saw something about duplicate lines while reading the man page, but I can't find it now...

Here's the code:

Code:

x=1;
until [ $x -eq 30 ];
do 
let "x = $x + 1";
grep -m 1 -B 3 '([[:alnum:]][[:alnum:]][[:alnum:]]) [[:alnum:]][[:alnum:]][[:alnum:]]-[[:alnum:]][[:alnum:]][[:alnum:]][[:alnum:]]' /var/www/html/files/phone.html | sed 's/<br/<br\//g' | sed 's/--/<br\/>/g' | sed 's/&\ //g' >> /var/www/html/files/results.inc && grep -m 1 listingname /var/www/html/files/phone.html | sed 's/<span\ class=\"listingname\">//' | sed 's/<\/span>/<br\/>/' >> /var/www/html/files/results.inc;
done;

Any suggestions?

Thanks as always!

theYinYeti · 05-12-2005, 07:59 AM

I've always found that Unix tools are hard to use, and unreliable at best, where XML-related files (such as HTML) are concerned. So:
- if your document is a well-formed XML document (e.g. an XHTML doc), then you might prefer using an XML parser, or something along the lines of my xpathRead (http://yves.gablin.club.fr/pc/www.php?lang=fr (French)).
- else you will probably have to pre-process your file, so that it is better suited to Unix tools.
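
One way to read the pre-processing idea, as a minimal sketch only: put every tag and every piece of text on its own line, so that line-oriented tools see one element per line. The sample markup and the split_tags helper are made up for illustration; the real page may be laid out differently.

```shell
#!/bin/sh
# Hypothetical sample input; the real page layout is an assumption.
sample='<div><span class="listingname">Taco Bell</span><br>123 Main St<br><b>(952) 758-5252</b></div>'

# Break the markup so each tag and each text run sits on its own line,
# then drop the empty lines that the splitting leaves behind.
split_tags() {
    awk '{ gsub(/</, "\n<"); gsub(/>/, ">\n"); print }' | awk 'NF'
}

flat=$(printf '%s\n' "$sample" | split_tags)
printf '%s\n' "$flat"
```

After this step a plain grep for the name or the number no longer drags half the page along with it.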

Yves

PTrenholme · 05-12-2005, 08:06 AM

Try one of the other text processing tools. I'd suggest either gawk or perl. AWK is probably easier . . .
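
For what it's worth, the loop keeps returning the same entry because every grep call opens the file from the first line again, so -m 1 always stops at the same first match. A rough sketch of the single-pass alternative follows: run grep once and let the loop consume the matches in order. The sample file contents are invented, and sort -u is shown as one possible answer to the duplicates question.

```shell
#!/bin/sh
# Hypothetical sample file standing in for phone.html.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<span class="listingname">Taco Bell</span>
<b>(952) 758-5252</b>
<span class="listingname">Taco Bell Express</span>
<b>(763) 542-9219</b>
<span class="listingname">Taco Bell</span>
<b>(952) 758-5252</b>
EOF

# One grep pass yields every match in file order; sed strips the tags.
names=$(grep 'listingname' "$tmp" | sed 's/<[^>]*>//g')

# The loop takes the matches one at a time, so nothing is found twice
# by re-reading the file from the top.
count=0
while IFS= read -r name; do
    count=$((count + 1))
    printf '%d: %s\n' "$count" "$name"
done <<EOF
$names
EOF

# sort -u drops duplicate lines (uniq does the same for adjacent lines).
unique=$(printf '%s\n' "$names" | sort -u)
printf '%s\n' "$unique"
rm -f "$tmp"
```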

jeffreybluml · 05-12-2005, 08:13 AM

Okay, thanks. In an effort to reduce my learning curve, anybody feel up to showing me how I'd accomplish this in awk?

If you'd like to see what the html page I'm stripping looks like, here's an example page (it's a search from dexonline.com)

http://dexonline.com/servlet/ActionS...dingAreas=true

Greatly appreciate it...

PTrenholme · 05-12-2005, 10:11 PM

OK, here's an example:

Code:

$ cat tmp/test.awk
# Find the (next) listingname
/<span class=\"listingname\">/ {
# Get the name from the field. NOTE: Assumes the name is in the same physical line as "listingname"
    name = "--Unnamed--";
    if (match($0, /(>)([^<]+)(<\/span>)/, vals) > 0) {
      name = vals[2];}
# Start searching for the 'phone number
    while (getline > 0) {
# No number found if we've hit the next "listingname"
      if ($0 ~ /<span class=\"listingname\">/ ) {
        print name ": No 'phone number listed";
        name = "--Unnamed--";
        if (match($0, /(>)([^<]+)(<\/span>)/, vals) > 0) {
          name = vals[2];}
        if (getline <= 0) {
          break;
        }
      }
# Look for the "bold" delimiter followed by an open paren.  (This is, of course, a kludge.)
      if ($0 ~ /<b>\(/) {
#Break the line apart by the HTML delimiters
        FS="[<>]";
        $0 = $0;
# And the 'phone number should be the third field
        phone = (NF > 3) ? $3 : "Unspecified";
        print name ": " phone;
# Restore the field separator
        FS=" ";
# And go back to find the next listingname . . .
        break;
      }
    }
  }

And here's the output from your example html:

Code:

$ gawk -f tmp/test.awk Downloads/html/ActionServlet.html
Holiday/Taco Bell: (952) 758-5252
Taco Bell: (651) 982-0434
Taco Bell: (715) 386-3006
Taco Bell: (952) 470-8909
Taco Bell/Long John Silvers #21459: (763) 259-0762
Taco Bell: (763) 259-0762
Taco Bell: (651) 770-0978
Taco Bell: (952) 226-6210
Taco Bell: (763) 383-1103
TACO BELL: (763) 477-4096
Taco Bell: (952) 854-5255
Taco Bell: (952) 544-0128
Taco Bell Baker Center: (612) 359-9527
Taco Bell Express: (763) 542-9219
Taco Bell Express: (952) 888-6292
Taco Bell Restaurants: (763) 757-8976
Taco Bell Restaurants: (763) 502-0399
Taco Bell Restaurants: (952) 953-4553
Taco Bell Restaurants: (952) 233-1561
Taco Bell Restaurants: (952) 892-6670
$

jeffreybluml · 05-13-2005, 07:04 AM

Really appreciate your time PTrenholme,

Couple questions, (sorry)

Foremost, and I realize this will show just how little I know, but what do I do with that code? I pasted it into a file I named "test" in the directory, made it executable, edited the first line to:

first: point to the phone.html doc
second: point to phone.awk, and copied phone.html to phone.awk (I know that was silly, grasping at straws already at this point)
third: removed the "$" at the beginning of the first line for both above examples

After each of these, I did:

./test

and more often than not it just spit out the html at me or gave me:

./test: line 1: $: command not found
./test: line 3: span: No such file or directory
./test: line 5: name: command not found
./test: line 6: syntax error near unexpected token `$0,'
./test: line 6: ` if (match($0, /(> )([^<]+)(<\/span> )/, vals) > 0) {'

So, I'm obviously not implementing this correctly, and I'm feeling a little stupid again.

Next, and I hope this doesn't make me sound greedy considering the work you've already done for me, but is there a way to do this that will capture the two lines above the phone number as well? I'd like to get the addresses returned too. Preferably it would then list the name, the next two lines would be the address, and the last line for each would be the phone number. The order is of the least importance; I'd just like to get all the info to the page...

Again, thanks for helping me this far!

Jeff

PTrenholme · 05-13-2005, 09:37 AM

Let me think about the address -- it should be no problem.
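
A rough first idea, not tried against the real page: keep the last two lines in a small buffer and print them when the 'phone line turns up. This assumes the address really is the two lines directly above the number, and the sample listing below is invented for illustration.

```shell
#!/bin/sh
# Hypothetical, simplified listing; the real page layout is an assumption.
listing='<span class="listingname">Taco Bell</span>
123 Main St<br>
Anytown, MN 55555<br>
<b>(952) 758-5252</b>'

result=$(printf '%s\n' "$listing" | awk '
    # Remember the current name whenever a listingname line goes by.
    /listingname/ { name = $0; gsub(/<[^>]*>/, "", name) }
    # On a phone line, emit the name, the two buffered lines, then the number.
    /<b>\(/ {
        phone = $0; gsub(/<[^>]*>/, "", phone)
        print name; print prev2; print prev1; print phone
    }
    # Slide the two-line window, stripping tags from each stored line.
    { line = $0; gsub(/<[^>]*>/, "", line); prev2 = prev1; prev1 = line }
')
printf '%s\n' "$result"
```

If the page puts other markup between the address and the number, the window size would need adjusting.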

As to the implementation, the lines with the "$" in them indicate the commands typed in the terminal window. (The $ is the command line prompt sequence normally used, or, at least, the last character of that sequence. What comes after it is the command you type. Conventionally, "$" indicates "user mode" whilst "#" denotes "superuser or root mode.")

So, go back and copy everything after the "cat" command ("cat" is Unix for what DOS chose to call "type") up to (but not including) the next "$" into a file called "test.awk," which is not an executable file, just a regular text file.

Then enter the "gawk -f test.awk <your html file>" command. Again, do this from a command prompt in a terminal window. (What you're doing is starting the gawk interpreter with the file [-f] "test.awk" of commands, and applying those commands to the last argument.) Obviously [I hope], the awk file name and extension are entirely up to you, as is the last file name. The .awk extension on the test files is just a convention.
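
To make the steps concrete, here is a small stand-alone sketch of the same workflow: save a program as a plain text file, then hand it to the interpreter with -f. The file names demo.awk and page.html are placeholders, and the one-line program is a simplified stand-in. It runs under any awk, whereas the script posted earlier relies on the three-argument match(), which is a gawk extension.

```shell
#!/bin/sh
# Hypothetical file names; substitute your own.
workdir=$(mktemp -d)

# 1. Save the awk program as a plain text file (no execute bit needed).
cat > "$workdir/demo.awk" <<'EOF'
# Print the text between the tags of each listingname line.
/listingname/ { gsub(/<[^>]*>/, ""); print }
EOF

# 2. A stand-in for the downloaded HTML page.
printf '%s\n' '<span class="listingname">Taco Bell</span>' > "$workdir/page.html"

# 3. Run the interpreter: -f names the program file, the last argument is the input.
result=$(awk -f "$workdir/demo.awk" "$workdir/page.html")
printf '%s\n' "$result"
rm -rf "$workdir"
```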

jeffreybluml · 05-13-2005, 10:16 AM

Wheeeeeeeeeeeeeeee!!!!!!!!!!!!!!!!!!!!!!!!!!!

I think I've got it!!

It's kind of sloppy again, but this seems to do what I want...

Code:

#!/bin/bash
sudo tail -n 20 /var/log/httpd/access_log | grep  servlet | sed 's/.*GET/http\:\/\/dexonline.com/' | sed 's/HTTP\/1.1\"\ 404\ 573//' | sed 's/\ //g' > /var/www/html/files/address;
j=1; for i in $(cat /var/www/html/files/address); do address=$i; wget -o /var/www/html/wget_log_phonesearch -O /var/www/html/files/phone.html -r $i; j=$(($j+1)); done;
grep -m 30 listingname /var/www/html/files/phone.html | sed 's/<span\ class=\"listingname\">//' | sed 's/<\/span>/<br\/>/' > /var/www/html/files/names;
grep -m 30 -B 2 '([[:alnum:]][[:alnum:]][[:alnum:]]) [[:alnum:]][[:alnum:]][[:alnum:]]-[[:alnum:]][[:alnum:]][[:alnum:]][[:alnum:]]' /var/www/html/files/phone.html | sed 's/<br/<br\//g' | sed 's/--/<br\/>/g' | sed 's/&\ //g' > /var/www/html/files/numbers;
echo "" > /var/www/html/files/tempnames;
echo "" > /var/www/html/files/results.inc;
g=1;l=4;
until [ $g -eq 30 ];
 do head -n $g /var/www/html/files/names >> /var/www/html/files/tempnames; tail -n 1 /var/www/html/files/tempnames >> /var/www/html/files/results.inc; head -n $l /var/www/html/files/numbers | tail -n 4 >> /var/www/html/files/results.inc; l=$(($l+4));
let "g = $g + 1";
done;

This properly returns all the listings, preceded by their correct name. Woohoo!!!!

Thanks again for the help. I feel bad for not using the awk method you spent time coming up with, but I just couldn't stop tinkering around with grep, and then I had a moment of clarity and - poof! - I had it right.

Still wish there was a way to get rid of the step wherein I request the nonexistent URL from my server in order to get said URL as a variable for wget...any expertise there?

Thanks again...