LinuxQuestions.org
Old 05-12-2005, 07:51 AM   #1
jeffreybluml
Member
 
Registered: Mar 2004
Location: Minnesota
Distribution: Fedora Core 1, Mandrake 10
Posts: 405

Rep: Reputation: 30
force grep to keep its place in file for next iteration?


Hmmm, not sure how to put this...

I have a large html file with multiple phone number results in it. For each phone number listing, the name (e.g. Taco Bell) is separated from the corresponding phone number by many lines of html code. So, I'm grepping through it and trying to grab each name, then grab the phone number, then the next name, then phone, etc...

I'm doing this in an until loop, and currently grep grabs the exact same entry each iteration.

How can I make grep resume looking where it left off, or ignore the previously found entries?

Also, can I get grep to ignore duplicate entries? I swear I saw something about duplicate lines while reading the man page, but I can't find it now...

Here's the code:
Code:
x=1;
until [ $x -eq 30 ];
do 
let "x = $x + 1";
grep -m 1 -B 3 '([[:alnum:]][[:alnum:]][[:alnum:]]) [[:alnum:]][[:alnum:]][[:alnum:]]-[[:alnum:]][[:alnum:]][[:alnum:]][[:alnum:]]' /var/www/html/files/phone.html | sed 's/<br/<br\//g' | sed 's/--/<br\/>/g' | sed 's/&\ //g' >> /var/www/html/files/results.inc && grep -m 1 listingname /var/www/html/files/phone.html | sed 's/<span\ class=\"listingname\">//' | sed 's/<\/span>/<br\/>/' >> /var/www/html/files/results.inc;
done;
Any suggestions?

Thanks as always!

Last edited by jeffreybluml; 05-12-2005 at 07:53 AM.
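
For what it's worth, GNU grep can do what the question asks. A minimal sketch, assuming GNU grep and a regular file on standard input: with -m, grep leaves the input positioned just after the last matching line, so a loop that feeds every grep from one redirected stdin resumes where the previous grep stopped. The loop in the post restarts at the top because each grep reopens phone.html by name. The sample data and the resume_demo name are made up for illustration.

```shell
#!/bin/bash
# Hypothetical sample data standing in for phone.html.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<span class="listingname">Taco Bell</span>
<b>(952) 758-5252</b>
<span class="listingname">Taco Bell Express</span>
<b>(763) 542-9219</b>
EOF

# Assumes GNU grep: every grep below inherits the same open file on stdin,
# and "-m 1" makes grep leave the offset right after the line it matched.
resume_demo() {
  while grep -m 1 'listingname'; do
    grep -m 1 '([0-9]\{3\}) [0-9]\{3\}-[0-9]\{4\}'
  done < "$1"
}

out=$(resume_demo "$sample")
echo "$out"
rm -f "$sample"
```

If a listing has no phone number, the inner grep runs ahead to the next listing's number, so this only pairs names and numbers correctly when every listing has both.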
 
Old 05-12-2005, 07:59 AM   #2
theYinYeti
Senior Member
 
Registered: Jul 2004
Location: France
Distribution: Arch Linux
Posts: 1,897

Rep: Reputation: 61
I've always found that unix tools are hard to use, and unreliable at best, where XML-related files (such as HTML) are concerned. So:
- if your document is a well-formed XML document (e.g. an XHTML doc), then you might prefer using an XML parser, or something along the lines of my xpathRead (http://yves.gablin.club.fr/pc/www.php?lang=fr (French)).
- else you will probably have to pre-process your file, so that it becomes better suited to Unix tools.

Yves
 
Old 05-12-2005, 08:06 AM   #3
PTrenholme
Senior Member
 
Registered: Dec 2004
Location: Olympia, WA, USA
Distribution: Fedora, (K)Ubuntu
Posts: 4,149

Rep: Reputation: 330
Try one of the other text processing tools. I'd suggest either gawk or perl. AWK is probably easier . . .
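
On the duplicate-lines part of the question, awk is one option. A minimal sketch, assuming "duplicate" means an exact repeat of a whole line; the dedupe name and the sample lines are only for illustration.

```shell
#!/bin/bash
# Print each distinct line once, keeping the first occurrence and the original order.
dedupe() {
  awk '!seen[$0]++' "$@"
}

# Illustrative sample: the last line is an exact repeat of the first.
deduped=$(printf '%s\n' \
  'Taco Bell: (763) 259-0762' \
  'Taco Bell: (651) 770-0978' \
  'Taco Bell: (763) 259-0762' | dedupe)
echo "$deduped"
```

If the order doesn't matter, sort -u gives the same set of lines; uniq on its own only drops repeats that are adjacent.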
 
Old 05-12-2005, 08:13 AM   #4
jeffreybluml
Original Poster
Okay, thanks. In an effort to reduce my learning curve, anybody feel up to showing me how I'd accomplish this in awk?

If you'd like to see what the html page I'm stripping looks like, here's an example page (it's a search from dexonline.com)

http://dexonline.com/servlet/ActionS...dingAreas=true

Greatly appreciate it...

Last edited by jeffreybluml; 05-12-2005 at 08:15 AM.
 
Old 05-12-2005, 10:11 PM   #5
PTrenholme
OK, here's an example:
Code:
$ cat tmp/test.awk
# Find the (next) listingname
/<span class=\"listingname\">/ {
# Get the name from the line. NOTE: Assumes the name is on the same physical line as "listingname". (The three-argument match() is a gawk extension.)
    name = "--Unnamed--";
    if (match($0, /(>)([^<]+)(<\/span>)/, vals) > 0) {
      name = vals[2];
    }
# Start searching for the 'phone number
    while (getline > 0) {
# No number found if we've hit the next "listingname"
      if ($0 ~ /<span class=\"listingname\">/ ) {
        print name ": No 'phone number listed";
        name = "--Unnamed--";
        if (match($0, /(>)([^<]+)(<\/span>)/, vals) > 0) {
          name = vals[2];
        }
        if (getline <= 0) {
          break;
        }
      }
# Look for the "bold" delimiter followed by an open paren. (This is, of course, a kludge.)
      if ($0 ~ /<b>\(/) {
# Break the line apart by the HTML delimiters
        FS="[<>]";
        $0 = $0;
# And the 'phone number should be the third field
        phone = (NF > 3) ? $3 : "Unspecified";
        print name ": " phone;
# Restore the field separator
        FS=" ";
# And go back to find the next listingname . . .
        break;
      }
    }
  }
And here's the output from your example html:
Code:
$ gawk -f tmp/test.awk Downloads/html/ActionServlet.html
Holiday/Taco Bell: (952) 758-5252
Taco Bell: (651) 982-0434
Taco Bell: (715) 386-3006
Taco Bell: (952) 470-8909
Taco Bell/Long John Silvers #21459: (763) 259-0762
Taco Bell: (763) 259-0762
Taco Bell: (651) 770-0978
Taco Bell: (952) 226-6210
Taco Bell: (763) 383-1103
TACO BELL: (763) 477-4096
Taco Bell: (952) 854-5255
Taco Bell: (952) 544-0128
Taco Bell Baker Center: (612) 359-9527
Taco Bell Express: (763) 542-9219
Taco Bell Express: (952) 888-6292
Taco Bell Restaurants: (763) 757-8976
Taco Bell Restaurants: (763) 502-0399
Taco Bell Restaurants: (952) 953-4553
Taco Bell Restaurants: (952) 233-1561
Taco Bell Restaurants: (952) 892-6670
$
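
The three-argument match() used above only exists in gawk. For other awks, the name extraction could be sketched with the two-argument match() plus RSTART and RLENGTH, which are standard. This assumes, like the script above, that the name sits on the same line as its closing </span>; extract_name is a made-up helper name.

```shell
#!/bin/bash
# Pull the text between the first ">" that is followed by plain text and "</span>".
extract_name() {
  awk '
    match($0, />[^<]+<\/span>/) {
      # RSTART is the position of ">", RLENGTH spans ">name</span>;
      # skip the leading ">" and drop the 7-character "</span>".
      print substr($0, RSTART + 1, RLENGTH - 8)
    }
  ' "$@"
}

name_out=$(printf '%s\n' '<span class="listingname">Taco Bell</span>' | extract_name)
echo "$name_out"
```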
 
Old 05-13-2005, 07:04 AM   #6
jeffreybluml
Original Poster
Really appreciate your time PTrenholme,

Couple questions, (sorry)

Foremost, and I realize this will show just how little I know, but what do I do with that code? I pasted it into a file I named "test" in the directory, made it executable, edited the first line to:

first: point to the phone.html doc
second: point to phone.awk, and copied phone.html to phone.awk (I know that was silly, grasping at straws already at this point)
third: removed the "$" at the beginning of the first line for both above examples

After each of these, I did:

./test

and more often than not it just spit out the html at me or gave me:

./test: line 1: $: command not found
./test: line 3: span: No such file or directory
./test: line 5: name: command not found
./test: line 6: syntax error near unexpected token `$0,'
./test: line 6: ` if (match($0, /(> )([^<]+)(<\/span> )/, vals) > 0) {'

So, I'm obviously not implementing this correctly, and I'm feeling a little stupid again.

Next, and I hope this doesn't make me sound greedy considering the work you've already done for me, but is there a way to do this that will capture the two lines above the phone number as well? I'd like to get the addresses returned too. Preferably it would then list the name, the next two lines would be the address, and then the last line for each would be the phone number. The order is of the least importance; I'd just like to get all the info to the page...

Again, thanks for helping me this far!

Jeff
 
Old 05-13-2005, 09:37 AM   #7
PTrenholme
Let me think about the address -- it should be no problem.
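
One possible shape for it, as a rough sketch only: remember the previous two lines while scanning, and print them when the phone line turns up. This assumes the address really is the two lines directly above the phone number, that the phone line starts with "<b>(" as in the example page, and that stripping tags leaves usable text. with_address, strip and the sample address are invented for illustration.

```shell
#!/bin/bash
with_address() {
  awk '
    function strip(s) {
      gsub(/<[^>]*>/, "", s)            # drop HTML tags
      gsub(/^[ \t]+|[ \t]+$/, "", s)    # trim surrounding whitespace
      return s
    }
    /class="listingname"/ { name = strip($0) }
    /<b>\(/ {
      # name, two address lines, phone number, then a blank separator line
      print name; print prev2; print prev1; print strip($0); print ""
    }
    { prev2 = prev1; prev1 = strip($0) }  # slide the two-line window forward
  ' "$@"
}

# Made-up listing; only the phone number comes from the example output.
listing_out=$(printf '%s\n' \
  '<span class="listingname">Taco Bell</span>' \
  '<td>filler</td>' \
  '123 Example St<br>' \
  'Anytown, MN 55555<br>' \
  '<b>(952) 758-5252</b>' | with_address)
echo "$listing_out"
```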

As to the implementation, the lines with the "$" in them indicate the commands typed in the terminal window. (The $ is the command line prompt sequence normally used, or, at least, the last character of that sequence. What comes after it is the command you type. Conventionally, "$" indicates "user mode" whilst "#" denotes "superuser or root mode.")

So, go back and copy everything after the "cat" command ("cat" is Unix for what DOS chose to call "type") up to (but not including) the next "$" into a file called "test.awk," which is not an executable file, just a regular text file.

Then enter the "gawk -f test.awk <your html file>" command. Again, do this from a command prompt in a terminal window. (What you're doing is starting the gawk interpreter with the file [-f] "test.awk" of commands, and applying those commands to the last argument.) (Obviously [I hope], the awk file name and extension are entirely up to you, as is the last file name. The .awk extension on the test files is just a convention.)
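
Put together, the steps look roughly like this. It is only a sketch: the one-line awk program and the two-line html file are stand-ins, the temporary directory is just to keep the example self-contained, and plain awk is used in case gawk isn't installed (gawk -f works the same way).

```shell
#!/bin/bash
workdir=$(mktemp -d)

# Step 1: the awk commands go into an ordinary text file (no "$", not executable).
cat > "$workdir/test.awk" <<'EOF'
# Print every line that mentions listingname, prefixed with its line number
/listingname/ { print NR ": " $0 }
EOF

# Stand-in for the downloaded html page.
printf '%s\n' '<html>' '<span class="listingname">Taco Bell</span>' > "$workdir/page.html"

# Step 2: start the interpreter, name the program file with -f, and give the input file last.
run_out=$(awk -f "$workdir/test.awk" "$workdir/page.html")
echo "$run_out"

rm -rf "$workdir"
```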
 
Old 05-13-2005, 10:16 AM   #8
jeffreybluml
Original Poster
Wheeeeeeeeeeeeeeee!!!!!!!!!!!!!!!!!!!!!!!!!!!

I think I've got it!!

It's kind of sloppy again, but this seems to do what I want...

Code:
#!/bin/bash
# Rebuild the dexonline URL(s) from the servlet requests in the last 20 access_log entries
sudo tail -n 20 /var/log/httpd/access_log | grep  servlet | sed 's/.*GET/http\:\/\/dexonline.com/' | sed 's/HTTP\/1.1\"\ 404\ 573//' | sed 's/\ //g' > /var/www/html/files/address;
# Download each URL; each wget run writes to the same phone.html
j=1; for i in $(cat /var/www/html/files/address); do
  address=$i;
  wget -o /var/www/html/wget_log_phonesearch -O /var/www/html/files/phone.html -r $i;
  j=$(($j+1));
done;
# Up to 30 names, then up to 30 phone numbers with the 2 lines above each; grep's "--" group separator becomes <br/>, so each listing normally takes 4 lines
grep -m 30 listingname /var/www/html/files/phone.html | sed 's/<span\ class=\"listingname\">//' | sed 's/<\/span>/<br\/>/' > /var/www/html/files/names;
grep -m 30 -B 2 '([[:alnum:]][[:alnum:]][[:alnum:]]) [[:alnum:]][[:alnum:]][[:alnum:]]-[[:alnum:]][[:alnum:]][[:alnum:]][[:alnum:]]' /var/www/html/files/phone.html | sed 's/<br/<br\//g' | sed 's/--/<br\/>/g' | sed 's/&\ //g' > /var/www/html/files/numbers;
echo "" > /var/www/html/files/tempnames;
echo "" > /var/www/html/files/results.inc;
# Interleave: the g-th name, then the g-th 4-line block from numbers (g runs from 1 to 29)
g=1;l=4;
until [ $g -eq 30 ]; do
  head -n $g /var/www/html/files/names >> /var/www/html/files/tempnames;
  tail -n 1 /var/www/html/files/tempnames >> /var/www/html/files/results.inc;
  head -n $l /var/www/html/files/numbers | tail -n 4 >> /var/www/html/files/results.inc;
  l=$(($l+4));
  let "g = $g + 1";
done;
This properly returns all the listings, preceded by their correct name. Woohoo!!!!

Thanks again for the help. I feel bad for not using the awk method you spent time coming up with, but I just couldn't stop tinkering around with grep, and then I had a moment of clarity and - poof! - I had it right.
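
For comparison, the head/tail interleaving step could be sketched as a single awk pass over the two intermediate files. This is only an illustration, assuming the names file has one name per line and the numbers file has exactly 4 lines per listing; interleave and the sample contents are made up.

```shell
#!/bin/bash
# $1 = names file (one name per line), $2 = numbers file (4 lines per listing)
interleave() {
  awk -v block=4 '
    NR == FNR { names[FNR] = $0; next }    # first file: remember every name
    (FNR - 1) % block == 0 {               # start of a new 4-line block
      print names[(FNR - 1) / block + 1]
    }
    { print }                              # then the block line itself
  ' "$1" "$2"
}

tmp=$(mktemp -d)
printf '%s\n' 'Taco Bell<br/>' 'Taco Bell Express<br/>' > "$tmp/names"
printf '%s\n' 'addr 1a' 'addr 1b' '(952) 758-5252' '<br/>' \
              'addr 2a' 'addr 2b' '(763) 542-9219' '<br/>' > "$tmp/numbers"
merged=$(interleave "$tmp/names" "$tmp/numbers")
echo "$merged"
rm -rf "$tmp"
```

This avoids re-reading both files on every iteration, though it pairs entries purely by position, just as the loop above does.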


Still wish there was a way to get rid of the step wherein I request the non-existent URL from my server in order to get said URL as a variable for wget...any expertise there?

Thanks again...

Last edited by jeffreybluml; 05-13-2005 at 10:18 AM.
 
  

