05-12-2005, 07:51 AM
|
#1
|
Member
Registered: Mar 2004
Location: Minnesota
Distribution: Fedora Core 1, Mandrake 10
Posts: 405
Rep:
|
force grep to keep its place in file for next iteration?
Hmmm, not sure how to put this...
I have a large html file with multiple phone number results in it. For each phone number listing, the name (e.g. Taco Bell) is separated from the corresponding phone number by many lines of html code. So, I'm grepping through it and trying to grab each name, then grab the phone number, then the next name, then phone, etc...
I'm doing this in an until loop, and currently grep grabs the exact same entry each iteration.
How can I make grep resume looking where it left off, or ignore the previously found entries?
Also, can I get grep to ignore duplicate entries? I swear I saw something about duplicate lines while reading the man page, but I can't find it now...
Here's the code:
Code:
x=1;
until [ $x -eq 30 ];
do
let "x = $x + 1";
grep -m 1 -B 3 '([[:alnum:]][[:alnum:]][[:alnum:]]) [[:alnum:]][[:alnum:]][[:alnum:]]-[[:alnum:]][[:alnum:]][[:alnum:]][[:alnum:]]' /var/www/html/files/phone.html | sed 's/<br/<br\//g' | sed 's/--/<br\/>/g' | sed 's/&\ //g' >> /var/www/html/files/results.inc && grep -m 1 listingname /var/www/html/files/phone.html | sed 's/<span\ class=\"listingname\">//' | sed 's/<\/span>/<br\/>/' >> /var/www/html/files/results.inc;
done;
Any suggestions?
Thanks as always!
Last edited by jeffreybluml; 05-12-2005 at 07:53 AM.
|
|
|
05-12-2005, 07:59 AM
|
#2
|
Senior Member
Registered: Jul 2004
Location: France
Distribution: Arch Linux
Posts: 1,897
Rep:
|
I've always found that Unix tools are hard to use, and unreliable at best, where XML-related files (such as HTML) are concerned. So:
- if your document is a well-formed XML document (e.g. an XHTML doc), then you might prefer using an XML parser, or something along the lines of my xpathRead ( http://yves.gablin.club.fr/pc/www.php?lang=fr (French)).
- else you will probably have to pre-process your file, so that it is better suited to Unix tools.
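One way to do that kind of pre-processing is to break the markup so that every tag starts on its own line, which gives line-oriented tools something stable to match. This is only a minimal sketch: the helper name `one_tag_per_line` and the sample input are made up for illustration, and it is a text-level trick, not an HTML parser.

```shell
#!/bin/bash
# one_tag_per_line (hypothetical helper): insert a newline before every "<"
# so each tag begins its own line, then drop lines that are empty.
one_tag_per_line() {
    sed 's/</\
</g' | sed '/^[[:space:]]*$/d'
}

# Demo on an inline sample (not the real search-result page).
printf '%s\n' '<div><span class="listingname">Taco Bell</span><b>(651) 982-0434</b></div>' |
    one_tag_per_line
```

After this step a name and its closing tag, or a phone number and its closing tag, no longer share a line with unrelated markup, so simple patterns have less to trip over.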
Yves
|
|
|
05-12-2005, 08:06 AM
|
#3
|
Senior Member
Registered: Dec 2004
Location: Olympia, WA, USA
Distribution: Fedora, (K)Ubuntu
Posts: 4,187
|
Try one of the other text processing tools. I'd suggest either gawk or perl. AWK is probably easier . . .
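If staying with grep is preferred: the loop in the first post repeats itself because every `grep -m 1 ... file` call opens the file again from the top. The GNU grep manual documents a way around that. When standard input is a regular file, `-m` leaves the read position just after the last matching line, so a loop that shares one redirected stdin picks up where the previous call stopped. A minimal sketch, assuming GNU grep (other greps may not do this); the function name `each_phone` and the sample data are invented for illustration.

```shell
#!/bin/bash
# each_phone FILE: print every phone-number line, one grep call per match.
# Relies on GNU grep repositioning stdin after a -m match; stdin must be a
# regular file, hence the redirection on "done" rather than a pipe.
each_phone() {
    local line
    while line=$(grep -m 1 -E '\([0-9]{3}\) [0-9]{3}-[0-9]{4}'); do
        printf 'found: %s\n' "$line"
    done < "$1"
}

# Demo with a throwaway sample file.
sample=$(mktemp)
printf '%s\n' 'Taco Bell' 'filler' '(651) 982-0434' \
              'Taco Bell Express' '(763) 542-9219' > "$sample"
each_phone "$sample"
rm -f "$sample"
```

The loop ends when grep finds nothing more and returns a non-zero status. A second grep for the name could sit inside the same loop body and share the same stdin.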
|
|
|
05-12-2005, 08:13 AM
|
#4
|
Member
Registered: Mar 2004
Location: Minnesota
Distribution: Fedora Core 1, Mandrake 10
Posts: 405
Original Poster
Rep:
|
Okay, thanks. In an effort to reduce my learning curve, anybody feel up to showing me how I'd accomplish this in awk?
If you'd like to see what the html page I'm stripping looks like, here's an example page (it's a search from dexonline.com)
http://dexonline.com/servlet/ActionS...dingAreas=true
Greatly appreciate it...
Last edited by jeffreybluml; 05-12-2005 at 08:15 AM.
|
|
|
05-12-2005, 10:11 PM
|
#5
|
Senior Member
Registered: Dec 2004
Location: Olympia, WA, USA
Distribution: Fedora, (K)Ubuntu
Posts: 4,187
|
OK, here's an example:
Code:
$ cat tmp/test.awk
# Find the (next) listingname
/<span class=\"listingname\">/ {
  # Get the name from the field. NOTE: Assumes the name is in the same physical line as "listingname"
  name = "--Unnamed--";
  if (match($0, /(>)([^<]+)(<\/span>)/, vals) > 0) {
    name = vals[2];
  }
  # Start searching for the 'phone number
  while (getline > 0) {
    # No number found if we've hit the next "listingname"
    if ($0 ~ /<span class=\"listingname\">/) {
      print name ": No 'phone number listed";
      name = "--Unnamed--";
      if (match($0, /(>)([^<]+)(<\/span>)/, vals) > 0) {
        name = vals[2];
      }
      if (getline <= 0) {
        break;
      }
    }
    # Look for the "bold" delimiter followed by an open paren. (This is, of course, a kludge.)
    if ($0 ~ /<b>\(/) {
      # Break the line apart by the HTML delimiters
      FS = "[<>]";
      $0 = $0;
      # And the 'phone number should be the third field
      phone = (NF > 3) ? $3 : "Unspecified";
      print name ": " phone;
      # Restore the field separator
      FS = " ";
      # And go back to find the next listingname . . .
      break;
    }
  }
}
And here's the output from your example html:
Code:
$ gawk -f tmp/test.awk Downloads/html/ActionServlet.html
Holiday/Taco Bell: (952) 758-5252
Taco Bell: (651) 982-0434
Taco Bell: (715) 386-3006
Taco Bell: (952) 470-8909
Taco Bell/Long John Silvers #21459: (763) 259-0762
Taco Bell: (763) 259-0762
Taco Bell: (651) 770-0978
Taco Bell: (952) 226-6210
Taco Bell: (763) 383-1103
TACO BELL: (763) 477-4096
Taco Bell: (952) 854-5255
Taco Bell: (952) 544-0128
Taco Bell Baker Center: (612) 359-9527
Taco Bell Express: (763) 542-9219
Taco Bell Express: (952) 888-6292
Taco Bell Restaurants: (763) 757-8976
Taco Bell Restaurants: (763) 502-0399
Taco Bell Restaurants: (952) 953-4553
Taco Bell Restaurants: (952) 233-1561
Taco Bell Restaurants: (952) 892-6670
$
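On the earlier side question about duplicates: as far as I know grep itself has no option for dropping them, so the man page memory may be of `uniq` (which removes adjacent repeated lines) or `sort -u`. A small sketch that removes repeated lines while keeping the original order, using a common awk idiom; `dedupe_lines` is just an illustrative name.

```shell
#!/bin/bash
# dedupe_lines: print each distinct input line once, in first-seen order.
# seen[$0]++ evaluates to 0 (false) the first time a line appears, so
# !seen[$0]++ is true only for first occurrences.
dedupe_lines() {
    awk '!seen[$0]++'
}

printf '%s\n' 'Taco Bell: (763) 259-0762' \
              'Taco Bell: (651) 770-0978' \
              'Taco Bell: (763) 259-0762' | dedupe_lines
```

Note that this only drops lines that are identical character for character; two listings with the same number but different names would both be kept.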
|
|
|
05-13-2005, 07:04 AM
|
#6
|
Member
Registered: Mar 2004
Location: Minnesota
Distribution: Fedora Core 1, Mandrake 10
Posts: 405
Original Poster
Rep:
|
Really appreciate your time PTrenholme,
Couple questions, (sorry)
Foremost, and I realize this will show just how little I know, but what do I do with that code? I pasted it into a file I named "test" in the directory, made it executable, edited the first line to:
first: point to the phone.html doc
second: point to phone.awk, and copied phone.html to phone.awk (I know that was silly, grasping at straws already at this point)
third: removed the "$" at the beginning of the first line for both above examples
After each of these, I did:
./test
and more often than not it just spit out the html at me or gave me:
./test: line 1: $: command not found
./test: line 3: span: No such file or directory
./test: line 5: name: command not found
./test: line 6: syntax error near unexpected token `$0,'
./test: line 6: ` if (match($0, /(> )([^<]+)(<\/span> )/, vals) > 0) {'
So, I'm obviously not implementing this correctly, and I'm feeling a little stupid again.
Next, and I hope this doesn't make me sound greedy considering the work you've already done for me, but is there a way to do this that will capture the two lines above the phone number as well? I'd like to get the addresses returned too. Preferably it would list the name, then the address on the next two lines, and then the phone number on the last line for each. The order is of the least importance; I'd just like to get all the info to the page...
Again, thanks for helping me this far!
Jeff
|
|
|
05-13-2005, 09:37 AM
|
#7
|
Senior Member
Registered: Dec 2004
Location: Olympia, WA, USA
Distribution: Fedora, (K)Ubuntu
Posts: 4,187
|
Let me think about the address -- it should be no problem.
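A rough sketch of how the two lines above each phone number could be kept, assuming (as described in the request) that the address really does sit on the two input lines immediately before the phone line. It uses only POSIX awk features; `with_address` and the sample input are invented for illustration, and a real page would likely need its HTML tags stripped as well.

```shell
#!/bin/bash
# with_address: for each listing print the name line, the two lines that
# precede the phone number, then the phone line itself and a blank line.
with_address() {
    awk '
        /class="listingname"/ { name = $0; prev1 = prev2 = ""; next }
        /\([0-9][0-9][0-9]\) [0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]/ {
            print name; print prev2; print prev1; print $0; print ""
            next
        }
        { prev2 = prev1; prev1 = $0 }   # remember the last two other lines
    '
}

# Demo on made-up input.
printf '%s\n' '<span class="listingname">Taco Bell</span>' \
              '<td>misc markup</td>' \
              '123 Main St<br>' \
              'Anytown, MN 55000<br>' \
              '<b>(651) 982-0434</b>' | with_address
```

The idea is just a two-line sliding window: every line that is neither a name nor a phone number pushes the older remembered line out.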
As to the implementation, the lines with the "$" in them indicate the commands typed in the terminal window. (The $ is the command line prompt sequence normally used, or, at least, the last character of that sequence. What comes after it is the command you type. Conventionally, "$" indicates "user mode" whilst "#" denotes "superuser or root mode.")
So, go back and copy everything after the "cat" command ("cat" is Unix for what DOS chose to call "type") up to (but not including) the next "$" into a file called "test.awk," which is not an executable file, just a regular text file.
Then enter the "gawk -f test.awk <your html file>" command. Again, do this from a command prompt in a terminal window. (What you're doing is starting the gawk interpreter with the file [-f] "test.awk" of commands, and applying those commands to the last argument.) Obviously (I hope), the awk file name and extension are entirely up to you, as is the last file name. The .awk extension on the test files is just a convention.
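To make those steps concrete, here is a tiny self-contained session that creates a throwaway awk program and a sample input in a temporary directory, then runs one against the other. The file names are arbitrary examples. It uses plain `awk` so it runs even where gawk isn't installed; the longer script earlier in the thread uses the three-argument `match()`, which is a gawk extension, so that one does need `gawk -f`.

```shell
#!/bin/bash
# Work in a scratch directory so nothing real is overwritten.
dir=$(mktemp -d)

# 1. The awk program goes in an ordinary text file (no chmod needed).
cat > "$dir/count.awk" <<'EOF'
# Count the lines that mention "listingname"
/listingname/ { n++ }
END { printf "%d listing(s)\n", n }
EOF

# 2. Some input to run it on.
printf '%s\n' '<span class="listingname">Taco Bell</span>' \
              '<b>(651) 982-0434</b>' > "$dir/phone.html"

# 3. Run the interpreter: -f names the program file, and the last
#    argument is the data file the program is applied to.
result=$(awk -f "$dir/count.awk" "$dir/phone.html")
echo "$result"

rm -rf "$dir"
```

Running `./count.awk` directly would hand the file to the shell instead of awk, which is where errors like "command not found" come from.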
|
|
|
05-13-2005, 10:16 AM
|
#8
|
Member
Registered: Mar 2004
Location: Minnesota
Distribution: Fedora Core 1, Mandrake 10
Posts: 405
Original Poster
Rep:
|
Wheeeeeeeeeeeeeeee!!!!!!!!!!!!!!!!!!!!!!!!!!!
I think I've got it!!
It's kind of sloppy again, but this seems to do what I want...
Code:
#!/bin/bash
sudo tail -n 20 /var/log/httpd/access_log | grep servlet | sed 's/.*GET/http\:\/\/dexonline.com/' | sed 's/HTTP\/1.1\"\ 404\ 573//' | sed 's/\ //g' > /var/www/html/files/address;
j=1; for i in $(cat /var/www/html/files/address); do address=$i; wget -o /var/www/html/wget_log_phonesearch -O /var/www/html/files/phone.html -r $i; j=$(($j+1)); done;
grep -m 30 listingname /var/www/html/files/phone.html | sed 's/<span\ class=\"listingname\">//' | sed 's/<\/span>/<br\/>/' > /var/www/html/files/names;
grep -m 30 -B 2 '([[:alnum:]][[:alnum:]][[:alnum:]]) [[:alnum:]][[:alnum:]][[:alnum:]]-[[:alnum:]][[:alnum:]][[:alnum:]][[:alnum:]]' /var/www/html/files/phone.html | sed 's/<br/<br\//g' | sed 's/--/<br\/>/g' | sed 's/&\ //g' > /var/www/html/files/numbers;
echo "" > /var/www/html/files/tempnames;
echo "" > /var/www/html/files/results.inc;
g=1;l=4;
until [ $g -eq 30 ];
do head -n $g /var/www/html/files/names >> /var/www/html/files/tempnames; tail -n 1 /var/www/html/files/tempnames >> /var/www/html/files/results.inc; head -n $l /var/www/html/files/numbers | tail -n 4 >> /var/www/html/files/results.inc; l=$(($l+4));
let "g = $g + 1";
done;
This properly returns all the listings, preceded by their correct name. Woohoo!!!!
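The head/tail loop above re-reads both files on every pass. A possible tidier variant, offered only as a sketch: read one name and then the next four lines of the numbers file in step, using two extra file descriptors. It assumes, as the script above does, that each entry in the numbers file occupies four lines; `merge_listings` and the sample files are hypothetical stand-ins for the real paths.

```shell
#!/bin/bash
# merge_listings NAMES NUMBERS: print each name followed by its block of
# up to four lines from NUMBERS, reading both files exactly once.
merge_listings() {
    local name line i
    while IFS= read -r name <&3; do
        printf '%s\n' "$name"
        for i in 1 2 3 4; do
            IFS= read -r line <&4 || break   # the last block may be short
            printf '%s\n' "$line"
        done
    done 3< "$1" 4< "$2"
}

# Demo with throwaway files.
tmp=$(mktemp -d)
printf '%s\n' 'Taco Bell<br/>' 'Taco Bell Express<br/>' > "$tmp/names"
printf '%s\n' 'addr A1' 'addr A2' '(651) 982-0434' '<br/>' \
              'addr B1' 'addr B2' '(763) 542-9219' > "$tmp/numbers"
merge_listings "$tmp/names" "$tmp/numbers"
rm -rf "$tmp"
```

Because the loop stops when the names file runs out, there is no need for a hard-coded count of 30.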
Thanks again for the help. I feel bad for not using the awk method you spent time coming up with, but I just couldn't stop tinkering around with grep, and then I had a moment of clarity and - poof! - I had it right.
Still wish there was a way to get rid of the step wherein I request the nonexistent URL from my server in order to get said URL as a variable for wget...any expertise there?
Thanks again...
Last edited by jeffreybluml; 05-13-2005 at 10:18 AM.
|
|
|
All times are GMT -5.