force grep to keep its place in file for next iteration?
Hmmm, not sure how to put this...
I have a large HTML file with multiple phone number results in it. For each phone number listing, the name (e.g. Taco Bell) is separated from the corresponding phone number by many lines of HTML code. So, I'm grepping through it and trying to grab each name, then grab the phone number, then the next name, then its phone number, etc...
I'm doing this in an until loop, and currently grep grabs the exact same entry on each iteration.
How can I make grep resume looking where it left off, or ignore the previously found entries?
Also, can I get grep to ignore duplicate entries? I swear I saw something about duplicate lines while reading the man page, but I can't find it now...
Here's the code:
Code:
x=1;
until [ $x -eq 30 ];
do
let "x = $x + 1";
grep -m 1 -B 3 '([[:alnum:]][[:alnum:]][[:alnum:]]) [[:alnum:]][[:alnum:]][[:alnum:]]-[[:alnum:]][[:alnum:]][[:alnum:]][[:alnum:]]' /var/www/html/files/phone.html | sed 's/<br/<br\//g' | sed 's/--/<br\/>/g' | sed 's/&\ //g' >> /var/www/html/files/results.inc && grep -m 1 listingname /var/www/html/files/phone.html | sed 's/<span\ class=\"listingname\">//' | sed 's/<\/span>/<br\/>/' >> /var/www/html/files/results.inc;
done;
Any suggestions?
Thanks as always!
Last edited by jeffreybluml; 05-12-2005 at 07:53 AM.
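A note on the two questions above: grep keeps no position between runs, so every call inside the loop starts again at the top of the file and `-m 1` returns the same first match each time. One possible way around that, sketched here under assumptions, is to call grep once and let it return every match in file order, then filter repeats afterwards. The sample markup is invented, and the pattern uses `[0-9]` with `grep -E` rather than the `[[:alnum:]]` runs in the original command.

```shell
#!/bin/bash
# Sketch only: sample.html and its markup are made-up stand-ins for phone.html.
tmp=$(mktemp -d)
cat > "$tmp/sample.html" <<'EOF'
<span class="listingname">Taco Bell</span>
<td>filler</td>
<b>(555) 123-4567</b>
<span class="listingname">Pizza Hut</span>
<td>filler</td>
<b>(555) 765-4321</b>
<span class="listingname">Taco Bell</span>
<b>(555) 123-4567</b>
EOF

# One grep call returns every matching line in file order, so no loop
# counter or "resume" position is needed. -E enables {3}-style repetition.
all_matches() {
    grep -E 'listingname|\([0-9]{3}\) [0-9]{3}-[0-9]{4}' "$1"
}

# Drop repeated lines while keeping the first occurrence of each.
dedupe() {
    awk '!seen[$0]++'
}

all_matches "$tmp/sample.html" | dedupe
```

`uniq` only removes adjacent duplicates and `sort -u` reorders the lines, which is why the awk filter is used in this sketch.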
I've always found that Unix tools are hard to use and unreliable at best where XML-related files (such as HTML) are concerned. So:
- if your document is a well-formed XML document (e.g. an XHTML doc), then you might prefer using an XML parser, or something along the lines of my xpathRead (http://yves.gablin.club.fr/pc/www.php?lang=fr (French)).
- otherwise, you will probably have to pre-process your file so that it is better suited to Unix tools.
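As an illustration of the pre-processing option, here is a minimal sketch that puts every tag at the start of its own line, so that line-oriented tools such as grep see one element per line. The function name `one_tag_per_line` and the sample markup are made up for the example, and it makes no attempt to handle comments, scripts, or a `<` inside attribute values.

```shell
#!/bin/bash
# Sketch: break the document before every '<' so each tag starts a new line.
one_tag_per_line() {
    # Join all input lines, insert a newline before each '<',
    # then discard lines that are empty or only whitespace.
    tr '\n' ' ' | awk '{ gsub(/</, "\n<"); print }' | grep -v '^[[:space:]]*$'
}

printf '%s\n' \
    '<div><span class="listingname">Taco Bell</span>' \
    '<b>(555) 123-4567</b></div>' | one_tag_per_line
```

After this step the name and the phone number each sit on a line that begins with their opening tag, whatever the original line breaks were.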
$ cat tmp/test.awk
# NOTE: the three-argument match() used below is a gawk extension.
# Find the (next) listingname
/<span class=\"listingname\">/ {
    # Get the name from the field. NOTE: Assumes the name is on the same physical line as "listingname"
    name = "--Unnamed--";
    if (match($0, /(>)([^<]+)(<\/span>)/, vals) > 0) {
        name = vals[2];
    }
    # Start searching for the 'phone number
    while (getline > 0) {
        # No number found if we've hit the next "listingname"
        if ($0 ~ /<span class=\"listingname\">/) {
            print name ": No 'phone number listed";
            name = "--Unnamed--";
            if (match($0, /(>)([^<]+)(<\/span>)/, vals) > 0) {
                name = vals[2];
            }
            if (getline <= 0) {
                break;
            }
        }
        # Look for the "bold" delimiter followed by an open paren. (This is, of course, a kludge.)
        if ($0 ~ /<b>\(/) {
            # Break the line apart by the HTML delimiters
            FS = "[<>]";
            $0 = $0;
            # And the 'phone number should be the third field
            phone = (NF > 3) ? $3 : "Unspecified";
            print name ": " phone;
            # Restore the field separator
            FS = " ";
            # And go back to find the next listingname . . .
            break;
        }
    }
}
Foremost, and I realize this will show just how little I know, but what do I do with that code? I pasted it into a file I named "test" in the directory, made it executable, and tried editing the first line in three ways:
first: pointed it to the phone.html doc
second: pointed it to phone.awk, and copied phone.html to phone.awk (I know that was silly, grasping at straws already at this point)
third: removed the "$" at the beginning of the first line for both of the above examples
After each of these, I did:
./test
and more often than not it just spat out the HTML at me or gave me:
./test: line 1: $: command not found
./test: line 3: span: No such file or directory
./test: line 5: name: command not found
./test: line 6: syntax error near unexpected token `$0,'
./test: line 6: ` if (match($0, /(> )([^<]+)(<\/span> )/, vals) > 0) {'
So, I'm obviously not implementing this correctly, and I'm feeling a little stupid again.
Next, and I hope this doesn't make me sound greedy considering the work you've already done for me, but is there a way to do this that will capture the two lines above the phone number as well? I'd like to get the addresses returned too. Preferably it would list the name, then the address on the next two lines, and then the phone number on the last line for each listing. The order is of the least importance; I'd just like to get all the info to the page...
Let me think about the address -- it should be no problem.
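One way the address could be picked up, as a rough sketch only: remember the two most recent lines while scanning, and print them when a phone number turns up. The function names and the sample layout are assumptions made for illustration; the sketch takes it for granted that the address occupies exactly the two lines directly above the number, and it uses plain awk features only.

```shell
#!/bin/bash
# Sketch: print name, the two lines above the number, then the number.
listing_blocks() {
    awk '
        function strip(s) { gsub(/<[^>]*>/, "", s); return s }   # remove tags
        BEGIN { name = "--Unnamed--" }
        /class="listingname"/ { name = strip($0) }
        /\([0-9][0-9][0-9]\) [0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]/ {
            print name
            print strip(prev2)      # two lines above the number
            print strip(prev1)      # one line above the number
            print strip($0)         # the number itself
            name = "--Unnamed--"
        }
        { prev2 = prev1; prev1 = $0 }   # slide the two-line window
    ' "$1"
}

# Made-up sample input standing in for phone.html.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<span class="listingname">Taco Bell</span>
<td>filler</td>
123 Main St<br>
Springfield, IL<br>
<b>(555) 123-4567</b>
EOF

listing_blocks "$sample"
```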
As to the implementation, the lines with the "$" in them indicate the commands typed in the terminal window. (The $ is the command line prompt sequence normally used, or, at least, the last character of that sequence. What comes after it is the command you type. Conventionally, "$" indicates "user mode" whilst "#" denotes "superuser or root mode.")
So, go back and copy everything after the "cat" command ("cat" is Unix for what DOS chose to call "type") up to (but not including) the next "$" into a file called "test.awk," which is not an executable file, just a regular text file.
Then enter the "gawk -f test.awk <your html file>" command. Again, do this from a command prompt in a terminal window. What you're doing is starting the gawk interpreter with the file [-f] "test.awk" of commands, and applying those commands to the last argument. (Obviously [I hope], the awk file name and extension are entirely up to you, as is the last file name. The .awk extension on the test files is just a convention.)
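The steps above can be tried end to end with a throwaway program. This is only a sketch: `count.awk` and `sample.html` are placeholder names, and the tiny program merely counts listingname lines so that it runs under any awk. For the script posted earlier in the thread, substitute `gawk`, since it relies on the three-argument match().

```shell
#!/bin/bash
# Sketch: save an awk program as a plain text file, then run it with -f.
workdir=$(mktemp -d)

# The program file is ordinary text; it does not need to be executable.
cat > "$workdir/count.awk" <<'EOF'
# Count the lines that contain a listingname span.
/class="listingname"/ { n++ }
END { print n + 0 }
EOF

# Made-up input standing in for the real HTML file.
printf '%s\n' \
    '<span class="listingname">Taco Bell</span>' \
    '<b>(555) 123-4567</b>' > "$workdir/sample.html"

# -f names the program file; the last argument is the input file.
awk -f "$workdir/count.awk" "$workdir/sample.html"
```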
It's kind of sloppy again, but this seems to do what I want...
Code:
#!/bin/bash
sudo tail -n 20 /var/log/httpd/access_log | grep servlet | sed 's/.*GET/http\:\/\/dexonline.com/' | sed 's/HTTP\/1.1\"\ 404\ 573//' | sed 's/\ //g' > /var/www/html/files/address;
j=1; for i in $(cat /var/www/html/files/address); do address=$i; wget -o /var/www/html/wget_log_phonesearch -O /var/www/html/files/phone.html -r $i; j=$(($j+1)); done;
grep -m 30 listingname /var/www/html/files/phone.html | sed 's/<span\ class=\"listingname\">//' | sed 's/<\/span>/<br\/>/' > /var/www/html/files/names;
grep -m 30 -B 2 '([[:alnum:]][[:alnum:]][[:alnum:]]) [[:alnum:]][[:alnum:]][[:alnum:]]-[[:alnum:]][[:alnum:]][[:alnum:]][[:alnum:]]' /var/www/html/files/phone.html | sed 's/<br/<br\//g' | sed 's/--/<br\/>/g' | sed 's/&\ //g' > /var/www/html/files/numbers;
echo "" > /var/www/html/files/tempnames;
echo "" > /var/www/html/files/results.inc;
g=1;l=4;
until [ $g -eq 30 ];
do head -n $g /var/www/html/files/names >> /var/www/html/files/tempnames; tail -n 1 /var/www/html/files/tempnames >> /var/www/html/files/results.inc; head -n $l /var/www/html/files/numbers | tail -n 4 >> /var/www/html/files/results.inc; l=$(($l+4));
let "g = $g + 1";
done;
This properly returns all the listings, preceded by their correct name. Woohoo!!!!
Thanks again for the help. I feel bad for not using the awk method you spent time coming up with, but I just couldn't stop tinkering around with grep, and then I had a moment of clarity and - poof! - I had it right.
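For what it's worth, the head/tail loop that pairs each name with its four lines might also be written as a single awk pass. This is a sketch, not a drop-in replacement: `interleave` is a made-up helper, and it assumes the numbers file always holds groups of exactly the given size, in the same order as the names.

```shell
#!/bin/bash
# Sketch: emit one name, then the next fixed-size group of lines, and repeat.
# $1: file with one name per line; $2: file of groups; $3: lines per group
interleave() {
    awk -v size="$3" '
        NR == FNR { names[FNR] = $0; next }          # first file: remember names
        (FNR - 1) % size == 0 { print names[++g] }   # start of a group: print its name
        { print }                                    # then the group line itself
    ' "$1" "$2"
}

# Made-up data: two names, two groups of four lines each.
n=$(mktemp); m=$(mktemp)
printf '%s\n' 'Taco Bell' 'Pizza Hut' > "$n"
printf '%s\n' a1 a2 a3 a4 b1 b2 b3 b4 > "$m"
interleave "$n" "$m" 4
```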
Still wish there was a way to get rid of the step wherein I request the nonexistent URL from my server in order to get said URL as a variable for wget... any expertise there?
Thanks again...
Last edited by jeffreybluml; 05-13-2005 at 10:18 AM.
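On the last question, one option might be to hand the request path to the script directly rather than reading it back out of access_log. The sketch below assumes whatever triggers the search can pass that path along; how that would happen on this particular server is not known from the thread, and both `build_url` and the example path are hypothetical.

```shell
#!/bin/bash
# Sketch: build the URL from a path given to the script, skipping the
# round trip through the server log. The example path is invented.
build_url() {
    printf 'http://dexonline.com%s\n' "$1"
}

url=$(build_url "/servlet/example?q=tacos")
echo "$url"
# "$url" would then be given to wget in place of the value read from
# /var/www/html/files/address in the script above.
```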