LinuxQuestions.org


druisgod 04-25-2011 09:30 AM

parsing out squid access log with awk and grep
 
I'm trying to recreate a simple script I wrote to parse the access.log and get a rough idea of the websites users are visiting on our corporate network. The issue I'm having is that I want to pull out any line from access.log that ends in .com/, .org/, .net/ or whatever, so I only see what the user entered into the address bar and drop pictures, js's and everything else, and log only that.

So what I do is:
Code:

awk '{print $8} | grep -e '[cong]|[ore]|[mgtv][/]'$

and nothing happens.

I know there is an easier way to do this with awk alone, . . . anyone?

Thx

kurumi 04-25-2011 09:55 AM

Code:

ruby -ne 'print if /\.(com|net|org)$/' access.log

druisgod 04-25-2011 10:09 AM

Quote:

Originally Posted by kurumi (Post 4335632)
Code:

ruby -ne 'print if /\.(com|net|org)$/' access.log

Didn't work. :( It doesn't output anything.

Here's the line that is being spit out from the access.log after I awk out the 7th field:

When I run the Ruby code, it prints nothing to stdout. I'm not familiar with Ruby; I assume it acts like awk(?)

kurumi 04-25-2011 07:11 PM

Quote:

Originally Posted by druisgod (Post 4335651)
Didn't work. :( It doesn't output anything.

Here's the line that is being spit out from the access.log after I awk out the 7th field:



When I run the Ruby code, it prints nothing to stdout. I'm not familiar with Ruby; I assume it acts like awk(?)

If the lines you want end in a slash, e.g. .com/, then add "\/" to the regex:
Code:

ruby -ne 'print if /\.(com|net|org)\/$/' access.log

colucix 04-26-2011 01:28 AM

The problem in the grep command
Code:

grep -e '[cong]|[ore]|[mgtv][/]'
is that the -e option only specifies the pattern and still uses basic regular expressions, so the pipe symbol is interpreted literally. You probably want the -E (uppercase E) option, which enables extended regular expressions. In any case, if you are using awk you don't really need to pipe the results into grep at all:
Code:

awk '$8 ~ /\.com|\.net|\.org/{print $8}'
which is similar to the ruby code suggested by kurumi! ;)
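
For completeness, the original awk-plus-grep pipeline could also be made to work by switching to extended regular expressions. A minimal sketch, assuming the URL really is in field 8 as in the first post (later posts suggest it is field 7) and that only lines ending in .com/, .net/ or .org/ are wanted:
Code:

# -E enables extended regular expressions, so the grouping and alternation work as intended
awk '{print $8}' access.log | grep -E '\.(com|net|org)/$'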

grail 04-26-2011 02:33 AM

So I am confused on 2 fronts here (easily done sometimes):

1. You start by referring to field 8 but then in post #3 you talk about the 7th field?

2. You state the following:
Quote:

user entered into the address bar and drop pictures, js's and everything else
But again in post #3 your output shows:

Maybe you could show some of the log so we can ascertain exactly which field you are referring to, and then which part of that field you are interested in?

druisgod 04-26-2011 06:33 AM

Guys, I appreciate all of the help! I'm sorry this has been a bit of a flustercluck from the beginning. I have solved it, and with your help! The access.log from squid looks like this:
Code:

1303632736.387    121 192.168.4.12 TCP_MISS/200 537 GET http://packages.linuxmint.com/dists/julia/Release.gpg - DIRECT/80.86.83.193 application/octet-stream
1303632736.501    249 192.168.4.12 TCP_REFRESH_HIT/304 304 GET http://archive.canonical.com/ubuntu/dists/maverick/Release.gpg - DIRECT/91.189.88.33 -
1303632736.515    246 192.168.4.12 TCP_REFRESH_HIT/304 405 GET http://security.ubuntu.com/ubuntu/dists/maverick-security/Release.gpg - DIRECT/91.189.92.166 -
1303632736.517    129 192.168.4.12 TCP_MISS/404 648 GET http://packages.linuxmint.com/dists/julia/import/i18n/Translation-en.bz2 - DIRECT/80.86.83.193 text/html
1303632736.520    249 192.168.4.12 TCP_REFRESH_HIT/304 397 GET http://archive.ubuntu.com/ubuntu/dists/maverick/Release.gpg - DIRECT/91.189.88.46 -
1303632736.545    275 192.168.4.12 TCP_REFRESH_HIT/304 308 GET http://packages.medibuntu.org/dists/maverick/Release.gpg - DIRECT/88.191.127.22 -
1303632736.613    112 192.168.4.12 TCP_MISS/404 666 GET http://archive.canonical.com/ubuntu/dists/maverick/partner/i18n/Translation-en.bz2 - DIRECT/91.189.88.33 text/html

for example. In this part of my script I'm parsing out the URL, which in my tests awk considers field 8, while by my count it's field 7; that was the source of my confusion. So what I am doing is trying to get a rough listing of the websites users have entered into the address bar, and drop everything else that squid logs, i.e. images, scripts, and redirects.

Because this is a corporate network, I want to ensure that users don't purposefully head to sites that aren't allowed at work, i.e. warez, pornography, etc. I don't fault anyone for being accidentally redirected, and yes, this can be subverted if a user goes to Google to search out porn or whatever, but it's an attempt at a bit of "loose" security. I do have DansGuardian installed, but I haven't decided to sit down and tweak it until it's perfect; I imagine that's a long and drawn-out process, and I'd rather not deal with the multiple calls of "Hey, I went here and I got this message". So this script pulls this stuff out so I can then break the data out by IP and logrotate the results daily using cron.
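
A minimal sketch of the kind of filter described above, assuming the URL is awk field 7, the log lives at /var/log/squid/access.log (the path is an assumption), and only requests that end at a site root such as .com/, .net/ or .org/ are kept; grouping by client IP and counting is also my addition:
Code:

# keep only bare "site root" requests (URL in $7 ending in .com/, .net/ or .org/),
# print client IP ($3) and URL, then count how often each pair occurs
awk '$7 ~ /\.(com|net|org)\/$/ {print $3, $7}' /var/log/squid/access.log | sort | uniq -c | sort -rn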

I appreciate the help, guys! I've been avoiding regex, awk, and sed for a while now, only using them minimally, and unfortunately I get confused.

Thanks again!

druisgod 04-26-2011 06:36 AM

Quote:

Originally Posted by grail (Post 4336474)
So I am confused on 2 fronts here (easily done sometimes):

1. You start by referring to field 8 but then in post #3 you talk about the 7th field?

2. You state the following:

But again in post #3 your output shows:


Maybe you could show some of the log so we can ascertain exactly which field you are referring to, and then which part of that field you are interested in?

Those results were "post-awk, pre-grep": after I used awk to print out the desired field from the log, I was going to grep out only the entries that ended in webaddress.com/ and drop all the webaddress.com/picone.jpg and webaddress.com/scriptone.js entries. Squid logs not only the site the user is going to, but everything that gets loaded as well. Thank you!
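
A sketch of that second step, assuming the URL is awk field 7 and that only bare site roots (scheme, host, and a single trailing slash) should survive the grep; the exact pattern is my guess at what was intended:
Code:

# print the URL field, then keep only entries like http://webaddress.com/ with no deeper path
awk '{print $7}' access.log | grep -E '^https?://[^/]+/$'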

grail 04-26-2011 10:40 AM

Well, there is no need to use grep if you are also using awk, as awk has most of that functionality built in.
colucix's example is the one I would go for if using awk, although it will give you the entire field, and your explanation still has me asking
whether you want it all or just up to .com, .net, etc., i.e. up to the first slash after http://.
If the latter is what you want, you can easily use split:
Code:

awk '$7 ~ /\.(com|net|org)/{split($7,f,"/");print f[3]}' file
This is based on your example above.
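
If a per-site tally is the end goal, the same split idea extends naturally; the sort/uniq counting below is my addition rather than something from the thread:
Code:

# pull the host part (f[3] after splitting the URL on "/") and count how often each site appears
awk '$7 ~ /\.(com|net|org)/ {split($7, f, "/"); print f[3]}' access.log | sort | uniq -c | sort -rn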

