[SOLVED] parsing out squid access log with awk and grep
I'm trying to recreate a simple script I wrote to parse the access.log and get a rough idea of the websites users are visiting on our corp network. The issue I'm having is that I want to pull out any line from access.log that ends in .com/, .org/, .net/, or whatever, so I only see what the user entered into the address bar, and drop pictures, js files, and everything else, logging only this.
so what I do is:
Code:
awk '{print $8}' access.log | grep -e '[cong]|[ore]|[mgtv][/]'$
and nothing happens.
I know there is an easier way to do this with awk alone . . . anyone?
The problem is that the -e option does not enable extended regular expressions, so the pipe symbol is interpreted literally. Maybe you want the -E (uppercase E) option. Indeed, using awk you don't really need to pipe the results into grep:
Code:
awk '$8 ~ /\.com|\.net|\.org/{print $8}'
which is similar to the ruby code suggested by kurumi!
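A variant of the same idea, anchored so that only bare site URLs (ending in the TLD plus a slash) survive, which is what the original post asked for. This is only a sketch: it assumes the URL sits in field 7 of the native squid log format (the sample entries shown later in the thread have it there), and the two log lines here are made-up examples.

```shell
# sample entries (hypothetical): one bare site hit, one asset fetch
printf '%s\n' \
  '1303632736.387 121 192.168.4.12 TCP_MISS/200 537 GET http://example.com/ - DIRECT/1.2.3.4 text/html' \
  '1303632736.388 110 192.168.4.12 TCP_MISS/200 900 GET http://example.com/picone.jpg - DIRECT/1.2.3.4 image/jpeg' |
awk '$7 ~ /\.(com|net|org)\/$/ {print $7}'
# prints only http://example.com/ ; the .jpg line is dropped
```

The trailing `$` in the awk regex is what drops the asset fetches: a URL like http://example.com/picone.jpg no longer ends in ".com/".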
Last edited by colucix; 04-26-2011 at 01:51 AM.
Reason: syntax
Maybe you could show some of the log so we can ascertain exactly which field you are referring to and then which part of that field are you interested in?
Guys, I appreciate all of the help! I'm sorry this has been a bit of a flustercluck from the beginning. I have solved it, and with your help! The access.log from squid looks like this:
Code:
1303632736.387 121 192.168.4.12 TCP_MISS/200 537 GET http://packages.linuxmint.com/dists/julia/Release.gpg - DIRECT/80.86.83.193 application/octet-stream
1303632736.501 249 192.168.4.12 TCP_REFRESH_HIT/304 304 GET http://archive.canonical.com/ubuntu/dists/maverick/Release.gpg - DIRECT/91.189.88.33 -
1303632736.515 246 192.168.4.12 TCP_REFRESH_HIT/304 405 GET http://security.ubuntu.com/ubuntu/dists/maverick-security/Release.gpg - DIRECT/91.189.92.166 -
1303632736.517 129 192.168.4.12 TCP_MISS/404 648 GET http://packages.linuxmint.com/dists/julia/import/i18n/Translation-en.bz2 - DIRECT/80.86.83.193 text/html
1303632736.520 249 192.168.4.12 TCP_REFRESH_HIT/304 397 GET http://archive.ubuntu.com/ubuntu/dists/maverick/Release.gpg - DIRECT/91.189.88.46 -
1303632736.545 275 192.168.4.12 TCP_REFRESH_HIT/304 308 GET http://packages.medibuntu.org/dists/maverick/Release.gpg - DIRECT/88.191.127.22 -
1303632736.613 112 192.168.4.12 TCP_MISS/404 666 GET http://archive.canonical.com/ubuntu/dists/maverick/partner/i18n/Translation-en.bz2 - DIRECT/91.189.88.33 text/html
for example. In this part of my script I'm parsing out the URL, which through testing awk considers field 8, though by my count it's field 7; that was my confusion.
What I'm trying to do is get a rough listing of the websites users have entered into the address bar and drop everything else that squid logs, i.e. images, scripts, and redirects. Because this is a corporate network, I want to ensure that users don't purposefully head to sites that aren't allowed at work, e.g. warez, pornography, etc. I don't fault anyone for being accidentally redirected, and yes, this can be subverted if a user goes to Google to search for porn or whatever, but it's an attempt at a bit of "loose" security. I do have DansGuardian installed, but I haven't sat down to tweak it until it's perfect; I imagine that's a long and drawn-out process, and I'd rather not deal with multiple calls of "Hey, I went here and I got this message". So this script pulls this data so I can then break it out by IP and logrotate the results daily via cron.
I appreciate the help, guys! I've been avoiding regex, awk, and sed for a while now, only using them minimally, and unfortunately I get confused.
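The "break it out by IP" step described above could be sketched roughly like this. This is not the poster's actual script: the field numbers ($3 for the client IP, $7 for the URL) are assumed from the log sample, and the entries are made-up examples.

```shell
# hypothetical follow-up: count bare-site hits per client IP
printf '%s\n' \
  '1303632736.387 121 192.168.4.12 TCP_MISS/200 537 GET http://example.com/ - DIRECT/1.2.3.4 text/html' \
  '1303632736.400 130 192.168.4.12 TCP_MISS/200 537 GET http://example.com/ - DIRECT/1.2.3.4 text/html' \
  '1303632736.500 140 192.168.4.13 TCP_MISS/200 537 GET http://example.org/ - DIRECT/5.6.7.8 text/html' |
awk '$7 ~ /\.(com|net|org)\/$/ {print $3, $7}' | sort | uniq -c | sort -rn
# most-visited (IP, site) pairs first: 2 hits for 192.168.4.12, 1 for 192.168.4.13
```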
So I am confused on 2 fronts here (easily done sometimes):
1. You start by referring to field 8 but then in post #3 you talk about the 7th field?
2. You state the following:
But again in post #3 your output shows:
Maybe you could show some of the log so we can ascertain exactly which field you are referring to and then which part of that field are you interested in?
Those results were "post awk, pre-grep", so after I used awk to print the desired field from the log, I was going to grep out only things that ended in webaddress.com/ and drop all webaddress.com/picone.jpg and webaddress.com/scriptone.js. Squid logs not only the site the user is going to, but everything that gets loaded as well. Thank you!
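A sketch of that "awk then grep" pipeline, assuming the URL is field 7 as in the log sample above (note the uppercase -E, which enables the extended-regex alternation that tripped up the original attempt):

```shell
# print the URL field, then keep only entries that end right after the TLD
printf '%s\n' \
  '1303632736.387 121 192.168.4.12 TCP_MISS/200 537 GET http://example.com/ - DIRECT/1.2.3.4 text/html' \
  '1303632736.388 110 192.168.4.12 TCP_MISS/200 900 GET http://example.com/scriptone.js - DIRECT/1.2.3.4 text/javascript' |
awk '{print $7}' | grep -E '\.(com|net|org)/$'
# prints only http://example.com/
```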
Well, there is no need to use grep if you are also using awk, as awk has most of that functionality built in.
colucix's example is the one I would go for, if using awk, although it will give you the entire field, and your explanation still has me asking
whether you want it all or just up to .com, .net, etc., i.e. up to the first slash after http://.
If the latter is desired, you could easily use split:
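The split example itself is missing from the post, so the following is only a guess at what was intended. Splitting the URL on "/" makes the hostname the third piece, because http://host/path has empty text between the two slashes; the field number and the sample line are assumptions based on the log posted earlier.

```shell
# for http://host/path, split on "/" gives a[1]="http:", a[2]="", a[3]=hostname
printf '%s\n' \
  '1303632736.515 246 192.168.4.12 TCP_REFRESH_HIT/304 405 GET http://security.ubuntu.com/ubuntu/Release.gpg - DIRECT/91.189.92.166 -' |
awk '{ n = split($7, a, "/"); print a[3] }'
# prints security.ubuntu.com
```

This prints just the hostname for every request, bare site hit or not, so it answers the "only up to the first slash" reading of the question.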