LinuxQuestions.org
Old 04-25-2011, 09:30 AM   #1
druisgod
Member
 
Registered: Jun 2004
Location: Maine
Distribution: LFS Mint OS, LFS, CENTos,
Posts: 119

Rep: Reputation: 18
parsing out squid access log with awk and grep


I'm trying to recreate a simple script I wrote to parse access.log and get a rough idea of the websites that users are visiting on our corporate network. The issue I'm having is that I want to pull out any line from access.log that ends in .com/, .org/, .net/ or whatever, so that I see only what the user entered into the address bar, drop pictures, JavaScript files and everything else, and log only this.

So what I do is:
awk '{print $8} | grep -e '[cong]|[ore]|[mgtv][/]'$
and nothing happens.

I know there is an easier way to do this with awk alone... anyone?

Thx
 
Old 04-25-2011, 09:55 AM   #2
kurumi
Member
 
Registered: Apr 2010
Posts: 228

Rep: Reputation: 53
Code:
ruby -ne 'print if /\.(com|net|org)$/' access.log
 
Old 04-25-2011, 10:09 AM   #3
druisgod
Member
 
Registered: Jun 2004
Location: Maine
Distribution: LFS Mint OS, LFS, CENTos,
Posts: 119

Original Poster
Rep: Reputation: 18
Quote:
Originally Posted by kurumi View Post
Code:
ruby -ne 'print if /\.(com|net|org)$/' access.log
Didn't work. It doesn't output anything.

Here's the line that is being spit out from the access.log after I awk out the 7th field:

When I run the Ruby code, it prints nothing to stdout. I'm not familiar with Ruby; I assume it is acting like awk(?)

Last edited by druisgod; 04-25-2011 at 10:13 AM. Reason: more info
 
Old 04-25-2011, 07:11 PM   #4
kurumi
Member
 
Registered: Apr 2010
Posts: 228

Rep: Reputation: 53
Quote:
Originally Posted by druisgod View Post
Didn't work. It doesn't output anything.

Here's the line that is being spit out from the access.log after I awk out the 7th field:



When I run the Ruby code, it prints out nothing to std out. I'm not familiar with Ruby, I assume it is acting like awk(?)
If the lines you want end in a slash, e.g. .com/, then add "\/" to the regex:
Code:
ruby -ne 'print if /\.(com|net|org)\/$/' access.log
 
Old 04-26-2011, 01:28 AM   #5
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1983
The problem with the grep command
Code:
grep -e '[cong]|[ore]|[mgtv][/]'
is that the -e option only introduces the pattern; it does not enable extended regular expressions, so the pipe symbol is interpreted literally. You probably want the -E (uppercase E) option. In fact, with awk you don't really need to pipe the results into grep at all:
Code:
awk '$8 ~ /\.com|\.net|\.org/{print $8}' access.log
which is similar to the ruby code suggested by kurumi!
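
To make the -e versus -E difference concrete, here is a minimal sketch. The URLs are made-up placeholders standing in for the awk output, not lines from the real log, and the pattern uses a grouped alternation instead of the bracket expressions, since each bracket expression matches only a single character.

```shell
# Hypothetical sample URLs, one per line (placeholders, not from the real log).
urls='http://example.com/
http://example.com/logo.png
http://example.org/
http://example.net/app.js'

# Basic regex (plain -e): "(", "|" and ")" are ordinary characters,
# so no line matches. grep -c prints the count; "|| true" only hides
# grep's non-zero exit status when the count is 0.
bre_count=$(printf '%s\n' "$urls" | grep -c -e '\.(com|net|org)/$' || true)

# Extended regex (-E): "|" means alternation and "(...)" groups.
ere_matches=$(printf '%s\n' "$urls" | grep -E '\.(com|net|org)/$')

echo "basic regex matches: $bre_count"
printf '%s\n' "$ere_matches"
```

With -E only the two URLs that end in the bare domain plus a slash are kept; the image and script URLs are dropped.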

Last edited by colucix; 04-26-2011 at 01:51 AM. Reason: syntax
 
Old 04-26-2011, 02:33 AM   #6
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,999

Rep: Reputation: 3190
So I am confused on 2 fronts here (easily done sometimes):

1. You start by referring to field 8 but then in post #3 you talk about the 7th field?

2. You state the following:
Quote:
user entered into the address bar and drop pictures, js's and everything else
But again in post #3 your output shows:
Maybe you could show some of the log so we can ascertain exactly which field you are referring to and which part of that field you are interested in?
 
Old 04-26-2011, 06:33 AM   #7
druisgod
Member
 
Registered: Jun 2004
Location: Maine
Distribution: LFS Mint OS, LFS, CENTos,
Posts: 119

Original Poster
Rep: Reputation: 18
Guys, I appreciate all of the help! I'm sorry this has been a bit of a flustercluck from the beginning. I have solved it, and with your help! The access.log from squid looks like this:
Code:
1303632736.387    121 192.168.4.12 TCP_MISS/200 537 GET http://packages.linuxmint.com/dists/julia/Release.gpg - DIRECT/80.86.83.193 application/octet-stream
1303632736.501    249 192.168.4.12 TCP_REFRESH_HIT/304 304 GET http://archive.canonical.com/ubuntu/dists/maverick/Release.gpg - DIRECT/91.189.88.33 -
1303632736.515    246 192.168.4.12 TCP_REFRESH_HIT/304 405 GET http://security.ubuntu.com/ubuntu/dists/maverick-security/Release.gpg - DIRECT/91.189.92.166 -
1303632736.517    129 192.168.4.12 TCP_MISS/404 648 GET http://packages.linuxmint.com/dists/julia/import/i18n/Translation-en.bz2 - DIRECT/80.86.83.193 text/html
1303632736.520    249 192.168.4.12 TCP_REFRESH_HIT/304 397 GET http://archive.ubuntu.com/ubuntu/dists/maverick/Release.gpg - DIRECT/91.189.88.46 -
1303632736.545    275 192.168.4.12 TCP_REFRESH_HIT/304 308 GET http://packages.medibuntu.org/dists/maverick/Release.gpg - DIRECT/88.191.127.22 -
1303632736.613    112 192.168.4.12 TCP_MISS/404 666 GET http://archive.canonical.com/ubuntu/dists/maverick/partner/i18n/Translation-en.bz2 - DIRECT/91.189.88.33 text/html
for example. In this part of my script I'm parsing out the URL, which in my tests awk treats as field 8; by my count it's 7, and that was my confusion.

What I am trying to do is get a rough listing of the websites that users have entered into the address bar and drop everything else that squid logs, i.e. images, scripts, and redirects. Because this is a corporate network, I want to make sure that users don't purposefully head to sites that aren't allowed at work, i.e. warez, pornography, etc. I don't fault anyone for being accidentally redirected, and yes, this can be subverted if a user goes to Google to search for porn or whatever, but this is an attempt at a bit of "loose" security.

I do have DansGuardian installed, but I haven't yet sat down to tweak it until it's perfect. I imagine that's a long and drawn-out process, and I'd rather not deal with the repeated calls of "Hey, I went here and I got this message". So this script pulls this data so that I can then break it out by IP and logrotate the results daily using cron.

I appreciate the help, guys! I've been avoiding regex, awk, and sed for a while now, only using them minimally, and unfortunately I get confused.

Thanks again!
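
One quick way to settle the field-number question is to let awk print every field with its index. The sketch below assumes the first sample entry above is a single line and that awk splits on whitespace (its default). With that entry as pasted the URL comes out as field 7; a log written in a different format could number it differently.

```shell
# First sample entry from the log above, as a single line.
line='1303632736.387    121 192.168.4.12 TCP_MISS/200 537 GET http://packages.linuxmint.com/dists/julia/Release.gpg - DIRECT/80.86.83.193 application/octet-stream'

# Print each whitespace-separated field prefixed with its number.
fields=$(printf '%s\n' "$line" | awk '{ for (i = 1; i <= NF; i++) print i ": " $i }')
printf '%s\n' "$fields"

# Report only the index of the field that starts with http://
url_field=$(printf '%s\n' "$line" | awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^http:\/\//) print i }')
echo "URL is in field $url_field"
```

Running the same loop against a line taken straight from the real access.log shows which number to use in the rest of the script.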
 
Old 04-26-2011, 06:36 AM   #8
druisgod
Member
 
Registered: Jun 2004
Location: Maine
Distribution: LFS Mint OS, LFS, CENTos,
Posts: 119

Original Poster
Rep: Reputation: 18
Quote:
Originally Posted by grail View Post
So I am confused on 2 fronts here (easily done some times):

1. You start by referring to field 8 but then in post #3 you talk about the 7th field?

2. You state the following:

But again in post #3 your output shows:


Maybe you could show some of the log so we can ascertain exactly which field you are referring to and then which part of that field are you interested in?
Those results were "post-awk, pre-grep": after I used awk to print the desired field from the log, I wanted to grep out only the things that ended in webaddress.com/ and drop all webaddress.com/picone.jpg and webaddress.com/scriptone.js. Squid logs not only the site the user is going to, but everything that gets loaded as well. Thank you!
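
The "keep webaddress.com/, drop webaddress.com/picone.jpg" filter can be sketched in awk alone. The log lines below are invented for illustration and only mimic the layout of the sample in post #7, with the client address in field 3 and the URL in field 7; adjust the field numbers if the real log differs.

```shell
# Hypothetical log lines in the same layout as the sample in post #7.
log='1303632736.387 121 192.168.4.12 TCP_MISS/200 537 GET http://webaddress.com/ - DIRECT/192.0.2.10 text/html
1303632736.501 249 192.168.4.12 TCP_MISS/200 304 GET http://webaddress.com/picone.jpg - DIRECT/192.0.2.10 image/jpeg
1303632736.515 246 192.168.4.13 TCP_MISS/200 405 GET http://webaddress.com/scriptone.js - DIRECT/192.0.2.10 application/javascript
1303632736.520 249 192.168.4.13 TCP_MISS/200 397 GET http://another.org/ - DIRECT/192.0.2.20 text/html'

# Keep only requests whose URL ends in .com/, .net/ or .org/ and print
# the client address (field 3) next to the URL (field 7).
top_level=$(printf '%s\n' "$log" | awk '$7 ~ /\.(com|net|org)\/$/ {print $3, $7}')

printf '%s\n' "$top_level"
```

Because the test is applied to the URL field rather than the whole line, it does not matter that each log line ends with the content type instead of the URL.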
 
Old 04-26-2011, 10:40 AM   #9
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,999

Rep: Reputation: 3190
Well, there is no need to use grep if you are also using awk, as it has most of that functionality built in.
colucix's example is the one I would go for if using awk, although it will give you the entire field, and your explanation still has me asking whether you want it all or just the part up to .com, .net, etc., i.e. up to the first slash after http://.
If the latter is desired, you could easily use split:
Code:
awk '$7 ~ /\.(com|net|org)/{split($7,f,"/");print f[3]}' file
This is based on your example above.
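
Building on the same split idea, a rough per-client summary could be sketched as follows. The first three lines are taken from the sample log in post #7; the fourth line, with client 192.168.4.13, is invented so that the output shows more than one address. The array name hits is an arbitrary choice.

```shell
# Three entries from the sample log plus one invented entry (last line).
log='1303632736.387 121 192.168.4.12 TCP_MISS/200 537 GET http://packages.linuxmint.com/dists/julia/Release.gpg - DIRECT/80.86.83.193 application/octet-stream
1303632736.517 129 192.168.4.12 TCP_MISS/404 648 GET http://packages.linuxmint.com/dists/julia/import/i18n/Translation-en.bz2 - DIRECT/80.86.83.193 text/html
1303632736.520 249 192.168.4.12 TCP_REFRESH_HIT/304 397 GET http://archive.ubuntu.com/ubuntu/dists/maverick/Release.gpg - DIRECT/91.189.88.46 -
1303632736.600 100 192.168.4.13 TCP_MISS/200 512 GET http://archive.ubuntu.com/ubuntu/dists/maverick/Release.gpg - DIRECT/91.189.88.46 -'

# Split the URL on "/" so that f[3] is the host name, then count
# requests per "client host" pair. The sort only makes the order stable.
per_client=$(printf '%s\n' "$log" |
  awk '{ split($7, f, "/"); hits[$3 " " f[3]]++ }
       END { for (k in hits) print hits[k], k }' |
  LC_ALL=C sort -k2,2 -k3,3)

printf '%s\n' "$per_client"
```

Each output line is a request count followed by the client address and the host, which is one possible starting point for a daily per-IP report.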
 
  


All times are GMT -5.
