LinuxQuestions.org
Welcome to the most active Linux Forum on the web.
Go Back   LinuxQuestions.org > Forums > Non-*NIX Forums > Programming
User Name
Password
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Notices

Reply
 
LinkBack Search this Thread
Old 08-15-2007, 12:55 AM   #1
Micro420
Senior Member
 
Registered: Aug 2003
Location: Berkeley, CA
Distribution: Mac OS X Leopard 10.6.2, Windows 2003 Server/Vista/7/XP/2000/NT/98, Ubuntux64, CentOS4.8/5.4
Posts: 2,986

Rep: Reputation: 45
Filtering a CSV file from web log with shell script?


I have been working on this project on and off for work for a few months and it's driving me nuts!

I basically have a semi-consistent CSV file from a Microsoft IIS web log. Here is a small snippet of what the log file looks like:
Code:
2007-01-01,23:30:00,/main/page-1.html,192.168.1.10,,
2007-01-01,23:31:00,/main/page-2.html,192.168.1.10,john@doe.com
2007-01-01,23:32:00,/main/page-3.html,192.168.1.10,
2007-01-01,23:32:10,/main/page-4.html,192.168.1.10,
2007-01-01,23:32:15,/main/page-5.html,192.168.1.10,
2007-01-01,23:35:00,/main/intro.html,192.168.1.25,,
2007-01-01,23:36:00,/main/blah.html,192.168.1.25,,
2007-01-01,23:37:00,/main/blah2.html,192.168.1.25,,
2007-01-01,23:39:00,/main/page-1.html,192.168.1.40,
2007-01-01,23:40:00,/main/page-2.html,192.168.1.40,
2007-01-01,23:41:00,/main/page-3.html,192.168.1.40,,,,blah,blah, jane@doe.com
2007-01-01,23:42:00,/main/blah3.html,192.168.1.25,,
2007-01-01,23:43:00,/main/page-4.html,192.168.1.40,
2007-01-01,23:44:00,/main/page-3.html,192.168.1.10,,,,,jack@doe.com
2007-01-01,23:45:00,/main/page-5.html,192.168.1.10,
2007-01-01,23:46:00,/main/page-9.html,192.168.1.10,,jack@doe.com
What I would like to see is this:
Code:
2007-01-01,23:30:00,/main/page-1.html,192.168.1.10,john@doe.com
2007-01-01,23:32:15,/main/page-5.html,192.168.1.10,john@doe.com
2007-01-01,23:39:00,/main/page-1.html,192.168.1.40,jane@doe.com
2007-01-01,23:43:00,/main/page-4.html,192.168.1.40,jane@doe.com
2007-01-01,23:44:00,/main/page-3.html,192.168.1.10,jack@doe.com
2007-01-01,23:46:00,/main/page-9.html,192.168.1.10,jack@doe.com
As you can see from above, it took the first and last page of visit from the user. It also automatically inserted the email address into the missing 4th field. Note the bold line where another user "cut" in during the same time with "jane@doe.com". Also, it so happens that "john@doe.com" and "jack@doe.com" are roommates so they have the same IP address, but different email logins. If an IP address has no email address associated with it, then I don't need to see it.

Is this possible to fully automate in a shell script?

Last edited by Micro420; 08-15-2007 at 01:00 AM.
 
Old 08-15-2007, 03:34 AM   #2
ocicat
Member
 
Registered: May 2007
Posts: 207

Rep: Reputation: 46
Quote:
Originally Posted by Micro420 View Post
Is this possible to fully automate in a shell script?
If you can fully enumerate all the edge cases, then yes, you can rearrange the fields to your liking. Although you can probably do everything in sed, it most likely is simpler to do in awk. The following may help provide some insight into what you need to do break down each line into individual fields:

http://www.gnu.org/manual/gawk/html_...ield-Splitting
 
Old 08-15-2007, 10:19 AM   #3
Micro420
Senior Member
 
Registered: Aug 2003
Location: Berkeley, CA
Distribution: Mac OS X Leopard 10.6.2, Windows 2003 Server/Vista/7/XP/2000/NT/98, Ubuntux64, CentOS4.8/5.4
Posts: 2,986

Original Poster
Rep: Reputation: 45
I have actually been using sed and awk, but my understanding of awk is very slim. I can post my shell script later when I am work so you guys can see what I have been able to do and not do.
 
Old 08-16-2007, 11:48 PM   #4
Micro420
Senior Member
 
Registered: Aug 2003
Location: Berkeley, CA
Distribution: Mac OS X Leopard 10.6.2, Windows 2003 Server/Vista/7/XP/2000/NT/98, Ubuntux64, CentOS4.8/5.4
Posts: 2,986

Original Poster
Rep: Reputation: 45
To recap, here's what my CSV files looks like
Code:
2007-01-01,23:30:00,/main/page-1.html,192.168.1.10,,
2007-01-01,23:31:00,/main/page-2.html,192.168.1.10,john@doe.com
2007-01-01,23:32:00,/main/page-3.html,192.168.1.10,
2007-01-01,23:32:10,/main/page-4.html,192.168.1.10,
2007-01-01,23:32:15,/main/page-5.html,192.168.1.10,
2007-01-01,23:35:00,/main/intro.html,192.168.1.25,,
2007-01-01,23:36:00,/main/blah.html,192.168.1.25,,
2007-01-01,23:37:00,/main/blah2.html,192.168.1.25,,
2007-01-01,23:39:00,/main/page-1.html,192.168.1.40,
2007-01-01,23:40:00,/main/page-2.html,192.168.1.40,
2007-01-01,23:41:00,/main/page-3.html,192.168.1.40,,,,blah,blah, jane@doe.com
2007-01-01,23:42:00,/main/blah3.html,192.168.1.25,,
2007-01-01,23:43:00,/main/page-4.html,192.168.1.40,
2007-01-01,23:44:00,/main/page-3.html,192.168.1.10,,,,,jack@doe.com
2007-01-01,23:45:00,/main/page-5.html,192.168.1.10,
2007-01-01,23:46:00,/main/page-9.html,192.168.1.10,,jack@doe.com
Okay, so here is what I have done so far.

My approach:
step 1) extract the IP address of the CSV file which has an email address AND "page-" in the URL since those are the two main things I am looking for
step 2) run this iplist against the CSV to further filter the list and somehow stick the email address at the end
step 3) somehow take the first and last line per IP address and VOILA - all done.

Code:
#!/bin/sh
# Step 1 & 2
# /home/user/logs
for i in *.csv
do
awk '{FS=","} $3 ~ "/page-[0-9]*" {print $0}' $i | awk '{FS=","} "/@/" && $4 ~ "[0-9]*\.[0-9]*\.[0-9]*\.[0-9]*" {print $4}' | uniq -d >> iplist.txt
done

#step 3 - I don't know how to put this all together
#run the IP list against the CSV files
for k in *.csv
cat $k | grep -f iplist.txt | \

#search for "@" in each field per IP address so I know an email exists and store it in variable "email"
email = `gawk '
BEGIN{FS=","; OFS=","};
{ 
  for(i = 5; i <= NF; i++)
  {
    if($i ~ ".*\@.*\.")
    {
      print $1,$2,$3,$4,$i
      break
    }
  }
}'` \ |

cat $email >> finished.csv
done

#somehow magically print only the 1st and last line of IP address.  I know this chops off the ENTIRE list so this goes where? :(
sed -n '1p;$p'

#the end
And this is what it magically looks like in the end.
Code:
2007-01-01,23:30:00,/main/page-1.html,192.168.1.10,john@doe.com
2007-01-01,23:32:15,/main/page-5.html,192.168.1.10,john@doe.com
2007-01-01,23:39:00,/main/page-1.html,192.168.1.40,jane@doe.com
2007-01-01,23:43:00,/main/page-4.html,192.168.1.40,jane@doe.com
2007-01-01,23:44:00,/main/page-3.html,192.168.1.10,jack@doe.com
2007-01-01,23:46:00,/main/page-9.html,192.168.1.10,jack@doe.com

Last edited by Micro420; 08-17-2007 at 12:06 AM.
 
Old 08-17-2007, 10:52 AM   #5
schneidz
Senior Member
 
Registered: May 2005
Location: boston, usa
Distribution: fc-15/ fc-19-live-usb/ aix
Posts: 3,560

Rep: Reputation: 520Reputation: 520Reputation: 520Reputation: 520Reputation: 520Reputation: 520
not sure if this will work universally for your needs:
Code:
awk -F , '{print $1","$2","$3","$NF}' ip.lst | grep @
 
Old 08-21-2007, 12:51 AM   #6
Micro420
Senior Member
 
Registered: Aug 2003
Location: Berkeley, CA
Distribution: Mac OS X Leopard 10.6.2, Windows 2003 Server/Vista/7/XP/2000/NT/98, Ubuntux64, CentOS4.8/5.4
Posts: 2,986

Original Poster
Rep: Reputation: 45
Unfortunately that won't work because the email field isn't always in the last line in the original CSV file.

I somehow need a way to hold the email in a temporary variable/file, and then stick it in the end to the corresponding IP address. Any ideas?

Last edited by Micro420; 08-21-2007 at 01:32 AM.
 
Old 08-21-2007, 09:07 AM   #7
schneidz
Senior Member
 
Registered: May 2005
Location: boston, usa
Distribution: fc-15/ fc-19-live-usb/ aix
Posts: 3,560

Rep: Reputation: 520Reputation: 520Reputation: 520Reputation: 520Reputation: 520Reputation: 520
this is quick-and-dirty but i think it'll work:
Code:
grep @ ip.lst | awk -F , '{print $1","$2","$3","}' > grep-awk.out

i=1
max=`cat ip.lst | wc -l`
# echo i = $i - max = $max
while [ $i -le $max ]
do
sed -n "$i"p ip.lst | tr , "\n" | grep @ >> email.lst
# echo $i - $email
i=`expr $i + 1`
done

paste -d '\0' grep-awk.out email.lst
 
Old 08-22-2007, 12:13 AM   #8
Micro420
Senior Member
 
Registered: Aug 2003
Location: Berkeley, CA
Distribution: Mac OS X Leopard 10.6.2, Windows 2003 Server/Vista/7/XP/2000/NT/98, Ubuntux64, CentOS4.8/5.4
Posts: 2,986

Original Poster
Rep: Reputation: 45
I give up with this project. I posted an ad on Craigslist hoping that someone can finish this project for me since my company will pay for it. I probably spent over 20-30 hours on this and I haven't gotten anywhere! So frustrating.

Not sure if this is okay to ask, but if anyone is interested PM me.
 
Old 08-22-2007, 03:13 AM   #9
chrism01
Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Centos 6.5
Posts: 16,086

Rep: Reputation: 1995Reputation: 1995Reputation: 1995Reputation: 1995Reputation: 1995Reputation: 1995Reputation: 1995Reputation: 1995Reputation: 1995Reputation: 1995Reputation: 1995
So, re-reading your OP, you want to add an email to a line if you've previously seen a line with the same IP and an email.
You also change the associated email for all subsequent lines if a new email appears (see john & jack both using ...10).
If you never see an email for a given IP, you don't want to see that line at all in the output.
IOW, only lines that (after checking) have an email are reported.
Is this correct?
Personally I'd use Perl. Does it have to be shell?
 
  


Reply


Thread Tools Search this Thread
Search this Thread:

Advanced Search

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off
Trackbacks are Off
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
shell script to read input from csv ip addresses? kr0m3 Programming 3 07-21-2007 08:51 AM
looking for a perl script to convert html table data into a csv file swiftguy121 Linux - Software 2 04-25-2007 07:28 PM
Shell Script For Myspace Log in condonm Programming 1 02-08-2007 06:34 AM
bash script - remove header row from csv file pljvaldez Programming 5 08-30-2006 11:05 AM
Shell script to read from csv file hendemeg Programming 1 05-11-2004 08:23 PM


All times are GMT -5. The time now is 11:05 AM.

Main Menu
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
identi.ca: @linuxquestions
Facebook: linuxquestions Google+: linuxquestions
Open Source Consulting | Domain Registration