Filtering a CSV file from web log with shell script?

Micro420 · 08-15-2007, 12:55 AM

I have been working on this project on and off for work for a few months and it's driving me nuts!

I basically have a semi-consistent CSV file from a Microsoft IIS web log. Here is a small snippet of what the log file looks like:

Code:

2007-01-01,23:30:00,/main/page-1.html,192.168.1.10,,
2007-01-01,23:31:00,/main/page-2.html,192.168.1.10,john@doe.com
2007-01-01,23:32:00,/main/page-3.html,192.168.1.10,
2007-01-01,23:32:10,/main/page-4.html,192.168.1.10,
2007-01-01,23:32:15,/main/page-5.html,192.168.1.10,
2007-01-01,23:35:00,/main/intro.html,192.168.1.25,,
2007-01-01,23:36:00,/main/blah.html,192.168.1.25,,
2007-01-01,23:37:00,/main/blah2.html,192.168.1.25,,
2007-01-01,23:39:00,/main/page-1.html,192.168.1.40,
2007-01-01,23:40:00,/main/page-2.html,192.168.1.40,
2007-01-01,23:41:00,/main/page-3.html,192.168.1.40,,,,blah,blah, jane@doe.com
2007-01-01,23:42:00,/main/blah3.html,192.168.1.25,,
2007-01-01,23:43:00,/main/page-4.html,192.168.1.40,
2007-01-01,23:44:00,/main/page-3.html,192.168.1.10,,,,,jack@doe.com
2007-01-01,23:45:00,/main/page-5.html,192.168.1.10,
2007-01-01,23:46:00,/main/page-9.html,192.168.1.10,,jack@doe.com

What I would like to see is this:

Code:

2007-01-01,23:30:00,/main/page-1.html,192.168.1.10,john@doe.com
2007-01-01,23:32:15,/main/page-5.html,192.168.1.10,john@doe.com
2007-01-01,23:39:00,/main/page-1.html,192.168.1.40,jane@doe.com
2007-01-01,23:43:00,/main/page-4.html,192.168.1.40,jane@doe.com
2007-01-01,23:44:00,/main/page-3.html,192.168.1.10,jack@doe.com
2007-01-01,23:46:00,/main/page-9.html,192.168.1.10,jack@doe.com

As you can see from above, it took the first and last page of visit from the user. It also automatically inserted the email address into the missing 4th field. Note the bold line where another user "cut" in during the same time with "jane@doe.com". Also, it so happens that "john@doe.com" and "jack@doe.com" are roommates so they have the same IP address, but different email logins. If an IP address has no email address associated with it, then I don't need to see it.

Is this possible to fully automate in a shell script?

ocicat · 08-15-2007, 03:34 AM

Quote:

Originally Posted by Micro420

Is this possible to fully automate in a shell script?

If you can fully enumerate all the edge cases, then yes, you can rearrange the fields to your liking. Although you can probably do everything in sed, it most likely is simpler to do in awk. The following may help provide some insight into what you need to do break down each line into individual fields:

http://www.gnu.org/manual/gawk/html_...ield-Splitting

Micro420 · 08-15-2007, 10:19 AM

I have actually been using sed and awk, but my understanding of awk is very slim. I can post my shell script later when I am work so you guys can see what I have been able to do and not do.

Micro420 · 08-16-2007, 11:48 PM

To recap, here's what my CSV files looks like

Code:

2007-01-01,23:30:00,/main/page-1.html,192.168.1.10,,
2007-01-01,23:31:00,/main/page-2.html,192.168.1.10,john@doe.com
2007-01-01,23:32:00,/main/page-3.html,192.168.1.10,
2007-01-01,23:32:10,/main/page-4.html,192.168.1.10,
2007-01-01,23:32:15,/main/page-5.html,192.168.1.10,
2007-01-01,23:35:00,/main/intro.html,192.168.1.25,,
2007-01-01,23:36:00,/main/blah.html,192.168.1.25,,
2007-01-01,23:37:00,/main/blah2.html,192.168.1.25,,
2007-01-01,23:39:00,/main/page-1.html,192.168.1.40,
2007-01-01,23:40:00,/main/page-2.html,192.168.1.40,
2007-01-01,23:41:00,/main/page-3.html,192.168.1.40,,,,blah,blah, jane@doe.com
2007-01-01,23:42:00,/main/blah3.html,192.168.1.25,,
2007-01-01,23:43:00,/main/page-4.html,192.168.1.40,
2007-01-01,23:44:00,/main/page-3.html,192.168.1.10,,,,,jack@doe.com
2007-01-01,23:45:00,/main/page-5.html,192.168.1.10,
2007-01-01,23:46:00,/main/page-9.html,192.168.1.10,,jack@doe.com

Okay, so here is what I have done so far.

My approach:
step 1) extract the IP address of the CSV file which has an email address AND "page-" in the URL since those are the two main things I am looking for
step 2) run this iplist against the CSV to further filter the list and somehow stick the email address at the end
step 3) somehow take the first and last line per IP address and VOILA - all done.

Code:

#!/bin/sh
# Step 1 & 2
# /home/user/logs
for i in *.csv
do
awk '{FS=","} $3 ~ "/page-[0-9]*" {print $0}' $i | awk '{FS=","} "/@/" && $4 ~ "[0-9]*\.[0-9]*\.[0-9]*\.[0-9]*" {print $4}' | uniq -d >> iplist.txt
done

#step 3 - I don't know how to put this all together
#run the IP list against the CSV files
for k in *.csv
cat $k | grep -f iplist.txt | \

#search for "@" in each field per IP address so I know an email exists and store it in variable "email"
email = `gawk '
BEGIN{FS=","; OFS=","};
{ 
  for(i = 5; i <= NF; i++)
  {
    if($i ~ ".*\@.*\.")
    {
      print $1,$2,$3,$4,$i
      break
    }
  }
}'` \ |

cat $email >> finished.csv
done

#somehow magically print only the 1st and last line of IP address.  I know this chops off the ENTIRE list so this goes where? :(
sed -n '1p;$p'

#the end

And this is what it magically looks like in the end.

Code:

2007-01-01,23:30:00,/main/page-1.html,192.168.1.10,john@doe.com
2007-01-01,23:32:15,/main/page-5.html,192.168.1.10,john@doe.com
2007-01-01,23:39:00,/main/page-1.html,192.168.1.40,jane@doe.com
2007-01-01,23:43:00,/main/page-4.html,192.168.1.40,jane@doe.com
2007-01-01,23:44:00,/main/page-3.html,192.168.1.10,jack@doe.com
2007-01-01,23:46:00,/main/page-9.html,192.168.1.10,jack@doe.com

schneidz · 08-17-2007, 10:52 AM

not sure if this will work universally for your needs:

Code:

awk -F , '{print $1","$2","$3","$NF}' ip.lst | grep @

Micro420 · 08-21-2007, 12:51 AM

Unfortunately that won't work because the email field isn't always in the last line in the original CSV file.

I somehow need a way to hold the email in a temporary variable/file, and then stick it in the end to the corresponding IP address. Any ideas?

schneidz · 08-21-2007, 09:07 AM

this is quick-and-dirty but i think it'll work:

Code:

grep @ ip.lst | awk -F , '{print $1","$2","$3","}' > grep-awk.out

i=1
max=`cat ip.lst | wc -l`
# echo i = $i - max = $max
while [ $i -le $max ]
do
sed -n "$i"p ip.lst | tr , "\n" | grep @ >> email.lst
# echo $i - $email
i=`expr $i + 1`
done

paste -d '\0' grep-awk.out email.lst

Micro420 · 08-22-2007, 12:13 AM

I give up with this project.

I posted an ad on Craigslist hoping that someone can finish this project for me since my company will pay for it. I probably spent over 20-30 hours on this and I haven't gotten anywhere! So frustrating.

Not sure if this is okay to ask, but if anyone is interested PM me.

chrism01 · 08-22-2007, 03:13 AM

So, re-reading your OP, you want to add an email to a line if you've previously seen a line with the same IP and an email.
You also change the associated email for all subsequent lines if a new email appears (see john & jack both using ...10).
If you never see an email for a given IP, you don't want to see that line at all in the output.
IOW, only lines that (after checking) have an email are reported.
Is this correct?
Personally I'd use Perl. Does it have to be shell?