Filtering a CSV file from a web log with a shell script?
As you can see above, it kept the first and last page each user visited. It also automatically inserted the email address into the missing 4th field. Note the bold line where another user, "jane@doe.com", "cut" in during the same time. Also, "john@doe.com" and "jack@doe.com" happen to be roommates, so they have the same IP address but different email logins. If an IP address has no email address associated with it, then I don't need to see it.
Is this possible to fully automate in a shell script?
If you can fully enumerate all the edge cases, then yes, you can rearrange the fields to your liking. Although you can probably do everything in sed, it is most likely simpler to do in awk. The following may help provide some insight into what you need to do to break down each line into individual fields:
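As a minimal sketch of that breakdown, assuming plain comma-separated fields with no quoted commas: the sample line and its column order (date, time, URL, IP, email) are made up for illustration, and `split_fields` is a hypothetical helper name, so adjust both to match your real log.

```shell
#!/bin/sh
# Hypothetical sample line; the real column order may differ.
line='2010-01-05,10:00:01,/page-1.html,192.0.2.1,john@doe.com'

# -F, sets the field separator before the first record is read.
# Assigning FS inside the main action only takes effect from the next line on.
split_fields() {
    awk -F, '{ for (i = 1; i <= NF; i++) printf "field %d = %s\n", i, $i }'
}

printf '%s\n' "$line" | split_fields
```

Once you can see which field number holds the URL, the IP, and the email, you can test each one separately with patterns such as `$3 ~ /page-/`.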
I have actually been using sed and awk, but my understanding of awk is very slim. I can post my shell script later when I am at work so you guys can see what I have and have not been able to do.
My approach:
step 1) extract the IP addresses from the CSV file for lines that have an email address AND "page-" in the URL, since those are the two main things I am looking for
step 2) run this IP list against the CSV to further filter it and somehow stick the email address at the end of each line
step 3) somehow take the first and last line per IP address and VOILA - all done.
Code:
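#!/bin/sh
# Run from the log directory, e.g. /home/user/logs
rm -f finished.csv    # start clean so a rerun does not read or append to old output

# Step 1 & 2: list each IP that has "page-" in the URL on a line containing an email
for i in *.csv
do
    # -F, sets the separator before the first line is split.
    # /@/ is a regex match; the quoted string "/@/" would always be true.
    awk -F, '$3 ~ /\/page-[0-9]+/ && /@/ && $4 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ {print $4}' "$i"
done | sort -u > iplist.txt    # sort -u keeps one copy per IP; uniq -d kept only adjacent repeats

# Step 3: run the IP list against the CSV files
for k in *.csv
do
    # grep -F: fixed strings, -w: whole words, so 1.2.3.4 does not also match 11.2.3.4
    # Then search for "@" in field 5 onward so I know an email exists, and print it as field 5
    grep -F -w -f iplist.txt "$k" |
    awk 'BEGIN { FS = OFS = "," }
    {
        for (i = 5; i <= NF; i++) {
            if ($i ~ /@.*\./) {
                print $1, $2, $3, $4, $i
                break
            }
        }
    }' >> finished.csv
done

# Somehow magically print only the 1st and last line per IP address.
# As written, this keeps the 1st and last line of the ENTIRE list, so where does this go? :(
sed -n '1p;$p' finished.csv
#the end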
And this is what it magically looks like in the end.
I give up on this project. I posted an ad on Craigslist hoping that someone can finish it for me, since my company will pay for it. I have probably spent 20-30 hours on this and I haven't gotten anywhere! So frustrating.
Not sure if this is okay to ask, but if anyone is interested PM me.
So, re-reading your OP, you want to add an email to a line if you've previously seen a line with the same IP and an email.
You also change the associated email for all subsequent lines if a new email appears (see john & jack both using ...10).
If you never see an email for a given IP, you don't want to see that line at all in the output.
IOW, only lines that (after checking) have an email are reported.
Is this correct?
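Assuming that reading is correct, the rule can be sketched in a single awk pass. The column order (date, time, URL, IP, optional email) and the sample lines are assumptions for illustration, not your real data, and `fill_email` is a hypothetical name.

```shell
#!/bin/sh
# Assumed layout (hypothetical): date,time,url,ip[,email]
fill_email() {
    awk -F, -v OFS=, '
        # Remember the most recent email seen for this IP address.
        $5 ~ /@/ { email[$4] = $5 }
        # Print only lines whose IP already has an email, filling it in as field 5.
        ($4 in email) { print $1, $2, $3, $4, email[$4] }
    '
}

fill_email <<'EOF'
2010-01-05,10:00:01,/index.html,192.0.2.1
2010-01-05,10:00:05,/page-1.html,192.0.2.1,john@doe.com
2010-01-05,10:00:09,/page-2.html,192.0.2.1
2010-01-05,10:00:12,/page-1.html,192.0.2.7
2010-01-05,10:01:00,/page-1.html,192.0.2.1,jack@doe.com
2010-01-05,10:01:30,/page-3.html,192.0.2.1
EOF
```

In this sketch, a line that appears before the first email for its IP is dropped, and so is every line from an IP that never shows an email. If you want the earlier lines back-filled as well, you would need two passes over the file.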
Personally I'd use Perl. Does it have to be shell?
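If it does have to be shell, the first-and-last step could be sketched in awk as well. This assumes every input line already carries its email in field 5 and that a "visitor" means one IP and email pair; both are my assumptions, and `first_last` is a hypothetical name.

```shell
#!/bin/sh
# Assumed input (hypothetical): date,time,url,ip,email with the email already filled in
first_last() {
    awk -F, '
        { key = $4 SUBSEP $5 }                      # one visitor = IP plus email
        !(key in first) { first[key] = $0; order[++n] = key }
        { last[key] = $0; count[key]++ }
        END {
            for (j = 1; j <= n; j++) {
                k = order[j]
                print first[k]
                if (count[k] > 1) print last[k]     # skip the repeat for single-line visitors
            }
        }
    '
}

first_last <<'EOF'
2010-01-05,10:00:05,/page-1.html,192.0.2.1,john@doe.com
2010-01-05,10:00:09,/page-2.html,192.0.2.1,john@doe.com
2010-01-05,10:00:20,/page-1.html,192.0.2.9,jane@doe.com
2010-01-05,10:00:40,/page-3.html,192.0.2.1,john@doe.com
EOF
```

Note that the output is grouped by visitor in order of first appearance rather than kept in strict time order, so pipe it through sort afterwards if you need it chronological.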