bash script, parsing email addresses

kepler · 01-19-2004, 06:12 AM

Hello All,
I hope someone can point me in the right direction with this problem. I'm trying to generate a report with a bash script that lists a pile of email addresses (spammers) and the number of times each one appears in a log file, but I only want to search for these email addresses by the 'domain name'.
E.g. I have the following email addresses in my log file:

..
cs6710132-199.houston.rrbbww.com
adsl-66-127-81-138.dsl.sntc01.pacabell342.net
earthping.co.uk
..

and I want to only cut out the last two or three parts of the domain name
i.e:

houston.rrbbww.com
sntc01.pacabell342.net
earthping.co.uk

I've tried using the cut command like so:

cut -d. -f2-5 logfile > result

which doesn't really work as it assumes that leaving out the first set of characters before the first period is enough... which it ain't! I need to work from the end and work backwards. Any ideas??

K.

leonscape · 01-19-2004, 06:38 AM

You could try loading the string into the script and using bash's string handling. A regex may be useful (or maybe not).
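
As a rough illustration of what bash string handling could look like here, the sketch below uses parameter expansion to keep the last two dot-separated parts of a host name. The function name last_two is made up for this example, and it assumes one host name is passed as the first argument; it is not a complete solution for the three-part case.

```shell
#!/bin/bash
# Hypothetical helper: keep the last two dot-separated parts of a host name.
last_two() {
    local host=$1
    local tld=${host##*.}     # text after the last "."
    local rest=${host%.*}     # everything before the last "."
    if [ "$rest" = "$host" ]; then
        # No "." in the name at all: return it unchanged.
        printf '%s\n' "$host"
    else
        # Last part of the remainder, then the final part.
        printf '%s.%s\n' "${rest##*.}" "$tld"
    fi
}

last_two cs6710132-199.houston.rrbbww.com   # prints rrbbww.com
```

Calling this once per line would be slow for a very large log, so it is probably better suited to small inputs or to understanding the expansions than to the full report.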

kepler · 01-19-2004, 06:48 AM

Thanks for the quick reply.

I'm new to all this, so can you explain the bash string handling to me? Or just give me the name of the command or something.

The regcomp looks a bit complex for me to use... going by its man page.

The number of records I'm sorting through is roughly 1.3 million, so I can't hold it all in memory; I must dump everything into files during the whole process.

leonscape · 01-19-2004, 07:30 AM

Okay. First you might want to look at the Advanced Bash-Scripting Guide which has a lot of info about this.

Awk is probably what you're looking for.

kepler · 01-20-2004, 10:29 AM

Fair enough.
I read through the awk page and tried using awk -F. '{print $fieldno}' to separate the email addresses into different fields. However, since the number of actual fields varies from address to address, I'm kinda back to square one. Is there something handy that will allow me to go directly to the last field of each email address and work backwards from there?

jim mcnamara · 01-20-2004, 02:48 PM

Use IFS and read

Code:

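#! /bin/sh
email="somebody@some.domain.name.com"
# Split on "@". Feed read from a here-document: with "echo | read", bash runs
# read in a subshell and the variables are lost.
IFS=@ read -r uname dmname <<EOF
$email
EOF
# Split the domain on "."; setting IFS on the read line needs no later unset.
IFS=. read -r var1 var2 var3 var4 var5 <<EOF
$dmname
EOF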

Anytime there are fewer than 5 "parts" the trailing variables will be null.
dmname has the name of the domain, plus the hostname sometimes.
var1...var5 parse out each component so you can use them.

kepler · 01-26-2004, 06:47 AM

Well, thanks for all the advice. I've found a way to do what I need to do. The code isn't the best and can crash out under certain circumstances, but here it is:

cat emaillisting | awk -F. '{print $(( NF - 1 )) "." $NF}' > domainsfile

(NF = Number of Fields)

This outputs the last two parts of the domain name. Changing the print section to:

{print $(( NF - 2 )) "." $(( NF - 1 )) "." $NF}

outputs the last three parts, and so on.

However, this can spit out an error if the original email address has no periods, which is the case with a lot of these spammers.
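
One possible way around that error, sketched below, is to clamp the starting field so awk never indexes below field 1, and to count the results in the same pass. The function name count_domains and the default of two trailing parts are assumptions made for this example. It reads host names on standard input, one per line.

```shell
#!/bin/sh
# Hypothetical helper: read host names on stdin, print "count domain" pairs.
# $1 is the number of trailing parts to keep (assumed default: 2).
count_domains() {
    awk -F. -v n="${1:-2}" '
        NF == 0 { next }                       # skip empty lines
        {
            start = (NF > n) ? NF - n + 1 : 1  # never index below field 1
            d = $start
            for (i = start + 1; i <= NF; i++) d = d "." $i
            count[d]++                         # tally each shortened name
        }
        END { for (d in count) print count[d], d }
    ' | sort -rn                               # most frequent first
}

printf '%s\n' a.houston.rrbbww.com b.rrbbww.com localhost | count_domains
```

A name with fewer parts than requested is passed through whole instead of raising an error. awk only keeps one counter per distinct shortened name, not every record, so memory use depends on how many different domains appear in the log.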

K.