Count domains with AWK
Hi all,
I hope I can get some help with AWK. I need to count all the domains in an e-mail log file. The addresses are in field 7 ($7), and I need to count each domain separately. This is what I have so far:
Code:
awk '$7=/@./ {som = som+1}; END {print som}' seip1_1_.log
But this counts all the domains that appear after the @ together, as a single total. I want this result:
Code:
@domain.com 22
@otherdomain.net 12
@somethingelse.org 5
@other.biz 3
where each number is how often that domain occurs. The AWK needs to be a one-liner. I really hope someone can help me with this! Thanks in advance!
Sincerely yours, S1GNZ |
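A note on why the attempt above gives one total: in awk, `=` is assignment and `~` is the match operator, so `$7=/@./` stores the result of matching the whole line against /@./ (1 or 0) in field 7 and then uses that value as the pattern. A minimal sketch, assuming made-up 7-field lines and a hypothetical helper name show_assignment_effect:

```shell
# Hypothetical input: two 7-field lines, only the first contains an @.
show_assignment_effect() {
    awk '$7 = /@./ { n++; print "field 7 is now:", $7 }
         END { print n + 0, "line(s) counted" }'
}

printf '%s\n' 'f1 f2 f3 f4 f5 f6 <KAAG@Myanmar.com>' 'f1 f2 f3 f4 f5 f6 no-address' |
    show_assignment_effect
```

The address in field 7 is overwritten with 1, so nothing is left to group by, and the counter only says how many lines contained an @.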
Hi,
Is this what you are looking for?
Code:
awk '{ countArray[$7]++ } END { for (j in countArray) print j,countArray[j] }' infile
Hope this helps. |
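For anyone new to the idiom: the array is indexed by the field value itself, so every distinct value gets its own counter, and the END block prints them all. A small sketch with made-up three-field input (the address is in field 3 here rather than $7, and count_by_field is a hypothetical helper name):

```shell
# Count how often each distinct value occurs in the given column.
count_by_field() {
    awk -v col="$1" '{ count[$col]++ }
        END { for (key in count) print key, count[key] }'
}

printf '%s\n' 'a b x@one.example' 'a b y@two.example' 'a b x@one.example' |
    count_by_field 3 | sort
```

The order of `for (key in count)` is not defined in awk, which is why the output is piped through sort.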
Quote:
Now I get a nice list of all the e-mail addresses and how many e-mails were sent from each address. But I only need the domains. That is probably just a small change in the code, but I don't really know how to make it. What I have is this:
Code:
awk 'BEGIN {$8~/^@./} {countArray[$8]++ } END { for (j in countArray) print j,countArray[j] }' seip1_1_.log
If I change countArray[$8]++ to countArray[$8~/^@./], it counts everything that matches the filter together. And if I change the way it prints, does it still count correctly, and do the totals still add up? Thanks, and thanks in advance again :).
Sincerely yours, S1GNZ
p.s. It might be handy to post part of the output:
Code:
<KAAG@Myanmar.com> 2
<HOOFDDORP@Canada.com> 2
<WIERINGERWAARD@Canada.com> 2
<ZAANDAM@Filipijnen.com> 2
<RIJNSATERWOUDE@Ascension.com> 3
<WARMOND@Colombia.com> 2
<GROET@Mali.com> 2
<WIERINGERWERF@Canada.com> 2
<OPPERDOES@Ivoorkust.com> 2
<HOOGKARSPEL@Canada.com> 2
<ZUIDOOSTBEEMSTER@Canada.com> 2
<OVERVEEN@Ghana.com> 2
<EGMOND-BINNEN@Rhodesie.com> 2
<BERGEN_NH@Mali.com> 1
<ZUID-SCHARWOUDE@Canada.com> 2
<MARKENBINNEN@Canada.com> 2
<STARNMEER@Canada.com> 2
<VOLENDAM@Malawi.com> 264
remote.LUCHTHAVEN_SCHIPHOL@Zwitserland.com 24
<UITDAM@Cuba.com> 34
<OTERLEEK@Canada.com> 2
<ZWAAGDIJK@Rhodesie.com> 2
<AVENHORN@Canada.com> 2
<KOUDEKERK_AAN_DEN_RIJN@Canada.com> 2
<HOOGMADE@Canada.com> 2
<ZWAANSHOEK@Canada.com> 2
<DEN_HOORN_TEXEL@Canada.com> 2
<HOOGWOUD@Canada.com> 2
<OUDE_WETERING@Rhodesie.com> 2
<WATERINGEN@Kroatie.com> 2
<ZUIDSCHERMER@Canada.com> 2
<RIJNSBURG@Frankrijk.com> 5
<SCHAGERBRUG@Albanie.com> 4
<DIRKSHORN@Canada.com> 2
<RIJNSBURG@Slovenie.com> 1
<HEILOO@Mali.com> 2
<ZUIDERMEER@Canada.com> 2
<OUDE_NIEDORP@Monaco.com> 16
<SCHARWOUDE@Zambia.com> 4
remote.LEIMUIDERBRUG@Vaticaanstad.com 2
<SANTPOORT_ZUID@Australisch_Nieuwguinea.com> 2
<UITGEEST@Canada.com> 2
<OUDENDIJK_NH@Canada.com> 2
remote.SPIJKERBOOR_NH@Zwitserland.com 4
<AALSMEERDERBRUG@Marokko.com> 4
remote.VALKENBURG_ZH@Zwitserland.com 1
<KATWIJK_ZH@Jemen.com> 2
<HEEMSTEDE@Canada.com> 2
<HEEMSKERK@Zweden.com> 2
<DEN_HELDER@Canada.com> 2
<T_VELD@Oostenrijk.com> 1
<WATERGANG@Togo.com> 8
remote.MONNICKENDAM@Zwitserland.com 132
<OUDESCHILD@Canada.com> 2
<DEN_OEVER@Canada.com> 2
<SCHARDAM@Canada.com> 2
<RIJPWETERING@Rusland.com> 10
<DEN_BURG@Canada.com> 2
<OUDESLUIS@Canada.com> 2
<SCHAGEN@Canada.com> 2
<LUTJEWINKEL@Canada.com> 2
<ZWAAG@Mali.com> 1
<ILPENDAM@Oostenrijk.com> 1
remote.ZUIDERWOUDE@Zwitserland.com 2
<HILVERSUM@Iran.com> 2
remote.MUIDERBERG@Zwitserland.com 12
remote.ZWANENBURG@Zwitserland.com 17
<LUTJEBROEK@Canada.com> 2
<KATWOUDE@Cuba.com> 4
remote.KORTENHOEF@Zwitserland.com 1 |
Actually it would be more helpful to post some of the input.
However, I did also notice that we have now changed from $7 to $8?? |
Hi again,
Without the appropriate information this will go nowhere. Please post a few relevant lines of the input file and the desired output for those posted lines. |
This is part of the file, as an example:
Code:
d k 1004083501.83190500 1004083501.156831500 1004083501.323597500 2950 <AMSTERDAM@Canada.com> local.AMSTERDAM_ZUIDOOST@Frankrijk.com 9238 81
Edit: I now know the output I need to have:
Code:
Domain        Received  Sent
Canada.com    50        0
Malawi.com    61        0
Volendam.com  0         32
etc.
Thanks once again. |
Hi,
You changed the output again. I'm not sure why you need two numbers after the domain (Canada.com 50 0 vs. Canada.com 50). I'm going to assume that you want to see the name of the domain and the count (i.e. Canada.com 50).
Code:
awk '/^[a-z]/ { gsub(/.*@/,"",$8) ; gsub(/>/,"",$8) ; countArray[$8]++ } END { for (j in countArray) print j,countArray[j] }' infile
/^[a-z]/ => only process lines that start with a-z
gsub(/.*@/,"",$8) => strip everything up to and including the @ from $8
gsub(/>/,"",$8) => strip the > (if present) from $8
countArray[$8]++ => increment the counter for that domain
END { for (j in countArray) print j,countArray[j] } => print each domain with its count
Sample run:
Code:
$ cat infile |
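To see the two gsub calls at work, here is a sketch that can be run without the log file. The three input lines are shortened, made-up records modeled on the one line posted earlier (10 fields, sender in $7, receiver in $8), and strip_and_count is a hypothetical wrapper name:

```shell
# Strip the receiver in field 8 down to its domain, then count per domain.
strip_and_count() {
    awk '/^[a-z]/ {
        gsub(/.*@/, "", $8)   # drop everything up to and including the @
        gsub(/>/, "", $8)     # drop a closing > if there is one
        count[$8]++
    }
    END { for (d in count) print d, count[d] }'
}

printf '%s\n' \
    'd k t t t 1 <A@Canada.com> local.B@Frankrijk.com 1 1' \
    'd k t t t 1 <C@Malawi.com> <D@Zwitserland.com> 1 1' \
    'd k t t t 1 <A@Canada.com> local.E@Frankrijk.com 1 1' |
    strip_and_count | sort
```

Because `.*` is greedy, the first gsub removes the whole local part, including prefixes such as `local.` or `<`.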
You can place your gsubs together as well:
Code:
awk '!/^$/{gsub(/.*@|>/,"",$8);_[$8]++}END{for (i in _)print i, _[i]}' in.txt |
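The merged pattern `/.*@|>/` is an alternation: it removes either "everything up to the @" or a ">", and gsub applies it to every match in the string. A quick check on the two addresses from the posted log line (domain_of is a hypothetical helper name, and it works on the whole line instead of $8 to keep the sketch short):

```shell
# Reduce a single address to its domain with one gsub call.
domain_of() {
    printf '%s\n' "$1" | awk '{ gsub(/.*@|>/, ""); print }'
}

domain_of '<AMSTERDAM@Canada.com>'
domain_of 'local.AMSTERDAM_ZUIDOOST@Frankrijk.com'
```

Both forms of address come out as a bare domain, so the bracketed and unbracketed variants end up in the same counter.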
Quote:
Changing /^[a-z]/ to !/^$/ is also an improvement. |
Ok I made a typo :).
This is for school, and the output needs to be:
Code:
Domain        Received  Sent
Canada.com    50        0
Malawi.com    61        0
Volendam.com  0         32
In the file, field $7 is the sender and field $8 is the receiver, and basically I need to achieve the above. Can you also use split instead of gsub? We need to use arrays and split, if I'm correct. But this already helps us a lot! Thanks,
Sincerely yours, S1GNZ
p.s. Maybe it doesn't have to be a one-liner, because we can also use a script. |
As this is homework I think we have provided you with all the tools you will need.
It should be easy enough to get your field 7 details if you look at the code. By the way, we are already using arrays. If you are required to use split, look up the details in your course text and you should be set. |
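As a pointer on split: it cuts a string on a separator, fills an array with the pieces, and returns how many pieces there were. A minimal sketch on a single address, not a finished solution (split_domain is a hypothetical helper name):

```shell
# Print the part after the @, using split instead of gsub.
split_domain() {
    printf '%s\n' "$1" | awk '{
        n = split($1, part, "@")    # part[1] = local part, part[2] = domain
        if (n == 2) {
            sub(/>$/, "", part[2])  # drop the closing > of <user@domain>
            print part[2]
        }
    }'
}

split_domain '<AMSTERDAM@Canada.com>'
split_domain 'local.AMSTERDAM_ZUIDOOST@Frankrijk.com'
```

Checking the return value guards against fields that contain no @ at all.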
late reply on this post
Hello
Getting the domains from mail server logs. (For the input file, see the post above with "cat infile".) Maybe I got the columns the other way around, but it seems to work. Maybe this will help somebody in the future, and maybe somebody has some comments on this solution.
Code:
#!/usr/bin/awk -f
######################################################
# Domain         Received ($8 = to)   Sent ($7 = from)
# Canada.com     50                   0
# Malawi.com     61                   0
# Volendam.com   0                    32
######################################################
# The column headers are printed in Dutch: Ontvangen = received, Verzonden = sent.
BEGIN {
    printf ("%30.30s %-10.10s %-10.10s\n", "Domain", "Ontvangen", "Verzonden")
}
# Count the receiver domain (field 8) of every "d" record.
$1 == "d" && split($8, teller, "@") && split(teller[2], result, ".") {
    ++Domains_to[result[1]]
}
# Count the sender domain (field 7) of every "d" record.
$1 == "d" && split($7, teller, "@") && split(teller[2], result, ".") {
    ++Domains_from[result[1]]
}
END {
    # Collect every domain that occurs in Domains_to or Domains_from.
    for (domain in Domains_to)   { all_domains[domain] = 1 }
    for (domain in Domains_from) { all_domains[domain] = 1 }
    for (domain in all_domains) {
        # Only the part before the first dot was stored, so ".com" is appended again.
        printf ("%30.30s", domain ".com")
        if (domain in Domains_to)
            printf ("%10.10s", Domains_to[domain])
        else
            printf ("%10.10s", "0")
        if (domain in Domains_from)
            printf ("%10.10s\n", Domains_from[domain])
        else
            printf ("%10.10s\n", "0")
    }
}
Output:
Code:
# ./filter2_4.awk seip1_1.log
Domain Ontvangen Verzonden
Angola.com 0 2
Hongarije.com 3 0
Burma.com 0 6
Irak.com 0 2
Taiwan.com 0 10
Ghana.com 0 4
Iran.com 0 2
Tunesie.com 12 16
Nieuwzeeland.com 39 0
Ethiopie.com 0 47
Zwitserland.com 418 0
Mauritius.com 2 0
Zuidvietnam.com 0 2
Zaire.com 0 2
Kameroen.com 0 2
Togo.com 0 8
Albanie.com 0 8
Vaticaanstad.com 2 2
Liberia.com 0 8
Laos.com 0 2
Chili.com 43 0
Ivoorkust.com 0 2
Cyprus.com 0 2
Kashmir.com 0 4
Saudi-Arabie.com 0 2
Singapore.com 0 2
Paraguay.com 0 4
Kroatie.com 0 2
Denemarken.com 0 2
Mali.com 0 28
Rusland.com 0 10
Slovenie.com 0 1
Suriname.com 7 12
Zuidafrika.com 0 96
Rhodesie.com 0 48
Brunei.com 0 2
Belgie.com 55 0
Jemen.com 0 4
Botswana.com 24 0
Noordjemen.com 1 0
Australisch_Nieuwguinea.com 0 2
Zambia.com 0 4
Canada.com 0 227
Colombia.com 0 8
Armenie.com 0 2
Ascension.com 0 4
Marokko.com 0 13
Frankrijk.com 357 12
Myanmar.com 0 2
Monaco.com 0 36
Cuba.com 0 38
Malawi.com 0 265
Filipijnen.com 0 4
Zweden.com 0 2 |
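One comment on the script above: it keeps only the part before the first dot and appends ".com" again when printing, which is fine for this log but would mislabel any other top-level domain. A possible variant that keeps the whole domain is sketched below. It assumes the same layout ("d" records, sender in $7, receiver in $8); the function name domain_table, the helper domain() and the shortened input lines are made up for illustration:

```shell
# Print "domain received sent" for every domain seen in fields 7 and 8.
domain_table() {
    awk '
    # Return the part after the @ without a closing >, or "" if there is no @.
    function domain(addr,    part) {
        if (split(addr, part, "@") != 2) return ""
        sub(/>$/, "", part[2])
        return part[2]
    }
    $1 == "d" {
        from = domain($7)
        to = domain($8)
        if (from != "") { sent[from]++;   seen[from] = 1 }
        if (to != "")   { received[to]++; seen[to] = 1 }
    }
    END {
        # A counter that was never set prints as 0 with %d.
        for (d in seen) printf "%s %d %d\n", d, received[d], sent[d]
    }' "$@"
}

printf '%s\n' \
    'd k t t t 2950 <AMSTERDAM@Canada.com> local.AMSTERDAM_ZUIDOOST@Frankrijk.com 9238 81' \
    'd k t t t 100 <VOLENDAM@Malawi.com> local.X@Frankrijk.com 1 1' |
    domain_table | sort
```

Here the received column is filled from $8 and the sent column from $7, matching the statement earlier in the thread that $7 is the sender and $8 the receiver.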