[SOLVED] Remove the duplicate and count the line!!

grail · 09-19-2011, 08:38 AM

Two options:

1. Use a character class for each letter so that either case matches:

Code:

/^[^ ]+[Aa][Ss][Ii][Aa]/

2. Tell awk to ignore case:

Code:

BEGIN{IGNORECASE=1}/^[^ ]+asia/
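
IGNORECASE is specific to GNU awk; in other awks it is just an ordinary variable and has no effect. A minimal portable sketch, assuming the name you care about starts the line as in the samples, lowercases the line before matching. The function name asia_names is made up for illustration:

```shell
# Hypothetical helper: print the first field of every line that starts
# with a name containing "asia" in any mix of upper and lower case.
asia_names() {
    awk 'tolower($0) ~ /^[^ ]+asia/ { print $1 }' "$@"
}

# Example run on three sample records; the .COM line is not printed.
printf '%s\n' \
    '0008.ASIA. NS AS2.DNS.ASIA.CN.' \
    'anish.asia NS AS2.DNS.ASIA.CN.' \
    'SWEATY.COM. NS NS1.PARKED.COM.' | asia_names
```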

I am not sure if I followed all of what you are trying to achieve, but the below might give you some ideas. There are probably more improvements to be had:

Code:

#!/bin/bash

current_date=`date +%d-%m-%Y_%H.%M.%S`
today=`date +%d%m%Y`
yesterday=`date -d 'yesterday' '+%d%m%Y'`
RootPath=/var/domaincount/asia/
MainPath=$RootPath${today}asia
LOG=/var/tmp/log/asia/asiacount$current_date.log

mkdir -p $MainPath
echo Intelliscan Process started for Asia TLD $current_date 

exec 6>&1 >> $LOG

#################################################################################################
## Download the zone file with wget; it will try only once
## (-O saves it under the name that gunzip expects below)
if ! wget --tries=1 --ftp-user=USERNAME --ftp-password=PASSWORD -O asia.zone.gz ftp://ftp.anish.com:21/zonefile/anish.zone.gz
then
    echo "Download failed, domain count aborted"
    exit 1
fi
### The downloaded file is gzip-compressed; unpack it before starting the domain count ###
### (-c writes to stdout; plain gunzip unpacks in place and would leave the redirect target empty)
gunzip -c asia.zone.gz > $MainPath/$today.asia

###### It will start the Count #####
awk '/^[^ ]+ASIA/ && !_[$1]++{print $1; tot++}END{print "Total",tot,"Domains"}' $MainPath/$today.asia > $RootPath/zonefile/$today.asia
awk '/Total/ {print $2}' $RootPath/zonefile/$today.asia > $RootPath/$today.count

a=$(< $RootPath/$today.count)
b=$(< $RootPath/$yesterday.count)
# c = names present in both today's and yesterday's lists
# (the "Total" summary lines have more than one field and are skipped)
c=$(awk 'NF!=1{next} NR==FNR{a[$0];next} $0 in a{tot++}END{print tot+0}' $RootPath/zonefile/$today.asia $RootPath/zonefile/$yesterday.asia)

echo "$current_date Count For Asia TLD $a"
echo "$current_date Unchanged Domain Count For Asia TLD $c"
echo "$current_date New Registration Domain Counts $((a - c))"
echo "$current_date Deleted Domain Counts $((b - c))"

exec >&6 6>&-
cat $LOG | mail -s "Asia Tld Count log" 07anis@gmail.com
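
The new/deleted arithmetic comes down to: new = today minus common, deleted = yesterday minus common. Here is a rough sketch of just that step, assuming two non-empty, already de-duplicated files with one domain per line. The function name domain_delta is made up for illustration:

```shell
# Hypothetical helper: compare two domain lists (one name per line).
# Usage: domain_delta TODAY_FILE YESTERDAY_FILE
domain_delta() {
    awk '
        NR == FNR { today[$0]; n_today++; next }   # first file: remember today
        { n_yest++; if ($0 in today) common++ }    # second file: count overlap
        END {
            print "Common",  common + 0
            print "New",     n_today - common
            print "Deleted", n_yest - common
        }' "$1" "$2"
}
```

Doing all three counts in one awk pass avoids the separate .count files, but the script above works just as well.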

anishkumarv · 09-19-2011, 02:21 PM

Hi Grail,

Compared to my script (sorry, it's not really a script, just commands), yours is 100% better. Thanks a lot.

anishkumarv · 09-19-2011, 03:51 PM

Hi Grail,

Code:

awk 'BEGIN{IGNORECASE=1}/^[^ ]+asia/ { gsub(/\.$/,"",$1);split($1,a,".")} length(a)==2{b[$1]++;}END{for (x in b)print x}'

Using this code I get only the main domains from a file like this:

Quote:

0008.ASIA. NS AS2.DNS.ASIA.CN.
0008.ASIA. NS AS2.DNS.ASIA.CN.
ns1.0008.asia NS AS2.DNS.ASIA.CN.
www.0008.asia NS AS2.DNS.ASIA.CN.
anish.asia NS AS2.DNS.ASIA.CN.
ns2.anish.asia NS AS2.DNS.ASIA.CN

like this

Quote:

0008.ASIA
anish.asia

This code works fine sometimes, but sometimes it skips the "www" entries entirely. Is there any way to avoid this? If you know, kindly post your idea.

anishkumarv · 09-19-2011, 08:55 PM

Hi grail,

Again, lots of issues keep coming up one by one in my awk command :-(

Suppose a file has content like this:

Quote:

0008.ASIA. NS AS2.DNS.ASIA.CN.
0008.ASIA. NS AS2.DNS.ASIA.CN.
ns1.0008.asia. NS AS2.DNS.ASIA.CN.
www.0008.asia. NS AS2.DNS.ASIA.CN.
anish.asia NS AS2.DNS.ASIA.CN.
ns2.anish.asia NS AS2.DNS.ASIA.CN
ANISH.asia. NS AS2.DNS.ASIA.CN.

Using this command,

Code:

awk 'BEGIN{IGNORECASE=1}/^[^ ]+asia/ { gsub(/\.$/,"",$1);split($1,a,".")} length(a)==2{b[$1]++;}END{for (x in b)print x}'

I get output like this:

Quote:

0008.ASIA.
anish.asia.
ANISH.asia.

Those are duplicates: anish.asia. and ANISH.asia. are the same domain. I want to display only one of them, either upper case or lower case.

Note: I keep posting my questions in this thread on LQ, but please don't think I am not trying; I am also working on these issues myself.

grail · 09-20-2011, 01:00 AM

So in order they were posted:

Quote:

This code works fine sometimes, but sometimes it skips the "www" entries entirely. Is there any way to avoid this? If you know, kindly post your idea.

You have the following in your code:

Code:

length(a)==2

split() breaks www.0008.asia into 3 parts, so length(a) is 3 and the line is ignored.
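
A quick way to see the part counts for yourself, as a rough sketch: split() returns the number of pieces, so you can print it next to each name and also rebuild the last two parts instead of dropping the line. The function name label_counts is made up for illustration:

```shell
# Hypothetical helper: for each record print the name, its number of
# dot-separated parts, and the name rebuilt from the last two parts.
label_counts() {
    awk '{
        name = $1
        sub(/\.$/, "", name)               # drop the trailing root dot
        n = split(name, parts, ".")        # n = number of parts
        if (n >= 2)
            print name, n, parts[n-1] "." parts[n]
    }' "$@"
}

printf '%s\n' '0008.ASIA. NS AS2.DNS.ASIA.CN.' 'www.0008.asia. NS AS2.DNS.ASIA.CN.' | label_counts
```

For the two sample lines this prints a count of 2 and 3, and both lines reduce to the same last two parts (apart from case).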

Quote:

Those are duplicates: anish.asia. and ANISH.asia. are the same domain.

The regex is ignoring case but when being stored to be printed later you are not testing to see if they are the same, minus the case.
So you need to add a test for what is being stored in b

PTrenholme · 09-20-2011, 11:09 AM

Quote:

Originally Posted by grail

. . .
The regex is ignoring case but when being stored to be printed later you are not testing to see if they are the same, minus the case.
So you need to add a test for what is being stored in b

Or just use b[toupper($1)]++ or b[tolower($1)]++ to let AWK handle it, picking the case you want displayed.

(As a side comment, you seem to be ignoring the pending change to IP version 6, and assuming that the DNS records are always going to be in version 4 format.)

anishkumarv · 09-22-2011, 10:11 AM

Quote:

Originally Posted by PTrenholme

Or just use b[toupper($1)]++ or b[tolower($1)]++ to let AWK handle it, picking the case you want displayed.

(As a side comment, you seem to be ignoring the pending change to IP version 6, and assuming that the DNS records are always going to be in version 4 format.)

Code:

awk -F'[. ]' 'BEGIN{IGNORECASE=1}$3=="asia" {$1=$2;$2=$3} $2=="asia"&&!_[$1]++{print $1"."$2}END{print "Total",length(_),"Domains"}'

Using this I can skip subdomains and so on, but the duplicates are still there:

Code:

0008.ASIA.
anish.asia.
ANISH.asia.

I am still getting this kind of output. I am totally stuck on this issue: how can I avoid this kind of duplicate?

crts · 09-22-2011, 11:07 AM

Hi,

Not sure if this helps with counting the domains. I used something similar to eliminate duplicates in a text file. With some slight modifications it counts the unique domains:

Code:

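awk 'BEGIN {IGNORECASE=1;count=0}
/^[[:blank:]]*$/ {next}    # skip blank lines before touching $1
{
        # reduce the name to its last two labels, dropping any trailing dot
        $1=gensub("([0-9A-Za-z]+\\.)*([0-9A-Za-z]+\\.[0-9A-Za-z]+)\\.*$", "\\2","1", $1)
        for (i=0;i<count;i++) {
                if (store[i] == $1) {    # comparison ignores case because of IGNORECASE
                        next
                }
        }
        store[count++]=$1
} END {
        for (k=0;k<count;k++) {
                print store[k]
        }
        print "Total: " count
}' "$1"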

Tested with:

Code:

0008.ASIA. NS AS2.DNS.ASIA.CN.
0008.ASIA. NS AS2.DNS.ASIA.CN.
ns1.0008.asia. NS AS2.DNS.ASIA.CN.
www.0008.asia. NS AS2.DNS.ASIA.CN.
anish.asia NS AS2.DNS.ASIA.CN.
ns2.anish.asia NS AS2.DNS.ASIA.CN
ANISH.asia. NS AS2.DNS.ASIA.CN.
SWEATY.COM. NS NS1.PARKED.COM.
SWEATY.COM. NS NS2.PARKED.COM.
SWEATYANDREADY.COM. NS NS63.ANISH.COM.
SWEATYANDREADY.COM. NS NS64.ANISH.COM.
SWEATYBANDS.COM. NS NS03.ANISH.COM.
SWEATYBANDS.COM. NS NS04.ANISH.COM.
SWEATYBETTY.COM. NS NS67.ANISH.COM.
SWEATYBETTY.COM. NS NS68.ANISH.COM.
SWEATYDANCER.COM. NS NS13.ANISH.COM.
SWEATYDANCER.COM. NS NS14.ANISH.COM.

It requires GNU awk.
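
If GNU awk is not available, roughly the same idea can be sketched with only standard awk features: tolower() instead of IGNORECASE, split() instead of gensub(), and an array lookup instead of the inner loop. This is an assumption-laden sketch, not a drop-in replacement: it prints every name in lower case, and the function name unique_domains is made up for illustration:

```shell
# Hypothetical helper: print each distinct "last two labels" name once,
# in first-seen order, followed by a total.
unique_domains() {
    awk '
        NF == 0 { next }                       # skip blank lines
        {
            name = tolower($1)
            sub(/\.+$/, "", name)              # drop trailing dot(s)
            n = split(name, parts, ".")
            if (n < 2) next                    # nothing to reduce
            dom = parts[n-1] "." parts[n]      # keep the last two labels
            if (!seen[dom]++) order[++count] = dom
        }
        END {
            for (k = 1; k <= count; k++) print order[k]
            print "Total: " (count + 0)
        }' "$@"
}
```

Call it as unique_domains sample.txt. The lookup in seen[] keeps the run time proportional to the file size, which may matter for a zone file with a couple of million records.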

PTrenholme · 09-22-2011, 01:05 PM

You quoted my post, above, but I don't think you tried it. Using the test data crts posted, here's what I get using the tolower function in the code you posted:

Code:

$ cat <<EOF >sample.txt
> 0008.ASIA. NS AS2.DNS.ASIA.CN.
> 0008.ASIA. NS AS2.DNS.ASIA.CN.
> ns1.0008.asia. NS AS2.DNS.ASIA.CN.
> www.0008.asia. NS AS2.DNS.ASIA.CN.
> anish.asia NS AS2.DNS.ASIA.CN.
> ns2.anish.asia NS AS2.DNS.ASIA.CN
> ANISH.asia. NS AS2.DNS.ASIA.CN.
> SWEATY.COM. NS NS1.PARKED.COM.
> SWEATY.COM. NS NS2.PARKED.COM.
> SWEATYANDREADY.COM. NS NS63.ANISH.COM.
> SWEATYANDREADY.COM. NS NS64.ANISH.COM.
> SWEATYBANDS.COM. NS NS03.ANISH.COM.
> SWEATYBANDS.COM. NS NS04.ANISH.COM.
> SWEATYBETTY.COM. NS NS67.ANISH.COM.
> SWEATYBETTY.COM. NS NS68.ANISH.COM.
> SWEATYDANCER.COM. NS NS13.ANISH.COM.
> SWEATYDANCER.COM. NS NS14.ANISH.COM.
> EOF
$ awk 'BEGIN{IGNORECASE=1}/^[^ ]+asia/ { gsub(/\.$/,"",$1);split($1,a,".")} length(a)==2{b[tolower($1)]++;}END{for (x in b)print x}' sample.txt 
sweatybands.com.
sweatyandready.com.
anish.asia
sweatydancer.com.
sweaty.com.
sweatybetty.com.
0008.asia

anishkumarv · 09-22-2011, 01:52 PM

Hmmm... I have been stuck on this for nearly 2 weeks, but you are telling me I have not tried.

If you read my thread fully you will find my exact file format. I am posting it again for you.

Code:

;start: 1315288329
;File created: 2011-09-06 05:52:09 IST
;Export host: 199.115.158.5
;Record count: 2330419
;Created by ANISH

$ORIGIN asia.
@ IN SOA A.COM.ANISH.INFO. NOC.ANISH.INFO. (
                                    2008334441 ; serial
                                    10800 ; refresh
                                    3600 ; retry
                                    2592000 ; expire
                                    86400 ; minimum
                                    )
$TTL 86400

0008.ASIA. NS AS2.DNS.ASIA.CN.
0008.ASIA. NS AS2.DNS.ASIA.CN.
ns1.0008.asia. NS AS2.DNS.ASIA.CN.
www.0008.asia. NS AS2.DNS.ASIA.CN.
anish.asia. NS AS2.DNS.ASIA.CN.
ns2.anish.asia NS AS2.DNS.ASIA.CN
ANISH.ASIA. NS AS2.DNS.ASIA.CN.

;End of file: 1315288329

Code:

awk -F'[. ]' 'BEGIN{IGNORECASE=1}$3=="asia" {$1=$2;$2=$3} $2=="asia"&&!_[$1]++{print $1"."$2}END{print "Total",length(_),"Domains"}' filename

Using this code I get output like this:

Code:

$ORIGIN.asia
0008.ASIA
anish.asia
ANISH.ASIA
Total 4 Domains

but I want it like this:

Quote:

0008.ASIA
anish.asia
Total 2 Domains

either this or

Quote:

0008.ASIA
ANISH.ASIA
Total 2 Domains

PTrenholme · 09-22-2011, 06:19 PM

I think you may have misunderstood my comment. I didn't mean to suggest that you weren't trying to solve your problem. I just meant that I saw no evidence that you'd tried to use the tolower or toupper function on your array indices.

In any case, where you have _[$1]++ you need _[tolower($1)]++ because _["ASIA"] and _["asia"] refer to different array elements.

So:

Code:

$ awk -F'[. ]' 'BEGIN{IGNORECASE=1}$3=="asia" {$1=$2;$2=$3} $2=="asia"&&$1&&!_[tolower($1)]++{print $1"."$2}END{print "Total",length(_),"Domains"}' sample2.txt
0008.ASIA
anish.asia
Total 2 Domains

<edit>
Note the test for a non-null $1 in the condition of the print stanza. ($2=="asia"&&$1&&!_[tolower($1)]++)
I had to add that to eliminate a .asia line.

(This is what I was trying to write when the cat intervened. See the next post . . .)
</edit>

PTrenholme · 09-22-2011, 06:37 PM

Error post. (My cat decided to play with my keyboard.)

grail · 09-22-2011, 08:21 PM

They are clever like that (mine has locked me out of my computer before)

@OP - PT has you on the right path. Without setting the index of your array to one case or the other you will always receive all variations. Where you may have been getting confused is the IGNORECASE option I gave you. This affects the matching done by your regex but does not affect the items you assign to the array.
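
A tiny check makes the array part visible. As a sketch: array subscripts are plain strings and are compared exactly, so the two spellings land in two elements unless the key is folded to one case first. The array names raw and folded are just for illustration:

```shell
# Count distinct keys with and without case folding.
result=$(printf '%s\n' ANISH.asia anish.asia |
    awk '{ raw[$1]++; folded[tolower($1)]++ }
         END {
             for (k in raw) r++
             for (k in folded) f++
             print r, f       # distinct raw keys, distinct folded keys
         }')
echo "$result"
```

This prints "2 1": two raw keys, one folded key.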

anishkumarv · 09-24-2011, 12:32 PM

Hi all,

Thanks a lot for helping me solve this. From this thread I learned a lot about AWK.

Special thanks to grail: dude, thanks for your patience in tolerating my bad English.

Finally, using this command I got the output I expected:

Code:

awk '(i=match($1,/[^.]+\.[Ii][Nn][Ff][Oo]/))&&(d=tolower(substr($1,i,RLENGTH)))&&!a[d]++{print d;tot++}END{print "Total",tot,"Domains"}'

Thanks a lot guys :-)

grail · 09-24-2011, 12:41 PM

Well, glad you got a working solution, although of course it does not work with any of the data you have provided us (it matches .info names, not .asia), so I assume the real data is very different.
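
For anyone landing here with the .asia samples from earlier in the thread, the same match/tolower idea can be sketched with the TLD passed in as a variable. This is only a sketch under assumptions: the name is the first field, and the function name count_tld and the variable names are made up for illustration:

```shell
# Hypothetical helper. Usage: count_tld TLD [FILE...]
count_tld() {
    tld=$1; shift
    awk -v tld="$tld" '
        BEGIN { tld = tolower(tld) }
        {
            name = tolower($1)                 # fold case before anything else
            sub(/\.$/, "", name)               # drop the trailing root dot
        }
        match(name, "[^.]+\\." tld "$") {      # last label before the TLD, plus the TLD
            d = substr(name, RSTART, RLENGTH)
            if (!seen[d]++) { print d; tot++ }
        }
        END { print "Total", tot + 0, "Domains" }' "$@"
}
```

Because the pattern is anchored to the end of the name, comment lines and the $ORIGIN line from the zone file header do not match and need no special handling.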