Old 09-06-2011, 01:03 PM   #1
anishkumarv (Member)
Remove the duplicates and count the lines!!


Hi all,

Thanks in advance.

I am writing a script to remove the duplicates in a file and count the unique domains in the file.

Example:

Code:
0008.COM. NS AS2.DNS.COM.CN.
0008.COM. NS AS3.DNS.COM.CN.
000800.COM. NS AS1.DNS.COM.CN.
000800.COM. NS AS2.DNS.COM.CN.
000858.COM. NS AS1.CNOLNIC.COM.
000858.COM. NS AS2.CNOLNIC.COM.
000861.COM. NS AS1.DYNADOT.COM.
000861.COM. NS AS2.DYNADOT.COM.
000899.COM. NS AS.GIW.COM.SG.
000899.COM. NS AS2.GIW.COM.SG
This is a sample of the file, and I want the output to look like this:

Code:
0008.COM
000800.COM
000858.COM
000861.COM
000899.COM

Total 5 Domains
I am trying to get this output using AWK; kindly post your ideas.
 
Old 09-06-2011, 01:19 PM   #2
pwc101 (Senior Member)
Here's a solution with cut, uniq and wc:
Code:
cut -f1 -d' ' < inputfile.txt | uniq | wc -l
That gets you the number of unique domains. Remove the | wc -l at the end to get the list of unique domains.
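Note that uniq only collapses adjacent duplicates, so this relies on the input already being sorted (as the sample is). For unsorted input, a sketch using sort -u instead:
Code:
cut -f1 -d' ' < inputfile.txt | sort -u | wc -l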
 
Old 09-06-2011, 01:55 PM   #3
PTrenholme (Senior Member)
Or, to save running it twice, insert a tee /dev/stderr into the pipe, which yields
Code:
$ cut -f1 -d' ' < test.txt | uniq | tee /dev/stderr | wc -l
0008.COM.
000800.COM.
000858.COM.
000861.COM.
000899.COM.
5
Or, for an AWK solution:
Code:
$ awk -F" " '{++uniq[$1]} END {for (i in uniq) {++n;print i};print "Total: " n}' test.txt
000899.COM.
000861.COM.
000858.COM.
0008.COM.
000800.COM.
Total: 5
<edit>
Oh, to make the AWK code produce the output you specified, this should work:
Code:
$ awk -F". NS" '{++uniq[$1]} END {n=asorti(uniq,ind);for (i=1;i<=n;++i) {print ind[i] "\t(" uniq[ind[i]] " times)"};print "\nTotal: " n}' test.txt
0008.COM        (2 times)
000800.COM      (2 times)
000858.COM      (2 times)
000861.COM      (2 times)
000899.COM      (2 times)

Total: 5
</edit>
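One caveat: asorti is a gawk extension. A portable sketch of the same idea, doing the sorting outside awk instead:
Code:
$ awk -F'. NS' '{++u[$1]} END {for (i in u) print i "\t(" u[i] " times)"}' test.txt | sort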

Old 09-06-2011, 03:19 PM   #4
anishkumarv (Original Poster)
Hi all,

Thanks a lot, that sounds great :-) It works like a charm, but I want to skip a few lines at the beginning and end of the file.

The actual content is like this.


Code:
;start: 1315288329
;File created: 2011-09-06 05:52:09 IST
;Export host: 199.115.158.5
;Record count: 2330419
;Created by ANISH 

$ORIGIN COM.
@ IN SOA A.COM.ANISH.INFO. NOC.ANISH.INFO. (
                                2008334441 ; serial
                                10800 ; refresh
                                3600 ; retry
                                2592000 ; expire
                                86400 ; minimum
                                )
$TTL 86400
COM. NS A0.COM.ANISH.INFO.
COM. NS A2.COM.ANISH-NST.INFO.
COM. NS B0.COM.ANISH-NST.ORG.
COM. NS B2.COM.ANISH-NST.ORG.
COM. NS C0.COM.ANISH-NST.INFO.
COM. NS D0.COM.ANISH-NST.ORG.
0008.COM. NS AS2.DNS.COM.CN.
0008.COM. NS AS3.DNS.COM.CN.
000800.COM. NS AS1.DNS.COM.CN.
000800.COM. NS AS2.DNS.COM.CN.
000858.COM. NS AS1.CNOLNIC.COM.
000858.COM. NS AS2.CNOLNIC.COM.
000861.COM. NS AS1.DYNADOT.COM.
000861.COM. NS AS2.DYNADOT.COM.
000899.COM. NS AS.GIW.COM.SG.
000899.COM. NS AS2.GIW.COM.SG
;End of file: 1315288329
I need to skip the first 21 header lines and the last line, then remove the duplicate domain names and count them.


Code:
sed -i '1,21d' file
Using this code I removed the first 21 header lines; how can I remove the last line at the same time?
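One way to do both at once, since sed addresses the last line with $:
Code:
sed -i '1,21d;$d' file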

 
Old 09-06-2011, 04:12 PM   #5
PTrenholme (Senior Member)
Would it suffice to ignore any line that didn't start with a digit? If so, this trivial change does it:
Code:
$ awk -F". NS" '/^[[:digit:]]/{++uniq[$1]} END {n=asorti(uniq,ind);for (i=1;i<=n;++i) {print ind[i] "\t(" uniq[ind[i]] " times)"};print "\nTotal: " n}' test2.txt
0008.COM        (2 times)
000800.COM      (2 times)
000858.COM      (2 times)
000861.COM      (2 times)
000899.COM      (2 times)

Total: 5
Note: "test2.txt" is a copy from your post, above.

Skipping the first 21 lines is easily accomplished by a clause like (NR<22){next}, but skipping the last record is fairly difficult unless you want to skip any line that starts with a semi-colon. If that's acceptable, this should do it:
Code:
$ awk -F". NS" '(NR<22){next};/^;/{next};{++uniq[$1]} END {n=asorti(uniq,ind);for (i=1;i<=n;++i) {print ind[i] "\t(" uniq[ind[i]] " times)"};print "\nTotal: " n}' test2.txt
0008.COM        (2 times)
000800.COM      (2 times)
000858.COM      (2 times)
000861.COM      (2 times)
000899.COM      (2 times)

Total: 5
 
Old 09-06-2011, 07:02 PM   #6
chrism01 (LQ Guru)
Another option is to pipe the original file through the head & tail commands to strip off the headers/footers, then apply the solutions above.
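For example (a sketch; head -n -1, dropping the last line, needs GNU head):
Code:
tail -n +22 file | head -n -1 | cut -f1 -d' ' | uniq | wc -l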
 
Old 09-06-2011, 08:09 PM   #7
kurumi (Member)
Code:
$ ruby -ane 'BEGIN{h={};h.default=0};h[$F[0]]+=1 if /NS/&&!/^COM/;END{h.each{|x,y|print "#{x}: #{y} times\n"}} ' file
0008.COM.: 2 times
000800.COM.: 2 times
000858.COM.: 2 times
000861.COM.: 2 times
000899.COM.: 2 times
 
Old 09-07-2011, 12:32 AM   #8
grail (LQ Guru)
Slightly different awk:
Code:
awk '/^[0-9]/ && !_[$1]++{print $1; tot++}END{print "\nTotal",tot,"Domains"}' file
 
Old 09-07-2011, 02:43 AM   #9
anishkumarv (Original Poster)
Hi all, thanks for the reply!!!

Grail, you are just awesome. The command you gave works great. I can more or less follow the others' commands, but yours is quite difficult. Can you give a simple explanation of your command? Maybe it sounds like spoon feeding :-) but I am new to awk.
 
Old 09-07-2011, 02:57 AM   #10
indeliblestamp (Member)
My way:
Code:
thaum ~/code/shell$ awk '{print $1}' scrap.txt | grep '^[0-9].*COM' | sort -u
0008.COM.
000800.COM.
000858.COM.
000861.COM.
000899.COM.
i.e. print the first field using awk, grep only for lines starting with numbers followed by 'COM', and use sort -u to show unique entries.
 
Old 09-07-2011, 03:10 AM   #11
kurumi (Member)
Quote:
Originally Posted by indeliblestamp
The grep can be done inside awk; no point wasting another process.
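For instance, a sketch that folds the grep into the awk:
Code:
awk '/^[0-9].*COM/ {print $1}' scrap.txt | sort -u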
 
Old 09-07-2011, 03:14 AM   #12
indeliblestamp (Member)
Yeah, but I don't know awk that well. For small files like these, I find it easier to chain simple commands together.
 
Old 09-07-2011, 04:17 AM   #13
grail (LQ Guru)
No problem:

/^[0-9]/ - only test lines that start with a digit

&& - boolean and

!_[$1]++ - everything before {} is an expression to test. The array _ is indexed by $1 and incremented every time that value of $1 is seen. An expression is true when it evaluates to non-zero (mostly), and the post-increment returns the value before incrementing, so the first time a given $1 appears the expression yields 0; negating it with ! makes it true, and the action in the curly braces runs only for that first occurrence.

{print $1; tot++} - print the first field and increment our counter to say we found one of what we are looking for

END - this is only executed at the end of reading the file

{print "\nTotal",tot,"Domains"} - print the total we found

Let me know if any of that is not clear enough?
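The same one-liner spelled out with comments:
Code:
awk '
/^[0-9]/ && !_[$1]++ {   # first time this domain ($1) is seen on a digit-led line
    print $1             # print the domain
    tot++                # and count it
}
END { print "\nTotal", tot, "Domains" }  # summary after the whole file is read
' file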
 
Old 09-07-2011, 04:27 AM   #14
indeliblestamp (Member)
Nicely explained, thanks
 
Old 09-07-2011, 10:05 AM   #15
anishkumarv (Original Poster)
Quote:
Originally Posted by grail

Thanks a lot, Grail, but it only works for domains that start with digits, right?

If the domains start with letters, it will not work, right?

Code:
SWEATY.COM. NS NS1.PARKED.COM.
SWEATY.COM. NS NS2.PARKED.COM.
SWEATYANDREADY.COM. NS NS63.ANISH.COM.
SWEATYANDREADY.COM. NS NS64.ANISH.COM.
SWEATYBANDS.COM. NS NS03.ANISH.COM.
SWEATYBANDS.COM. NS NS04.ANISH.COM.
SWEATYBETTY.COM. NS NS67.ANISH.COM.
SWEATYBETTY.COM. NS NS68.ANISH.COM.
SWEATYDANCER.COM. NS NS13.ANISH.COM.
SWEATYDANCER.COM. NS NS14.ANISH.COM.
For these examples your command won't work, right?
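A possible tweak (a sketch, not posted in the thread): widen the anchor to any letter or digit and, on the full zone file, also exclude the bare COM. records:
Code:
awk '/^[[:alnum:]]/ && $1!="COM." && !_[$1]++{print $1; tot++}END{print "\nTotal",tot,"Domains"}' file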
 
  

