Old 04-21-2012, 05:01 AM   #1
mannoj87
LQ Newbie
 
Registered: Apr 2012
Posts: 9

Rep: Reputation: Disabled
CUT | SORT | UNIQ -D | Line number of original file?


Hi, I'm trying to work something out; I hope it isn't too challenging for others. I have just started to learn shell scripting. Below is my requirement.

I'm extracting the unique lines and the duplicate lines for the fields selected with cut, along with the number of times each duplicate occurred, grouped by those fields. The challenge: once I have sorted and separated the unique lines from the duplicates (with their counts), I also need the line number of each duplicate's first occurrence in the original file, tagged anywhere in the output line below. I'm not sure I have explained this well. Any thoughts on how to print the line number? I couldn't find a uniq option that brings it into the picture.

cut -d',' -f2,3,6,15 n_renewstatus_2012-04-19.txt |sort | uniq -d -c | head -3

2 GAFD,919702214713,SUCCESS,20120419
2 GAFD,919795928292,SUCCESS,20120419
2 GAJD,919553311089,SUCCESS,20120419

---> I need the line number of the first occurrence of each duplicate, anywhere on its line. uniq -d doesn't give me a way to bring it into the picture.
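
One way to carry the line number through the pipeline (an untested sketch, not from the thread): prepend the line number as its own field so it survives cut/sort/uniq, then tell sort and uniq to work on the key only.
Code:
# carry the original line number along as a leading field
awk -F',' '{print NR, $2","$3","$6","$15}' n_renewstatus_2012-04-19.txt |
    sort -k2,2 -k1,1n | uniq -f1 -d -c | head -3
# uniq -f1 skips the line-number field when comparing; the line it prints for
# each group is the first occurrence, line number included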
 
Old 04-21-2012, 05:11 AM   #2
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,562

Rep: Reputation: 1939
I would have to say I am not clear on what you want. Maybe you could provide sample data to explain each step, and the output along the way to the final output?
 
Old 04-21-2012, 05:47 AM   #3
mannoj87
LQ Newbie
 
Registered: Apr 2012
Posts: 9

Original Poster
Rep: Reputation: Disabled

The file I am given is in the format below.

GGRPW,33332211,kr,P,SUCCESS,systemrenewal,REN,RAMS,SAA,0080527763,on:X,10.0,N,20120419,migr
GBRPW,1232221,uw,P,SUCCESS,systemrenewal,REN,RAMS,ASD,20075578623,on:X,1.0,N,20120419,migr
GLSH,21122111,uw,P,SUCCESS,systemrenewal,REN,RAMS,ASA,0264993503,on:X,10.0,N,20120419,migr


I need to take out the duplicates, with a count for each duplicate group (grouped by fields 1, 2, 5, and 14). Then I insert into a database the entire record of each group's first occurrence, with the duplicate count tagged in another column. For this I cut the four fields mentioned, sort, find the duplicates with uniq -d, and get the counts with -c. After all that sorting out of duplicates and their counts, I need the output in the form below.


3,GLSH,21122111,uw,P,SUCCESS,systemrenewal,REN,RAMS,ASA,0264993503,on:X,10.0,N,20120419,migr
Here 3 is the number of duplicates for fields 1, 2, 5, and 14; the rest of the fields can come from any of the duplicate rows.

This way the duplicates are removed from the original file and shown in the above format.
The remaining lines in the original file are the unique ones; they go out as they are...

----

As for what I have done so far... let me not add to the confusion. This needs a different point of view, and my brain is clinging to my own approach. I need a cigar.
Any thoughts?
 
Old 04-21-2012, 06:26 AM   #4
jschiwal
Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 654
I don't think the count of duplicates will help you.

Consider this example:
Code:
cat -n test | sort -k2 | uniq -f1 -d
     1  a1210
    24  a1213
cat -n test | sort -k2 | uniq -f1 -D
     1  a1210
    23  a1210
    25  a1210
    24  a1213
     4  a1213
Using "cat -n" adds line numbers to the file.
Oops, I just noticed that field 1 is out of order.
Code:
cat -n test | sort -k2 -k1n | uniq -f1 -d
     1  a1210
     4  a1213
cat -n test | sort -k2 -k1n | uniq -f1 -D
     1  a1210
    23  a1210
    25  a1210
     4  a1213
    24  a1213
That's better.

Code:
while read linenum pattern ; do echo -en "$linenum " && grep "$pattern" firstdupes ; done <dupes
23      1       a1210
25      1       a1210
24      4       a1213
I redirected the first list to "firstdupes" and the second list to "alldupes".
Code:
grep -v -f firstdupes alldupes
    23  a1210
    25  a1210
    24  a1213
Let's redirect that to "dupes"

Code:
cat -n test | sort -k2 -k1n | uniq -f1 -d >firstdupes
cat -n test | sort -k2 -k1n | uniq -f1 -D >alldupes
grep -v -f firstdupes alldupes >dupes
Now let's produce a list containing the line number of each duplicate line, along with the line number of its first occurrence (on the same line entry):
Code:
while read linenum pattern ; do echo -en "$linenum\t" && grep "$pattern" firstdupes | cut -f1 ; done <dupes
23      1
25      1
24      4
The first field contains the line number of the duplicate line; the second field contains the line number of the original. Now, for each line number in field 1, tag that line with the value of field 2. You could do this in a for loop, or perhaps generate a sed or awk script from these values.
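
A sketch of that last step (the file names "linemap" and "tagdupes.sed" are illustrative; "linemap" holds the two-column list above): generate a sed script that tags each duplicate line with the line number of its first occurrence.
Code:
# read "dupline origline" pairs and emit one sed command per duplicate line
while read dupline origline ; do
    echo "${dupline}s/\$/ (first at line ${origline})/"
done <linemap >tagdupes.sed
sed -f tagdupes.sed test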

Last edited by jschiwal; 04-21-2012 at 06:29 AM.
 
Old 04-21-2012, 06:49 AM   #5
jschiwal
Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 654
Sorry; while I was preparing my response, your post above explained that you want to remove the duplicates, not tag them.

If you take the first field from "dupes", you have the line numbers to remove. You can remove them with a sed or awk script. This will generate the sed script to delete those lines:

Code:
grep -v -f firstdupes alldupes | cut -f1 | sed 's# *\([[:digit:]]*\)#\1d#'
23d
25d
24d
Code:
grep -v -f firstdupes alldupes | cut -f1 | sed 's# *\([[:digit:]]*\)#\1d#' >removedupes.sed
sed -f removedupes.sed test
Then you could create a separate script where, instead of deleting, each of those lines is printed:
Code:
grep -v -f firstdupes alldupes | cut -f1 | sed 's# *\([[:digit:]]*\)#\1p#' >savedupes.sed
sed -n -f savedupes.sed test
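The same selection also works without a generated script, if you prefer (an untested sketch): feed the line numbers straight to awk.
Code:
# first pass (stdin, "-"): remember the line numbers to keep
# second pass (test): print only the lines with those numbers
grep -v -f firstdupes alldupes | cut -f1 |
    awk 'NR==FNR{del[$1];next} FNR in del' - test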
---
If you have a problem explaining the problem, you will have an even harder time finding a solution. Try to define the problem (to yourself) so it is crystal clear. After that, the solution will be easier to find.
---
Your posted sample is small and doesn't have duplicates. Maybe this one would be better:
Code:
GGRPW,33332211,kr,P,SUCCESS,systemrenewal,REN,RAMS,SAA,0080527763,on:X,10.0,N,20120419,migr
GBRPW,1232221,uw,P,SUCCESS,systemrenewal,REN,RAMS,ASD,20075578623,on:X,1.0,N,20120419,migr
GLSH,21122111,uw,P,SUCCESS,systemrenewal,REN,RAMS,ASA,0264993503,on:X,10.0,N,20120419,migr
GGRPW,33332211,kr,P,SUCCESS,systemrenewal,REN,RAMS,SAA,0080527763,on:X,10.0,N,20120419,migr
GBRPW,1232221,uw,P,SUCCESS,systemrenewal,REN,RAMS,ASD,20075578623,on:X,1.0,N,20120419,migr
GGRPW,33332211,kr,P,SUCCESS,systemrenewal,REN,RAMS,SAA,0080527763,on:X,10.0,N,20120419,migr
GBRPW,1232221,uw,P,SUCCESS,systemrenewal,REN,RAMS,ASD,20075578623,on:X,1.0,N,20120419,migr
GBRPW,1232221,uw,P,SUCCESS,systemrenewal,REN,RAMS,ASD,20075578623,on:X,1.0,N,20120419,migr
GBRPW,1232221,uw,P,SUCCESS,systemrenewal,REN,RAMS,ASD,20075578623,on:X,1.0,N,20120419,migr
GLSH,21122111,uw,P,SUCCESS,systemrenewal,REN,RAMS,ASA,0264993503,on:X,10.0,N,20120419,migr
GLSH,21122111,uw,P,SUCCESS,systemrenewal,REN,RAMS,ASA,0264993503,on:X,10.0,N,20120419,migr

Last edited by jschiwal; 04-21-2012 at 07:15 AM.
 
Old 04-21-2012, 08:30 AM   #6
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,562

Rep: Reputation: 1939
Well ... it seems I still have my thick ears on, but I am hoping I am partially getting it.

1. Is it necessary to count any of this? (as mentioned by jschiwal)

2. Is the idea to create a file that has only the unique first entries, based on the fields you mentioned?

If the answers are no and yes, how about (untested):
Code:
awk -F, '!(($1 SUBSEP $2 SUBSEP $5 SUBSEP $14) in uniq){uniq[$1,$2,$5,$14];print}' file
You can then redirect and rename the file and so on as you wish. To give you an idea, if we use jschiwal's example the output is:
Code:
GGRPW,33332211,kr,P,SUCCESS,systemrenewal,REN,RAMS,SAA,0080527763,on:X,10.0,N,20120419,migr
GBRPW,1232221,uw,P,SUCCESS,systemrenewal,REN,RAMS,ASD,20075578623,on:X,1.0,N,20120419,migr
GLSH,21122111,uw,P,SUCCESS,systemrenewal,REN,RAMS,ASA,0264993503,on:X,10.0,N,20120419,migr
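
If it does what you want, the output can simply be redirected (the output file name here is illustrative):
Code:
awk -F, '!(($1 SUBSEP $2 SUBSEP $5 SUBSEP $14) in uniq){uniq[$1,$2,$5,$14];print}' file > file.dedup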
 
Old 04-21-2012, 10:49 AM   #7
mannoj87
LQ Newbie
 
Registered: Apr 2012
Posts: 9

Original Poster
Rep: Reputation: Disabled
Sorry, mates, for the confusion. Let me make it clearer this time.

Below is the raw file which is given to me.

I have to cut out fields 1, 2, 5, and 14 and check whether that combination is duplicated in any other rows.

GLUTW,91672651,tn,P,SUCCESS,systemrenewal,REN,s,ss,7311598203,cX,10.0,N,20120419,migr
GLUTW,91672651,tn,P,SUCCESS,systemrenewal,REN,s,ss,7311598203,cX,10.0,N,20120419,migr
GLUTW,91671651,tn,P,SUCCESS,systemrnewal,REN,s,ss,7311598203,cX,10.0,N,20120419,migr
GLUTW,91671651,tn,P,SUCCESS,systemrnewal,REN,s,ss,7311598203,cX,10.0,N,20120419,migr
GLUTW,91671651,tn,P,SUCCESS,systemrnewal,REN,s,ss,7311598203,cX,10.0,N,20120419,migr
GLUTW,916716511,tn,P,SUCCESS,systemrnewal,REN,s,ss,7311598203,cX,10.0,N,20120419,migr

If it is, then take the count of duplicate rows and keep the first occurrence of the duplicated row (all fields) with the count value added at the end. E.g. it should return the following for the above raw file:

GLUTW,91672651,tn,P,SUCCESS,systemrenewal,REN,s,ss,7311598203,cX,10.0,N,20120419,migr (2)
GLUTW,91671651,tn,P,SUCCESS,systemrnewal,REN,s,ss,7311598203,cX,10.0,N,20120419,migr (3)


And put the unique ones in a different file:

GLUTW,916716511,tn,P,SUCCESS,systemrnewal,REN,s,ss,7311598203,cX,10.0,N,20120419,migr (1)

Thanks for your replies!
 
Old 04-21-2012, 11:19 AM   #8
pan64
Senior Member
 
Registered: Mar 2012
Location: Hungary
Distribution: debian i686 (solaris)
Posts: 4,928

Rep: Reputation: 1305
I would rather solve this in perl or awk. You can read the lines, then parse and sort them however you want. A comment: we do not need to sort the lines (in perl) to find duplicates and count occurrences; we just need to read the lines and use a counter. Finally, you can select lines by counter or sort by any key. I also do not know whether the first occurrence is important, but that is just one additional variable, so one additional line in the script (storing the line numbers is similar).
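
The counter idea sketched in awk (pan64 suggests perl as well; one pass, no sorting, and the first-occurrence line number is the extra variable he mentions):
Code:
awk -F',' '{ k = $1 FS $2 FS $5 FS $14     # build the key from f1,f2,f5,f14
             if (!(k in n)) first[k] = NR  # remember first-occurrence line number
             n[k]++ }                      # count every occurrence
       END { for (k in n) print first[k], n[k], k }' file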

Last edited by pan64; 04-21-2012 at 11:20 AM. Reason: extended
 
Old 04-21-2012, 12:42 PM   #9
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,562

Rep: Reputation: 1939
So then something like:
Code:
awk -F, '!(($1 SUBSEP $2 SUBSEP $5 SUBSEP $14) in uniq){uniq[$1,$2,$5,$14]=$0}{count[$1,$2,$5,$14]++}END{for(i in count){if(count[i] > 1)file="dupes";else file="uniq";print uniq[i],"("count[i]")" > file}}' orig_file
This will create two files, dupes and uniq.
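
The same script laid out readably (identical logic, just formatted):
Code:
awk -F, '
    # remember the entire first line seen for each f1,f2,f5,f14 key
    !(($1 SUBSEP $2 SUBSEP $5 SUBSEP $14) in uniq) { uniq[$1,$2,$5,$14] = $0 }
    # count every occurrence of the key
    { count[$1,$2,$5,$14]++ }
    END {
        for (i in count) {
            if (count[i] > 1) file = "dupes"; else file = "uniq"
            print uniq[i], "(" count[i] ")" > file
        }
    }' orig_file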
 
1 member found this post helpful.
Old 04-22-2012, 01:53 AM   #10
mannoj87
LQ Newbie
 
Registered: Apr 2012
Posts: 9

Original Poster
Rep: Reputation: Disabled
Genius!! Such powerful syntax, and it cut the run time down to practically nothing...

OUTPUT :

Instead of the parentheses I appended the count after a comma, and then checked the breakdown below. It EXACTLY MATCHED the total number of rows in the original file.

awk -F, '!(($1 SUBSEP $2 SUBSEP $5 SUBSEP $14) in uniq){uniq[$1,$2,$5,$14]=$0}{count[$1,$2,$5,$14]++}END{for(i in count){if(count[i] > 1)file="dupes";else file="uniq";print uniq[i],","count[i] > file}}' renewstatus_2012-04-19.txt


sym@localhost:~$ cut -f16 -d',' uniq | sort | uniq -d -c
124275 1 -----> NUMBER OF UNIQUE (count 1) ENTRIES

sym@localhost:~$ cut -f16 -d',' dupes | sort | uniq -d -c
3860 2
850 3
71 4
7 5
3 6
sym@localhost:~$ cut -f16 -d',' dupes | sort | uniq -u -c
1 7
------------------
10614 ------> SUM OF DUPLICATE ENTRIES MULTIPLIED BY THEIR COUNTS


sym@localhost:~$ wc -l renewstatus_2012-04-19.txt
134889 renewstatus_2012-04-19.txt ---> TOTAL LINE COUNT OF THE ORIGINAL FILE; MATCHES (124275 + 10614) = 134889 EXACTLY
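
The cross-check can also be collapsed into one command (a sketch; it assumes the appended count landed in comma-field 16, as in the output above):
Code:
# sum the per-line counts across both output files; the total should equal
# the line count of the original file (134889 here)
awk -F',' '{s += $16} END {print s}' uniq dupes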

Brilliant work with awk... Thanks a lot, grail!!! You were a delight. I will practice awk more!!
 
Old 04-22-2012, 03:12 AM   #11
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,562

Rep: Reputation: 1939
Glad it helped ... here is a reference for you to learn some more - http://www.gnu.org/software/gawk/man...ode/index.html
 
Old 04-22-2012, 04:10 AM   #12
mannoj87
LQ Newbie
 
Registered: Apr 2012
Posts: 9

Original Poster
Rep: Reputation: Disabled
Nice, thanks for the link too.
 
Old 04-22-2012, 07:55 AM   #13
jschiwal
Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 654
Nice one, grail.
 
Old 04-22-2012, 08:54 AM   #14
allend
Senior Member
 
Registered: Oct 2003
Location: Melbourne
Distribution: Slackware-current
Posts: 3,464

Rep: Reputation: 852
+1 to that. It is beautiful to watch an awk expert at work. :-)
 
  

