LinuxQuestions.org (/questions/)
-   Linux - Newbie (https://www.linuxquestions.org/questions/linux-newbie-8/)
-   -   CUT | SORT | UNIQ -D | Line number of original file? (https://www.linuxquestions.org/questions/linux-newbie-8/cut-%7C-sort-%7C-uniq-d-%7C-line-number-of-original-file-940985/)

mannoj87 04-21-2012 05:01 AM

CUT | SORT | UNIQ -D | Line number of original file?
 
Hi, I'm trying to work this out somehow.. hope it's not too challenging for others.. I have just started to learn shell.
Below is my requirement.

I'm pulling out the unique lines and the duplicate lines for the fields selected with cut, together with the number of times each duplicate occurs, grouped by those fields. The challenge is that once I have sorted and separated the uniques from the duplicates (with their counts), I also need the line number of the first occurrence in the original file, tagged anywhere in the output line. Not sure if I have explained what I need properly.. any thoughts on how to print the line number? uniq's ignore-chars option doesn't help here, it only controls which characters get compared.

cut -d',' -f2,3,6,15 n_renewstatus_2012-04-19.txt | sort | uniq -d -c | head -3

2 GAFD,919702214713,SUCCESS,20120419
2 GAFD,919795928292,SUCCESS,20120419
2 GAJD,919553311089,SUCCESS,20120419

---> I need the line number of the first occurrence of each duplicate, tagged anywhere on its line. uniq -d doesn't offer any way to bring that into the picture.

grail 04-21-2012 05:11 AM

I would have to say I am not clear on what you want :( Maybe you could provide sample data to explain each step and the output along the way to the final output?

mannoj87 04-21-2012 05:47 AM

The given file is in the below format.

GGRPW,33332211,kr,P,SUCCESS,systemrenewal,REN,RAMS,SAA,0080527763,on:X,10.0,N,20120419,migr
GBRPW,1232221,uw,P,SUCCESS,systemrenewal,REN,RAMS,ASD,20075578623,on:X,1.0,N,20120419,migr
GLSH,21122111,uw,P,SUCCESS,systemrenewal,REN,RAMS,ASA,0264993503,on:X,10.0,N,20120419,migr


I need to take out the duplicates and count them (each set of duplicates categorized by fields 1, 2, 5 and 14). Then I insert into a database the entire record of the first duplicate occurrence, with the duplicate count tagged on in another column. For this I cut the four fields mentioned, sort, and find the dups using uniq -d (with -c for the counts). After all that sorting out of the dups and their counts, I need the output in the form below.


3,GLSH,21122111,uw,P,SUCCESS,systemrenewal,REN,RAMS,ASA,0264993503,on:X,10.0,N,20120419,migr
where 3 is the number of repeated dups for fields 1, 2, 5 and 14, and the rest of the fields can come from any of the duplicate rows.

This way the dups are removed from the original file and show up in the above format.
The remaining lines in the original file are the unique ones; they go through as they are...

----

As for what I have done so far.. let me not add to the confusion.. this needs a different point of view and my brain is clinging to my own approach.. need a cigar..
Any thoughts...??

jschiwal 04-21-2012 06:26 AM

I don't think the count of duplicates will help you.

consider this example:
Code:

cat -n test | sort -k2 | uniq -f1 -d
    1  a1210
    24  a1213
cat -n test | sort -k2 | uniq -f1 -D
    1  a1210
    23  a1210
    25  a1210
    24  a1213
    4  a1213

Using "cat -n" adds line numbers to the file.
Oops, I just noticed that field 1 (the line number) is out of order.
Code:

cat -n test | sort -k2 -k1n | uniq -f1 -d
    1  a1210
    4  a1213
cat -n test | sort -k2 -k1n | uniq -f1 -D
    1  a1210
    23  a1210
    25  a1210
    4  a1213
    24  a1213

That's better.

Code:

while read linenum pattern ; do echo -en "$linenum " && grep  $pattern firstdupes ; done <dupes
23      1      a1210
25      1      a1210
24      4      a1213

I redirected the first list above (uniq -d) to "firstdupes" and the second list (uniq -D) to "alldupes".
Code:

grep -v -f firstdupes alldupes
    23  a1210
    25  a1210
    24  a1213

Let's redirect that to "dupes"

Code:

cat -n test | sort -k2 -k1n | uniq -f1 -d >firstdupes
cat -n test | sort -k2 -k1n | uniq -f1 -D >alldupes
grep -v -f firstdupes alldupes >dupes

Now let's produce a list containing the line numbers for duplicate lines, along with the line number of the first occurrence (on the same line entry)
Code:

while read linenum pattern ; do echo -en "$linenum\t" && grep  $pattern firstdupes | cut -d1-2; done <dupes
23      1
25      1
24      4

The first field contains the line number of the duplicate line, and the second field contains the line number of the original. Now, for each line number in field 1, tag the corresponding line with the value of field 2. You could do this in a for loop, or perhaps generate a sed or awk script from these values.
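
For example, something along these lines could turn that list into a sed script that tags each duplicate line with the line number of its first occurrence (an untested sketch; "linemap" and "tagdupes.sed" are just made-up names for the saved loop output above and the generated script):
Code:

while read dupnum orignum ; do echo "${dupnum}s/\$/ (first: ${orignum})/" ; done <linemap >tagdupes.sed
sed -f tagdupes.sed test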

jschiwal 04-21-2012 06:49 AM

Sorry, as I was preparing my response, your post above explained that you want to remove the duplicates, not tag them.

If you take the first field from the "dupes", you have the line numbers to remove. You can remove them with a sed or awk script. This will generate the sed script to delete those lines:

Code:

grep -v -f firstdupes alldupes | cut -f1 | sed 's# *\([[:digit:]]*\)#\1d#'
23d
25d
24d

Code:

grep -v -f firstdupes alldupes | cut -f1 | sed 's# *\([[:digit:]]*\)#\1d#' >removedupes.sed
sed -f removedupes.sed test

Then you could create a separate script where instead of deleting, that line is printed.
Code:

grep -v -f firstdupes alldupes | cut -f1 | sed 's# *\([[:digit:]]*\)#\1p#' >savedupes.sed
sed -n -f savedupes.sed test

---
If you have a problem explaining the problem, you will have an even harder time finding a solution. Try to define the problem (to yourself) so it is crystal clear. After that, the solution will be easier to find.
---
Your posted sample is small, and doesn't have duplicates. Maybe this one would be better:
Code:

GGRPW,33332211,kr,P,SUCCESS,systemrenewal,REN,RAMS,SAA,0080527763,on:X,10.0,N,20120419,migr
GBRPW,1232221,uw,P,SUCCESS,systemrenewal,REN,RAMS,ASD,20075578623,on:X,1.0,N,20120419,migr
GLSH,21122111,uw,P,SUCCESS,systemrenewal,REN,RAMS,ASA,0264993503,on:X,10.0,N,20120419,migr
GGRPW,33332211,kr,P,SUCCESS,systemrenewal,REN,RAMS,SAA,0080527763,on:X,10.0,N,20120419,migr
GBRPW,1232221,uw,P,SUCCESS,systemrenewal,REN,RAMS,ASD,20075578623,on:X,1.0,N,20120419,migr
GGRPW,33332211,kr,P,SUCCESS,systemrenewal,REN,RAMS,SAA,0080527763,on:X,10.0,N,20120419,migr
GBRPW,1232221,uw,P,SUCCESS,systemrenewal,REN,RAMS,ASD,20075578623,on:X,1.0,N,20120419,migr
GBRPW,1232221,uw,P,SUCCESS,systemrenewal,REN,RAMS,ASD,20075578623,on:X,1.0,N,20120419,migr
GBRPW,1232221,uw,P,SUCCESS,systemrenewal,REN,RAMS,ASD,20075578623,on:X,1.0,N,20120419,migr
GLSH,21122111,uw,P,SUCCESS,systemrenewal,REN,RAMS,ASA,0264993503,on:X,10.0,N,20120419,migr
GLSH,21122111,uw,P,SUCCESS,systemrenewal,REN,RAMS,ASA,0264993503,on:X,10.0,N,20120419,migr


grail 04-21-2012 08:30 AM

Well ... I still appear to have my thick ears on :(, but I am hoping I am at least partially getting it.

1. Is it necessary to count any of this? (as mentioned by jschiwal)

2. Is the idea to create a file that has only uniq first entries based on the fields you mentioned?

If the answers are no and yes, how about (untested):
Code:

awk -F, '!(($1 SUBSEP $2 SUBSEP $5 SUBSEP $14) in uniq){uniq[$1,$2,$5,$14];print}' file
You can then redirect and rename the file and so on as you wish. To give you an idea, if we use jschiwal's example the output is:
Code:

GGRPW,33332211,kr,P,SUCCESS,systemrenewal,REN,RAMS,SAA,0080527763,on:X,10.0,N,20120419,migr
GBRPW,1232221,uw,P,SUCCESS,systemrenewal,REN,RAMS,ASD,20075578623,on:X,1.0,N,20120419,migr
GLSH,21122111,uw,P,SUCCESS,systemrenewal,REN,RAMS,ASA,0264993503,on:X,10.0,N,20120419,migr


mannoj87 04-21-2012 10:49 AM

Sorry mates for the confusion.. let me make it clearer this time..

Below is the raw file which is given to me.

I have to cut fields 1, 2, 5 and 14 and look for duplicates only on those fields, i.e. rows whose values in those fields match any other row.

GLUTW,91672651,tn,P,SUCCESS,systemrenewal,REN,s,ss,7311598203,cX,10.0,N,20120419,migr
GLUTW,91672651,tn,P,SUCCESS,systemrenewal,REN,s,ss,7311598203,cX,10.0,N,20120419,migr
GLUTW,91671651,tn,P,SUCCESS,systemrnewal,REN,s,ss,7311598203,cX,10.0,N,20120419,migr
GLUTW,91671651,tn,P,SUCCESS,systemrnewal,REN,s,ss,7311598203,cX,10.0,N,20120419,migr
GLUTW,91671651,tn,P,SUCCESS,systemrnewal,REN,s,ss,7311598203,cX,10.0,N,20120419,migr
GLUTW,916716511,tn,P,SUCCESS,systemrnewal,REN,s,ss,7311598203,cX,10.0,N,20120419,migr

If there are duplicates, then take the count of duplicate rows and keep the first occurrence of the duplicate row (all fields), with the count value added at the end. E.g. it should return the below for the above raw file.

GLUTW,91672651,tn,P,SUCCESS,systemrenewal,REN,s,ss,7311598203,cX,10.0,N,20120419,migr (2)
GLUTW,91671651,tn,P,SUCCESS,systemrnewal,REN,s,ss,7311598203,cX,10.0,N,20120419,migr (3)


And put the uniq ones in a different file:

GLUTW,916716511,tn,P,SUCCESS,systemrnewal,REN,s,ss,7311598203,cX,10.0,N,20120419,migr (1)

Thanks for your replies.. !!

pan64 04-21-2012 11:19 AM

I would rather solve this in perl or awk. You can read the lines, parse and sort them, whatever you want. One comment: we do not need to sort the lines (in perl) to find duplicates and count occurrences; we just need to read the lines and use a counter. Finally you can select lines by count or sort by any key. I also do not know whether the first occurrence is important, but that is just an additional variable, so one additional line in the script (storing the line numbers is similar).
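
As a rough illustration of that approach in awk (a sketch only; the field numbers and the "file" placeholder are taken from the posts above), you can count each key and remember the line number of its first occurrence without sorting at all:
Code:

awk -F, '{ key = $1 FS $2 FS $5 FS $14 }       # build the key from fields 1,2,5,14
    !(key in first) { first[key] = NR }        # line number of the first occurrence
    { count[key]++ }                           # count every occurrence of the key
    END { for (k in count) print first[k], k, count[k] }' file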

grail 04-21-2012 12:42 PM

So then something like:
Code:

awk -F, '!(($1 SUBSEP $2 SUBSEP $5 SUBSEP $14) in uniq){uniq[$1,$2,$5,$14]=$0}{count[$1,$2,$5,$14]++}END{for(i in count){if(count[i] > 1)file="dupes";else file="uniq";print uniq[i],"("count[i]")" > file}}' orig_file
This will create 2 files, dupes and uniq
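
The same logic spread over several lines for readability (a functionally equivalent sketch of the one-liner above, with comments):
Code:

awk -F, '
    # remember the entire first line seen for each key (fields 1,2,5,14)
    !(($1 SUBSEP $2 SUBSEP $5 SUBSEP $14) in uniq) { uniq[$1,$2,$5,$14] = $0 }
    # count every occurrence of that key
    { count[$1,$2,$5,$14]++ }
    # at the end, send each first line plus its count to "dupes" or "uniq"
    END {
        for (i in count) {
            if (count[i] > 1) file = "dupes"; else file = "uniq"
            print uniq[i], "(" count[i] ")" > file
        }
    }' orig_file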

mannoj87 04-22-2012 01:53 AM

Genius!! Such powerful syntax, it cut the runtime down to practically nothing...

OUTPUT :

Instead of the () I added a comma as the separator, and verified the breakdown below. It EXACTLY MATCHED the total number of ROWS in the original file.

awk -F, '!(($1 SUBSEP $2 SUBSEP $5 SUBSEP $14) in uniq){uniq[$1,$2,$5,$14]=$0}{count[$1,$2,$5,$14]++}END{for(i in count){if(count[i] > 1)file="dupes";else file="uniq";print uniq[i],","count[i] > file}}' renewstatus_2012-04-19.txt


sym@localhost:~$ cut -f16 -d',' uniq | sort | uniq -d -c
124275 1 -----> NUMBER OF UNIQ ENTRIES (count = 1)

sym@localhost:~$ cut -f16 -d',' dupes | sort | uniq -d -c
3860 2
850 3
71 4
7 5
3 6
sym@localhost:~$ cut -f16 -d',' dupes | sort | uniq -u -c
1 7
------------------
10614 ------> TOTAL LINES COVERED BY THE DUPLICATE GROUPS (NUMBER OF GROUPS MULTIPLIED BY THEIR COUNTS, SUMMED)
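(As a check: 3860*2 + 850*3 + 71*4 + 7*5 + 3*6 + 1*7 = 7720 + 2550 + 284 + 35 + 18 + 7 = 10614.)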


sym@localhost:~$ wc -l renewstatus_2012-04-19.txt
134889 renewstatus_2012-04-19.txt ---> TOTAL LINE COUNT OF THE ORIGINAL FILE, WHICH MATCHES (124275 + 10614) = 134889 EXACTLY

Brilliant play with awk... Thanks a lot Grail!!! You were a delight.. I will practice awk more..!!

grail 04-22-2012 03:12 AM

Glad it helped ... here is a reference for you to learn some more - http://www.gnu.org/software/gawk/man...ode/index.html

mannoj87 04-22-2012 04:10 AM

Nice, thanks for the link too.. ;)

jschiwal 04-22-2012 07:55 AM

Nice one Grail.

allend 04-22-2012 08:54 AM

+1 to that. It is beautiful to watch an awk expert at work. :-)

