LinuxQuestions.org - Unique lines based on specific fields.

bash: Unique lines based on specific fields.

I was wondering if anyone knows a simpler way to keep only the lines in a file that are unique on particular (non-adjacent) fields. Consider this sample file:

Code:

three apple 1

two banana 2

one pear 3

zero pineapple 10

one orange 5

one lime 3

two lemon 7

four grape 5

Say I want to keep only the records whose combination of fields 1 and 3 is unique. I know I can do:

Code:

$ awk '{print $1,$3,$2}' temp.txt | sort -k 1,2 -u | awk '{print $1,$3,$2}'

four grape 5

one pear 3

one orange 5

three apple 1

two banana 2

two lemon 7

zero pineapple 10

However, I was wondering if there was a more compact way to do this. Juggling the fields with awk and then juggling them back can be somewhat challenging, especially when there are a lot of fields in a record.

Any thoughts?

Code:

$ sort -k 1 -k 3 -u myfile

keeps only entries where the combination of fields 1 and 3 is unique, not each field by itself. This is what you want, right?

Otherwise, try

Code:

$ sort -k 1 -u myfile | sort -k 3 -u

hth --Jonas

Quote:

Originally posted by jonaskoelker

Code:

$ sort -k 1 -k 3 -u myfile

keeps only entries where the combination of fields 1 and 3 is unique, not each field by itself. This is what you want, right?

Otherwise, try

Code:

$ sort -k 1 -u myfile | sort -k 3 -u

hth --Jonas

Well, you understand what I want to do. However, neither of the solutions you proposed seems to work.

Code:

$ sort -k 1 -k 3 -u temp.txt



four grape 5

one lime 3

one orange 5

one pear 3

three apple 1

two banana 2

two lemon 7

zero pineapple 10



$ sort -k 1 -u temp.txt | sort -k 3 -u



three apple 1

zero pineapple 10

two banana 2

one lime 3

four grape 5

two lemon 7

The output in the first case contains a duplicate ("one lime 3" and "one pear 3").
The output in the second case eliminated "one orange 5", which was unique.
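
For what it's worth, here is a small sketch of what the second pipeline appears to do on this sample. It assumes a POSIX-style `sort`; the `LC_ALL=C` setting and the variable names are my own additions, not something from the posts above. It counts the lines that survive each stage: the first stage drops nothing here, and the second keeps one line per distinct value of field 3, which would explain why only one of the two records ending in 5 is left.

```shell
#!/bin/sh
# Sample records from the first post.
data='three apple 1
two banana 2
one pear 3
zero pineapple 10
one orange 5
one lime 3
two lemon 7
four grape 5'

# Stage 1: the key starts at field 1 and has no end position.
stage1=$(printf '%s\n' "$data" | LC_ALL=C sort -k 1 -u)

# Stage 2: the key starts at field 3 (the last field in this sample).
stage2=$(printf '%s\n' "$stage1" | LC_ALL=C sort -k 3 -u)

# Count the surviving lines after each stage.
printf 'after stage 1: %s lines\n' "$(printf '%s\n' "$stage1" | grep -c '')"   # 8
printf 'after stage 2: %s lines\n' "$(printf '%s\n' "$stage2" | grep -c '')"   # 6

# Only one record whose third field is 5 is left after stage 2.
printf '%s\n' "$stage2" | awk '$3 == 5'
```

Which of the two "5" records survives depends on the order in which stage 1 hands them over, so the sketch does not rely on that.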

I kept scratching my head because I thought the first solution ought to work. Then I tried:

Code:

$ sort -k 1,1 -k 3 -u temp.txt



four grape 5

one lime 3

one orange 5

one pear 3

three apple 1

two banana 2

two lemon 7

zero pineapple 10

And it gave me the result I was looking for. After studying the man page, I think it is because if you only specify one argument for the key, it sorts from that field to the last field. So in essence, the sort was by f1,f2,f3,f3. All the lines were considered unique because all the fields were included.
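
The difference between an open-ended key and a bounded key can be checked with a short sketch like the one below. It assumes a POSIX-style `sort`; `LC_ALL=C` and the variable names are my own additions for illustration. I wrote the second key as `-k 3,3` to make the end position explicit; on this sample it should behave the same as `-k 3`, since field 3 is the last field.

```shell
#!/bin/sh
# Sample records from the first post.
data='three apple 1
two banana 2
one pear 3
zero pineapple 10
one orange 5
one lime 3
two lemon 7
four grape 5'

# "-k 1" has no end position, so the key runs from field 1 to the end of the
# line. Every record differs somewhere, so -u removes nothing.
open_ended=$(printf '%s\n' "$data" | LC_ALL=C sort -k 1 -k 3 -u)

# "-k 1,1 -k 3,3" limits each key to a single field, so records are compared
# on fields 1 and 3 only and "one ... 3" is kept just once.
bounded=$(printf '%s\n' "$data" | LC_ALL=C sort -k 1,1 -k 3,3 -u)

printf 'open-ended keys: %s lines\n' "$(printf '%s\n' "$open_ended" | grep -c '')"   # 8
printf 'bounded keys:    %s lines\n' "$(printf '%s\n' "$bounded" | grep -c '')"      # 7
printf '%s\n' "$bounded"
```

POSIX does not say which line of a set with equal keys is kept by `-u`, so the sketch only counts lines rather than assuming "one pear 3" wins over "one lime 3".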

Thanks, I knew there had to be an easier way!

EDIT: I accidentally posted the wrong output in the final solution. Corrected in my next post.

Quote:

I kept scratching my head because I thought the first solution ought to work...

Well, I missed a thing or two--but it got you in the right direction :-)

Congrats on getting it solved.

--Jonas

Quote:

And it gave me the result I was looking for.

I was going to post that solution yesterday, but it doesn't give you the result you said you wanted in your first post! You've got

Code:

one lime 3

one pear 3

in your result, which you said you didn't want, "one" and "3" being a duplicate.

Quote:

Originally posted by eddiebaby1023
I was going to post that solution yesterday, but it doesn't give you the result you said you wanted in your first post! You've got

Code:

one lime 3

one pear 3

in your result, which you said you didn't want, "one" and "3" being a duplicate.

My bad. I think I just copied the wrong output into my last post.

The actual output I get is:

Code:

$ sort -k 1,1 -k 3 -u temp.txt



four grape 5

one pear 3

one orange 5

three apple 1

two banana 2

two lemon 7

zero pineapple 10

Quote:

Originally posted by carl.waldbieser
After studying the man page, I think it is because if you only specify one argument for the key, it sorts from that field to the last field. So in essence, the sort was by f1,f2,f3,f3. All the lines were considered unique because all the fields were included.

There is no "I think" about it -- you are exactly right about "f1,f2,f3,f3".

BTW, it's not in the man page (perhaps in the <shudder /> info page), but you can fine-tune your keys down to the character position:

Code:

sort -k f.n,g.m

where f & g are field numbers and n & m are character positions within those fields.
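
As a rough illustration of the character-position syntax, the sketch below keeps one record per first letter of the fruit name (field 2, character 1). The `-t ' '` separator, `LC_ALL=C`, and the variable names are my own assumptions for the example. The explicit separator matters here: with the default blank-separated fields and no `b` modifier, the blanks in front of a field count as part of it, so position 1 of field 2 would be a space rather than a letter.

```shell
#!/bin/sh
# Sample records from the first post.
data='three apple 1
two banana 2
one pear 3
zero pineapple 10
one orange 5
one lime 3
two lemon 7
four grape 5'

# Key = field 2, character 1 through field 2, character 1.
# With -t ' ' a field does not include the separator before it,
# so 2.1 is the first letter of the fruit name.
by_initial=$(printf '%s\n' "$data" | LC_ALL=C sort -t ' ' -k 2.1,2.1 -u)

# One line per distinct initial: a, b, g, l, o, p.
printf '%s\n' "$by_initial"
```

Here pear/pineapple and lime/lemon collapse to one line each, leaving six records sorted by that single character.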