LinuxQuestions.org

carl.waldbieser 08-19-2005 07:46 PM

bash: Unique lines based on specific fields.
 
I was wondering if anyone knows a simpler way to sort the lines in a file based on particular (non-adjacent) fields. Consider this sample file:

Code:

three apple 1
two banana 2
one pear 3
zero pineapple 10
one orange 5
one lime 3
two lemon 7
four grape 5

Say I want entries such that fields 1 and 3 must be unique for a given record. I know I can do:
Code:

$ awk '{print $1,$3,$2}' temp.txt | sort -k 1,2 -u | awk '{print $1,$3,$2}'
four grape 5
one pear 3
one orange 5
three apple 1
two banana 2
two lemon 7
zero pineapple 10

However, I was wondering if there was a more compact way to do this. Juggling the fields with awk and then juggling them back can be somewhat challenging, especially when there are a lot of fields in a record.

Any thoughts?

jonaskoelker 08-20-2005 08:03 PM

Code:

$ sort -k 1 -k 3 -u myfile
keeps only entries where the combination of fields 1 and 3 is unique, rather than requiring each field to be unique on its own. This is what you want, right?

Otherwise, try
Code:

$ sort -k 1 -u myfile | sort -k 3 -u
hth --Jonas

carl.waldbieser 08-21-2005 01:27 AM

Quote:

Originally posted by jonaskoelker
Code:

$ sort -k 1 -k 3 -u myfile
keeps only entries where the combination of fields 1 and 3 is unique, rather than requiring each field to be unique on its own. This is what you want, right?

Otherwise, try
Code:

$ sort -k 1 -u myfile | sort -k 3 -u
hth --Jonas

Well, you understand what I want to do. However, neither of the solutions you proposed seems to work.

Code:

$ sort -k 1 -k 3 -u temp.txt

four grape 5
one lime 3
one orange 5
one pear 3
three apple 1
two banana 2
two lemon 7
zero pineapple 10

$ sort -k 1 -u temp.txt | sort -k 3 -u

three apple 1
zero pineapple 10
two banana 2
one lime 3
four grape 5
two lemon 7

The output in the first case contains a duplicate ("one lime 3" and "one pear 3").
The output in the second case eliminated "one orange 5", which was unique.

I kept scratching my head because I thought the first solution ought to work. Then I tried:
Code:

$ sort -k 1,1 -k 3 -u temp.txt

four grape 5
one lime 3
one orange 5
one pear 3
three apple 1
two banana 2
two lemon 7
zero pineapple 10

And it gave me the result I was looking for. After studying the man page, I think it is because if you specify only a starting field for a key, the key runs from that field to the end of the line. So in essence, the first sort was by f1,f2,f3,f3. All the lines were considered unique because all the fields were included.

Thanks, I knew there had to be an easier way!

EDIT: I accidentally posted the wrong output in the final solution. Corrected in my next post.
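
A minimal sketch for checking the key-span explanation above. It assumes the sample data from the first post is recreated in a temporary file (the `sample`, `open_key` and `bounded_key` names and the use of `mktemp` are illustrative, not from the thread) and sets `LC_ALL=C` so the comparison does not depend on the locale. It also writes the second key as `-k 3,3` instead of `-k 3`; with this data field 3 is the last field, so both forms should behave the same, but the bounded form is safer when records have more fields.

```shell
# Recreate the sample data from the first post in a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
three apple 1
two banana 2
one pear 3
zero pineapple 10
one orange 5
one lime 3
two lemon 7
four grape 5
EOF

# -k 1 has no end field, so the key runs from field 1 to the end of the
# line. Every line then has a distinct key and -u removes nothing.
open_key=$(LC_ALL=C sort -k 1 -k 3 -u "$sample")

# -k 1,1 and -k 3,3 each stop at the end of a single field, so lines that
# agree on fields 1 and 3 compare equal and -u keeps only one of them.
bounded_key=$(LC_ALL=C sort -k 1,1 -k 3,3 -u "$sample")

# Report how many lines survive in each case.
printf 'open-ended key: %s lines\n' "$(printf '%s\n' "$open_key" | wc -l | tr -d ' ')"
printf 'bounded keys:   %s lines\n' "$(printf '%s\n' "$bounded_key" | wc -l | tr -d ' ')"

rm -f "$sample"
```

The open-ended form should report all 8 lines and the bounded form 7, because only one of the two "one ... 3" lines is kept.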

jonaskoelker 08-21-2005 05:07 AM

Quote:

I kept scratching my head because I thought the first solution ought to work...
Well, I missed a thing or two--but it got you in the right direction :-)

Congrats on getting it solved.

--Jonas

eddiebaby1023 08-21-2005 09:02 AM

Quote:

And it gave me the result I was looking for.
I was going to post that solution yesterday, but it doesn't give you the result you said you wanted in your first post! You've got
Code:

one lime 3
one pear 3

in your result, which you said you didn't want, "one" and "3" being a duplicate.

carl.waldbieser 08-21-2005 10:31 AM

Quote:

Originally posted by eddiebaby1023
I was going to post that solution yesterday, but it doesn't give you the result you said you wanted in your first post! You've got
Code:

one lime 3
one pear 3

in your result, which you said you didn't want, "one" and "3" being a duplicate.

My bad. I think I just copied the wrong output into my last post.

The actual output I get is:
Code:

$ sort -k 1,1 -k 3 -u temp.txt

four grape 5
one pear 3
one orange 5
three apple 1
two banana 2
two lemon 7
zero pineapple 10
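
If the result does not need to be sorted, a common awk idiom can do the same de-duplication in a single pass while keeping the original line order. This is an alternative that nobody proposed in the thread, offered only as a sketch; the `seen` array, the `deduped` variable and the temporary `sample` file are illustrative names.

```shell
# Recreate the sample data from the first post in a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
three apple 1
two banana 2
one pear 3
zero pineapple 10
one orange 5
one lime 3
two lemon 7
four grape 5
EOF

# seen[$1,$3] counts how many times this (field 1, field 3) pair has
# appeared so far. The pattern !seen[...]++ is true only on the first
# occurrence, so awk prints that line and skips any later line with the
# same pair.
deduped=$(awk '!seen[$1,$3]++' "$sample")
printf '%s\n' "$deduped"

rm -f "$sample"
```

With the sample data this drops only "one lime 3", since "one pear 3" appears earlier with the same values in fields 1 and 3. Other fields never need to be rearranged, which helps when a record has many of them.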


archtoad6 08-21-2005 02:26 PM

Quote:

Originally posted by carl.waldbieser
After studying the man page, I think it is because if you specify only a starting field for a key, the key runs from that field to the end of the line. So in essence, the first sort was by f1,f2,f3,f3. All the lines were considered unique because all the fields were included.

There is no "I think" about it -- you are exactly right about "f1,f2,f3,f3".

BTW, it's not in the man page (perhaps in the <shudder /> info page), but you can fine-tune your keys down to the character position:
Code:

sort -k f.n,g.m
where f & g are field numbers and n & m are character positions within those fields.
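
As a small sketch of the `f.n,g.m` form, the following sorts the sample data from the first post on the second character of field 1 only, that is with f = g = 1 and n = m = 2. The temporary `sample` file and the `by_second_char` variable are illustrative names, and `LC_ALL=C` is set so the ordering does not depend on the locale.

```shell
# Recreate the sample data from the first post in a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
three apple 1
two banana 2
one pear 3
zero pineapple 10
one orange 5
one lime 3
two lemon 7
four grape 5
EOF

# -k 1.2,1.2: the key starts and ends at character 2 of field 1, so only
# the second letter of the first word is compared. Lines whose keys tie
# fall back to a comparison of the whole line.
by_second_char=$(LC_ALL=C sort -k 1.2,1.2 "$sample")
printf '%s\n' "$by_second_char"

rm -f "$sample"
```

Field 1 is used here on purpose. Unless `-b` is given, a later field includes the blanks that precede it, so character positions in fields 2 and up are counted from the separator and not from the first visible character.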

