Delete duplicates without using sort -u?

MikeyCarter · 10-22-2012, 12:07 PM

I have a case where running sort -u on lines containing some foreign characters removes both lines instead of just one. (Oddly, it only happens for a few usernames.)

I figure it's a bug with sort (GNU coreutils) 5.97 but I won't be able to get the sys-admins to patch the system.

So is there a way of removing duplicate lines from a file with another tool?

schneidz · 10-22-2012, 12:09 PM

does

Code:

sort | uniq

work ?

Didier Spaier · 10-22-2012, 12:19 PM

Code:

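#!/bin/bash
# Sort the file so duplicate lines become adjacent, then print each line
# only when it differs from the previous one.
rm -f withoutduplicates.txt
sort yourfile.txt | awk '
    # NR == 1 keeps the first line even if it is empty
    NR == 1 || prev != $0 { print >> "withoutduplicates.txt" }
    { prev = $0 }
'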

colucix · 10-22-2012, 12:22 PM

Code:

awk '!_[$0]++' file

MikeyCarter · 10-22-2012, 01:10 PM

Turns out I found the "bug" -- mostly in my head --

| LC_COLLATE=C sort -u

That did the trick.

I'm keeping the other solutions on hand in case something else comes up.

Thanks for all your help.
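For anyone landing here with the same symptom, here is a small sketch of the LC_COLLATE=C fix above. As far as I understand it, sort -u treats two lines as duplicates when they compare equal under the current locale's collation rules, and in some locales different strings can compare equal; the C locale compares raw bytes instead. The function name dedupe_bytes and the sample input are made up for illustration. It sets LC_ALL rather than LC_COLLATE because an exported LC_ALL would otherwise override LC_COLLATE.

```shell
# Hypothetical helper: sort stdin and drop duplicate lines, comparing raw bytes.
# LC_ALL takes precedence over LC_COLLATE, so it is the one set here.
dedupe_bytes() {
    LC_ALL=C sort -u
}

# Sample input (made up): "b" appears twice, "B" is a different byte.
printf 'b\na\nb\nB\n' | dedupe_bytes
```

With byte ordering this prints B, a, b on separate lines: uppercase sorts before lowercase, and only the exact duplicate is removed.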

syg00 · 10-22-2012, 02:44 PM

There are occasions where sort is inappropriate, so @colucix has a useful answer. Of course, not all awk behaves as expected - I have a SunOS awk doing mighty strange things at present.

colucix · 10-22-2012, 03:24 PM

Quote:

Originally Posted by syg00

There are occasions where sort is inappropriate, so @colucix has a useful answer. Of course, not all awk behaves as expected - I have a SunOS awk doing mighty strange things at present.

Yes, other users have reported that this simple syntax doesn't work with awk on SunOS. I can't explain why: awk on SunOS has the ! and ++ operators, it has the same concept of true and false, and referencing a non-existent array element creates that element and returns the null string (false). I don't see any other rule in action here that might be specific to GNU awk. Someone reported that awk on SunOS is buggy, but I cannot verify that. Anyway, just out of curiosity, what do you get by running the suggested code on SunOS?
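To make the reasoning above concrete, here is the one-liner spelled out step by step. This is only a sketch of the same logic, not a tested SunOS workaround; the function name dedupe_keep_order and the array name seen are made up for illustration.

```shell
# Hypothetical helper: same idea as awk '!_[$0]++', written out in full.
# Reads the files given as arguments, or stdin when there are none.
dedupe_keep_order() {
    awk '{
        # A line not seen before has an empty array element, which equals 0.
        if (seen[$0] == 0) print
        # Count this occurrence so later copies are skipped.
        seen[$0]++
    }' "$@"
}

# Sample input (made up): unsorted, with repeats.
printf 'b\na\nb\na\nc\n' | dedupe_keep_order
```

This prints b, a, c: the first occurrence of each line survives, in the original order, and no sort is needed.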

David the H. · 10-23-2012, 01:46 AM

According to this page (#43), the following variation is more efficient. I imagine it's more likely to work properly on SunOS too.

Code:

awk '!($0 in a) { a[$0]; print }'
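A quick way to try this variation, as a sketch: the in operator tests whether the line is already a key without creating it, and the bare a[$0] statement then creates the key without keeping a count. The function name dedupe_in and the sample input are made up for illustration; whether this behaves on SunOS awk is untested here.

```shell
# Hypothetical wrapper around the one-liner above.
# Reads the files given as arguments, or stdin when there are none.
dedupe_in() {
    awk '!($0 in a) { a[$0]; print }' "$@"
}

# Sample input (made up): includes repeated blank lines.
printf 'x\n\nx\n\ny\n' | dedupe_in
```

This prints x, one blank line, then y. Blank lines are treated like any other line, so only the first one is kept.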

syg00 · 10-23-2012, 03:19 AM

Quote:

Originally Posted by colucix

Anyway, just out of curiosity, what do you get by running the suggested code on SunOS?

Nothing.
As it happened I was in the mood to try a few things before I saw these responses. The ++ post operator works, the not (!) doesn't - if expanded to full "if (_[$0]++ != 0) <blah> ..." it works as expected.
I also wanted to replicate the (single) data in each line - simple. "awk '{print $0,$0}' file"
I wish ...

Testing with strings of "..." and "\t" strategically placed seemed to indicate the first field was always dropped - unless it was the only field. This was true using $0 or $1 or $NF in the command.

And of course sed wasn't smart enough to allow me to do anything useful either.
I could get *really* attached to the GNU extensions ...