Old 12-03-2013, 05:05 AM   #1
dunryc
LQ Newbie
 
Registered: Jul 2004
Posts: 10

Rep: Reputation: 0
removing duplicate entries from a text file


Hi guys, I have a file as below:

Quote:
oranges
apples
lemons
grapes
pears
grapes
grapes
lemons
ornages

I want to display only the lines that have not been duplicated; I don't want to just remove the duplicates:

Quote:
pears
apples
Any way to do this from the CLI?


Many thanks, Pete
 
Old 12-03-2013, 05:07 AM   #2
NevemTeve
Senior Member
 
Registered: Oct 2011
Location: Budapest
Distribution: Debian/GNU/Linux, AIX
Posts: 4,864
Blog Entries: 1

Rep: Reputation: 1869
Code:
sort <input | uniq -u >output
 
3 members found this post helpful.
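A note on what uniq -u does: it prints only the lines that occur exactly once in its input, and because uniq only compares adjacent lines, the input has to be sorted first. A minimal sketch, assuming the list above is saved as fruit.txt (an illustrative filename):
Code:
# sort so that duplicates become adjacent, then keep only lines that appear exactly once
sort fruit.txt | uniq -u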
Old 12-03-2013, 05:18 AM   #3
dunryc
LQ Newbie
 
Registered: Jul 2004
Posts: 10

Original Poster
Rep: Reputation: 0
Quote:
Originally Posted by NevemTeve
Code:
sort <input | uniq -u >output
This is going to help me out on a daily basis. Thanks a lot, and in future I'll be sure to read the man pages a little more!
 
Old 12-03-2013, 06:52 AM   #4
sundialsvcs
LQ Guru
 
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 10,659
Blog Entries: 4

Rep: Reputation: 3941
Quote:
Originally Posted by dunryc
... and in future I'll be sure to read the man pages a little more!
You definitely should. Unix/Linux is fairly stuffed with an odd assortment of very useful things ... not to mention a generous helping of true programming languages ... all of it free. It's quite entertaining to mosey around /usr/bin, say, and to think, "gee, wonder what that's for?"

For example, if you really need to "do the business with a text file," check out awk.
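As a small illustration of the sort of text-file work awk makes easy (a sketch only; fruit.txt is an illustrative filename):
Code:
# count how many times each line occurs in the file
awk '{count[$0]++} END {for (line in count) print count[line], line}' fruit.txt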

Last edited by sundialsvcs; 12-04-2013 at 08:19 AM.
 
Old 12-03-2013, 09:51 AM   #5
gnashley
Amigo developer
 
Registered: Dec 2003
Location: Germany
Distribution: Slackware
Posts: 4,928

Rep: Reputation: 612
Code:
sort <input | uniq -u >output
doesn't do what the OP asked anyway:
Quote:
bash-4.1$ sort <input | uniq -u
apples
oranges
ornages
pears
 
Old 12-03-2013, 10:49 AM   #6
NevemTeve
Senior Member
 
Registered: Oct 2011
Location: Budapest
Distribution: Debian/GNU/Linux, AIX
Posts: 4,864
Blog Entries: 1

Rep: Reputation: 1869
You mean it should have found out that 'oranges' = 'ornages'?
 
Old 12-03-2013, 06:04 PM   #7
PTrenholme
Senior Member
 
Registered: Dec 2004
Location: Olympia, WA, USA
Distribution: Fedora, (K)Ubuntu
Posts: 4,187

Rep: Reputation: 354
Another solution:
Code:
$ gawk '{++x[$0]} END{for(i in x){if (x[i]==1){print i}}}' test.dat
apples
ornages
oranges
pears
And, if you want to "fix" the orange problem, try
Code:
$ /usr/share/awk/soundex.awk test.dat | gawk 'NF==2{print $2}'
apples
pears
 
2 members found this post helpful.
Old 12-04-2013, 05:38 AM   #8
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Rep: Reputation: 660
Notice something nice about the method of PTrenholme: it is easily scalable.
Code:
echo "Identify input lines which appeared EXACTLY once"
awk '{a[$0]++} END{for (j in a) if (a[j]==1) print j}' $InFile >$OutFile

echo "Identify input lines which appeared EXACTLY twice"
awk '{a[$0]++} END{for (j in a) if (a[j]==2) print j}' $InFile >$OutFile

echo "Identify input lines which appeared EXACTLY thrice"
awk '{a[$0]++} END{for (j in a) if (a[j]==3) print j}' $InFile >$OutFile
Daniel B. Martin
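The three one-liners above differ only in the hard-coded count, so the same idea can be parameterised; a sketch along those lines, with the variable name n chosen for illustration and $InFile/$OutFile as in the post above:
Code:
# print the input lines that appear exactly n times; the count is passed in with -v
awk -v n=2 '{a[$0]++} END{for (j in a) if (a[j]==n) print j}' $InFile >$OutFile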
 
  

