[SOLVED] uniq command not able to remove duplicate entries

ls_milkyway · 08-22-2013, 02:58 AM

Hi,

Removing duplicates with sort followed by uniq is not working on my file, which contains blacklisted URLs. For example:

10.44.56.78
10.44.56.78
10.wrs.org
10.wrs.org
baby.us
baby.us
zym.com
zym.com
.
.
.
So...on

The command
uniq input.txt > output.txt results in:

10.44.56.78
10.44.56.78
10.wrs.org
10.wrs.org
baby.us
zym.com
.
.
.
So...on

whereas, I want output:

10.44.56.78
10.wrs.org
baby.us
zym.com
.
.
.
So...on

Can you plz suggest how to remove these duplicates (ip addresses or integer values)??

Thanks in advance.

astrogeek · 08-22-2013, 03:03 AM

Quote:

Originally Posted by ls_milkyway

The command
uniq input.txt > output.txt results in:

No need for '>' redirect...

Code:

uniq input.txt output.txt

Should do it.

The redirect in your command simply told it to write input.txt to output.txt, as you see...
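
Whether the redirect matters can be checked directly. The following is a minimal sketch, assuming a POSIX shell with mktemp and cmp available; the sample data and file names are made up for illustration. POSIX specifies that a second operand to uniq names the output file, so both forms would be expected to produce the same result:

```shell
#!/bin/sh
# Hypothetical sample data; the file names are placeholders.
tmp=$(mktemp -d)
printf 'baby.us\nbaby.us\nzym.com\nzym.com\n' > "$tmp/input.txt"

# Form 1: uniq writes to standard output and the shell redirects it to a file.
uniq "$tmp/input.txt" > "$tmp/redirected.txt"

# Form 2: uniq is given the output file as its second operand.
uniq "$tmp/input.txt" "$tmp/operand.txt"

# cmp -s exits with status 0 when the two files are byte-for-byte identical.
if cmp -s "$tmp/redirected.txt" "$tmp/operand.txt"; then
    same=yes
else
    same=no
fi
echo "identical output: $same"

rm -rf "$tmp"
```

If the two files match, any duplicates that survive are more likely caused by the data itself than by the way the output is written.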

ls_milkyway · 08-22-2013, 03:20 AM

Ok! Thanks for quick response.
Well, you mean to say that ">" was simply redirecting the input?
But I still don't understand why the duplicate entries of baby.us
and zym.com were removed.

Firerat · 08-22-2013, 03:27 AM

That should work... so are they really 'uniq'?

Let's look at uniq --help.

It tells you that uniq does not detect repeated lines unless they are adjacent, and it also hints that you can use sort -u.

so try

Code:

sort -u
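
To illustrate the adjacency rule, here is a small sketch assuming a POSIX shell; the sample list is made up and deliberately left unsorted, so that one duplicate pair is adjacent and the other is not:

```shell
#!/bin/sh
tmp=$(mktemp -d)
# Hypothetical unsorted list: the baby.us duplicates are not adjacent,
# while the last two zym.com lines are.
printf 'baby.us\nzym.com\nbaby.us\nzym.com\nzym.com\n' > "$tmp/input.txt"

# uniq alone only collapses the adjacent zym.com pair.
uniq_only=$(uniq "$tmp/input.txt")

# sort -u sorts first, so every duplicate ends up removed.
sorted_unique=$(sort -u "$tmp/input.txt")

printf 'uniq alone:\n%s\n\nsort -u:\n%s\n' "$uniq_only" "$sorted_unique"

rm -rf "$tmp"
```

With uniq alone, four lines remain; with sort -u, only baby.us and zym.com are left.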

Assuming that still doesn't work, we go back to my cryptic question.

Since you mention blacklists, I'm assuming you want to weed out duplicates from several lists to make one big list.
So it is possible that some of those lists have DOS EOLs (end of line) while others have Unix EOLs.

confirm with

Code:

cat -A input.txt

You may see some lines ending in $ and some in ^M$ (the ^M is the carriage return of a DOS line ending).
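
cat -A comes from GNU cat and may not exist on other systems. For a quick count rather than a full listing, the lines ending in a carriage return can also be counted with grep. This is only a sketch, assuming a POSIX shell; the sample file and the variable names are made up:

```shell
#!/bin/sh
tmp=$(mktemp -d)
# Hypothetical file mixing Unix (LF) and DOS (CRLF) line endings.
printf 'baby.us\nbaby.us\r\nzym.com\r\nzym.com\n' > "$tmp/input.txt"

# A literal carriage return, kept in a variable so the pattern stays portable.
cr=$(printf '\r')

# Count the lines that end in CR and the lines that do not.
dos_lines=$(grep -c "${cr}\$" "$tmp/input.txt")
unix_lines=$(grep -vc "${cr}\$" "$tmp/input.txt")

echo "lines ending in CR: $dos_lines"
echo "lines without CR:   $unix_lines"

rm -rf "$tmp"
```

A non-zero count in both categories means the file mixes the two line-ending styles.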

so try

Code:

tr -d "\r" < input.txt | sort -u > output.txt
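
As a sketch of what that pipeline does, assuming a POSIX shell and a made-up input file in which each entry appears once with a Unix ending and once with a DOS ending:

```shell
#!/bin/sh
tmp=$(mktemp -d)
# Hypothetical input: "baby.us" and "baby.us\r" differ by one byte,
# so they are not duplicates until the CR is removed.
printf 'baby.us\nbaby.us\r\nzym.com\r\nzym.com\n' > "$tmp/input.txt"

# Delete every carriage return, then sort and deduplicate in one step.
tr -d '\r' < "$tmp/input.txt" | sort -u > "$tmp/output.txt"

result=$(cat "$tmp/output.txt")
printf '%s\n' "$result"

rm -rf "$tmp"
```

Only one baby.us and one zym.com should remain in the output.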

shm0 · 08-22-2013, 03:31 AM

Quote:

Originally Posted by ls_milkyway

Can you plz suggest how to remove these duplicates (ip addresses or integer values)??

Thanks in advance.

You must have some extra space/tab or probably some other hidden characters. I tried your list and it works fine, but when I added extra space to one of the lines, it appeared twice. So, remove that extra space and check.
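
If trailing spaces or tabs turn out to be the cause, one possible approach is to strip them before deduplicating. A minimal sketch, assuming a POSIX shell and sed; the sample list is made up:

```shell
#!/bin/sh
tmp=$(mktemp -d)
# Hypothetical list: one copy of baby.us has a trailing space and
# one copy of zym.com has a trailing tab.
printf 'baby.us\nbaby.us \nzym.com\t\nzym.com\n' > "$tmp/input.txt"

# Remove trailing whitespace from every line, then sort and deduplicate.
cleaned=$(sed 's/[[:space:]]*$//' "$tmp/input.txt" | sort -u)
printf '%s\n' "$cleaned"

rm -rf "$tmp"
```

The [[:space:]] class also matches a carriage return, so the same sed expression would cover DOS line endings as well.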

astrogeek · 08-22-2013, 03:40 AM

Quote:

Originally Posted by ls_milkyway

Ok! Thanks for quick response.
Well, you mean to say that ">" was simply redirecting the input?
But I still don't understand why the duplicate entries of baby.us
and zym.com were removed.

As Firerat says, there may be different line endings if they come from different sources.

Also, now that I think about it, it is not entirely clear whether your file is sorted. Are you running sort on some sources and then redirecting the output to the file for uniq? Or are you running sort on the file and expecting the file itself to end up sorted? (sort writes its result to standard output; it does not change the file in place.)

Can you verify that the file used by uniq is actually sorted... sanity check.
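
One way to do that sanity check is sort -c, which only checks the order and reports through its exit status. A sketch assuming a POSIX shell; check_sorted is a hypothetical helper name and the sample files are made up:

```shell
#!/bin/sh
tmp=$(mktemp -d)
# Hypothetical data: 10.wrs.org appears after zym.com, so the file is unsorted.
printf 'baby.us\nzym.com\n10.wrs.org\n' > "$tmp/unsorted.txt"
sort "$tmp/unsorted.txt" > "$tmp/sorted.txt"

# check_sorted is a made-up helper: sort -c exits 0 only when the file
# is already in sorted order, and prints a diagnostic to stderr otherwise.
check_sorted() {
    if sort -c "$1" 2>/dev/null; then
        echo "sorted"
    else
        echo "not sorted"
    fi
}

unsorted_status=$(check_sorted "$tmp/unsorted.txt")
sorted_status=$(check_sorted "$tmp/sorted.txt")
echo "unsorted.txt: $unsorted_status"
echo "sorted.txt:   $sorted_status"

rm -rf "$tmp"
```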

astrogeek · 08-22-2013, 03:55 AM

I offer the following script to strip M$ line endings from the file.

Copy and paste it into a file (I name it undos), make it executable, then run ./undos filename.txt.

NOTE: USE AT YOUR OWN RISK!! It will prompt you for the --really option to confirm use!

But it should work well enough for this...

Code:

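#!/bin/bash
# Quick util to strip \r (carriage returns) from a text file, in place.

if [[ $# -eq 0 ]]
then
        echo "Usage: $0 filename --really"
        exit 1
fi

# Look for the --really confirmation flag among the arguments.
ok=0
for what in "$@"
do
        if [[ $what == '--really' ]]
        then
                ok=1
        fi
done

if [[ $ok -eq 1 ]]
then
        # -i makes sed edit the file in place (GNU sed syntax).
        sed -i 's/\r//g' "$1"
else
        echo "You are about to strip characters from a file, --really to continue!"
fi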

So, to be clear....

Code:

sort sourcefile(s) > sorted.txt
./undos sorted.txt --really
uniq sorted.txt outfile.txt

That should get you there

ls_milkyway · 08-22-2013, 04:02 AM

Thanks Firerat you got it!

Yes, astrogeek, the file was sorted using the sort command.

The lines have DOS EOLs, so they need to be converted with tr or dos2unix.

Thanks to all.

grail · 08-22-2013, 05:18 AM

How about changing the tool:

Code:

awk '!_[$1]++' RS="[\n\r]+" file
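
The one-liner above relies on a regular expression as the record separator, which gawk and mawk accept but POSIX leaves unspecified for multi-character RS values. As a sketch of the same idea that avoids that dependency (a swapped-in variant, not the command above), the carriage return can be stripped inside the awk program instead; seen is a hypothetical array name and the sample file is made up:

```shell
#!/bin/sh
tmp=$(mktemp -d)
# Hypothetical unsorted input with mixed line endings.
printf 'zym.com\r\nbaby.us\nzym.com\nbaby.us\r\n' > "$tmp/input.txt"

# sub() removes a trailing CR from each line. seen[$0]++ evaluates to 0
# the first time a line is encountered, so !seen[$0]++ is true only then
# and awk prints the line; later copies are skipped.
deduped=$(awk '{ sub(/\r$/, "") } !seen[$0]++' "$tmp/input.txt")
printf '%s\n' "$deduped"

rm -rf "$tmp"
```

Unlike sort -u, this keeps the lines in their original order and needs no sorting step, at the cost of holding every distinct line in memory.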