LinuxQuestions.org


upendra_35 08-02-2013 01:11 AM

uniq values in unsorted file
 
Hi, I am trying to count unique values in a file, but I am having trouble because the same value occurs multiple times in non-consecutive positions.

For example for this file
Code:

word
word
other
word
word

what I am getting back with uniq -c is this
Code:

2 word
1 other
2 word

but what I want is this
Code:

4 word
1 other

I know I can sort the list and then run uniq, but I don't want to sort the list. Can anybody point me to a tool that does this?

Thanks
Upendra
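
The output above follows from how uniq works: it only compares adjacent lines, so a value that reappears later starts a new group. A minimal sketch reproducing it on the sample data, assuming a POSIX shell (the function name show_adjacent_counts is hypothetical, not from the thread):

```shell
# Hypothetical helper: run uniq -c on the sample data from the question.
# uniq only merges runs of identical adjacent lines, so "word" is
# reported twice, once for each run.
show_adjacent_counts() {
    printf '%s\n' word word other word word | uniq -c
}

show_adjacent_counts
```

The width of the leading padding in front of each count depends on the uniq implementation.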

astrogeek 08-02-2013 01:15 AM

Quote:

Originally Posted by upendra_35 (Post 5001376)
I know I can sort the list and then run uniq, but I don't want to sort the list. Can anybody point me to a tool that does this?

I assume that you mean you do not want to pre-sort the file itself, but piping it through sort would be as easy as using uniq without actually changing the file.

Code:

cat file.txt | sort | uniq -c

chrism01 08-02-2013 01:20 AM

Why not sort the list; you don't have to save it (the sorted list).
Something like
Code:

sort file | uniq -c

upendra_35 08-02-2013 01:27 AM

Quote:

Originally Posted by astrogeek (Post 5001378)
I assume that you mean you do not want to pre-sort the file itself, but piping it through sort would be as easy as using uniq without actually changing the file.

Code:

cat file.txt | sort | uniq -c

I don't want to sort the list because I want to preserve the order in which the entries originally appear in the file. In other words, I am counting entries in a file of genes ordered as they appear on the chromosomes, so sorting them would be useless for me.

Thanks
Upendra

astrogeek 08-02-2013 01:34 AM

Quote:

Originally Posted by upendra_35 (Post 5001385)
I don't want to sort the list because I want to preserve the order in which the entries originally appear in the file. In other words, I am counting entries in a file of genes ordered as they appear on the chromosomes, so sorting them would be useless for me.

But you said...

but what i want is this
Code:

4 word
1 other

Which IS a sort of sorts - the two "word"s AFTER "other" do not appear in their original order either.

Which is what the code I gave (using sort) would give you.

Just to be very clear, both code examples given do NOT modify the file - it remains unsorted.

chrism01 08-02-2013 01:41 AM

Me too; the code I supplied (no need for cat) sorts on the fly... it does NOT change the original file or save the output to disk.
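
A quick way to check the point made above, that sorting in a pipeline leaves the input file alone, is to compare a checksum of the file before and after. This is only a sketch: the variable names are made up for illustration, and it assumes mktemp and cksum are available.

```shell
# Hypothetical demo: sorting in a pipeline leaves the input file untouched.
demo_file=$(mktemp)
printf '%s\n' word word other word word > "$demo_file"

before=$(cksum < "$demo_file")
sort "$demo_file" | uniq -c     # counts are printed in sorted order
after=$(cksum < "$demo_file")

# The checksums match because sort only reads the file.
if [ "$before" = "$after" ]; then
    echo "file unchanged"
fi

rm -f "$demo_file"
```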

upendra_35 08-02-2013 01:48 AM

Quote:

Originally Posted by astrogeek (Post 5001390)
But you said...

but what i want is this
Code:

4 word
1 other

Which IS a sort of sorts - the two "word"s AFTER "other" do not appear in their original order either.

Which is what the code I gave (using sort) would give you.

Just to be very clear, the code I gave does NOT modify the file - it remains unsorted.


Your command actually sorts the list like this
Code:

cat linux_test.txt | sort | uniq -c
      1 other
      4 word

As I said before, I wanted it like this
Code:


      4 word
      1 other

Even though the two 'word' lines after 'other' are gone, 'word' appeared before 'other', so this is fine. But I don't know how to do this now.

Thanks
Upendra

konsolebox 08-02-2013 01:50 AM

Quote:

Originally Posted by upendra_35 (Post 5001385)
I don't want to sort the list because I want to preserve the order in which the entries originally appear in the file. In other words, I am counting entries in a file of genes ordered as they appear on the chromosomes, so sorting them would be useless for me.

The original file would remain unsorted. Only a copy of its contents, not the file itself, gets sorted in memory within the pipeline so that uniq can do its job. How would the result differ for you?

Using gawk would also be helpful. It probably works with other awks as well.
Code:

gawk -- '{ !a[$0]++ && ++c; } END { print c; }' file.txt

astrogeek 08-02-2013 01:51 AM

[EDIT]You typed faster than I did... I defer to your post above[/EDIT]

So, what you really want is for the totals to appear in order of first occurrence.

konsolebox 08-02-2013 01:55 AM

Quote:

Originally Posted by upendra_35 (Post 5001400)
i wanted like this
Code:


      4 word
      1 other


Here's an update, sorry:
Code:

gawk -- '{ ++a[$0]; } END { for (i in a) { print a[i] " " i;} }' file.txt

konsolebox 08-02-2013 02:39 AM

Seems like gawk does not walk the keys in insertion order (the order of "for (i in a)" is unspecified), so we have to record them in another array:
Code:

gawk -- '{ if (!a[$0]++) b[c++] = $0; } END { for (i = 0; i < c; ++i) { k = b[i]; print a[k] " " k;} }' file.txt
That's a bit too condensed, so a script version of it could be
Code:

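#!/usr/bin/env gawk -f
# Count each distinct line and report the counts in first-occurrence order.
{
    # First time this line is seen: remember it at the next position.
    if (!counts[$0]++) {
        keys[k++] = $0
    }
}
END {
    # Walk the keys in the order they were first seen.
    for (i = 0; i < k; ++i) {
        key = keys[i]
        print counts[key] " " key
    }
}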

Code:

gawk -f script.awk -- file.txt
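
For trying the idea above on the sample data from the first post, here is the same technique wrapped in a shell function. It is a sketch rather than a replacement for the script: the name count_in_order is hypothetical, and it uses plain awk with the "in" operator instead of the gawk one-liner's "!a[$0]++" test.

```shell
# Hypothetical wrapper: count lines, reporting each distinct line once,
# in the order of its first occurrence. Reads the named files, or
# standard input when no file is given.
count_in_order() {
    awk '
        # First time a line is seen, remember its position in "order".
        !($0 in count) { order[++n] = $0 }
        { count[$0]++ }
        END {
            for (i = 1; i <= n; i++)
                print count[order[i]], order[i]
        }
    ' "$@"
}

printf '%s\n' word word other word word | count_in_order
```

On the five sample lines this prints "4 word" followed by "1 other". With a file, the call would be count_in_order file.txt.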

upendra_35 08-02-2013 11:00 AM

Quote:

Originally Posted by konsolebox (Post 5001429)
Seems like gawk does not walk the keys in insertion order (the order of "for (i in a)" is unspecified), so we have to record them in another array:
Code:

gawk -- '{ if (!a[$0]++) b[c++] = $0; } END { for (i = 0; i < c; ++i) { k = b[i]; print a[k] " " k;} }' file.txt
Too much condensing so the script version of it could be
Code:

#!/usr/bin/env gawk -f
{
    if (!counts[$0]++) {
        keys[k++] = $0
    }
}
END {
    for (i = 0; i < k; ++i) {
        key = keys[i]
        print counts[key] " " key
    }
}

Code:

gawk -f script.awk -- file.txt

Really worked like a charm. Thanks for your help.

grail 08-02-2013 01:13 PM

Here's a ruby variation on the theme:
Code:

ruby -ne 'BEGIN{a=Hash.new(0)}; a[$_]+=1; END{ a.each{|k,v| puts "#{v} #{k}" } }' file

konsolebox 08-02-2013 03:11 PM

Quote:

Originally Posted by grail (Post 5001744)
Code:

ruby -ne 'BEGIN{a=Hash.new(0)}; a[$_]+=1; END{ a.each{|k,v| puts "#{v} #{k}" } }' file

Tried it on Ruby 1.8 and it seems like you need a variation to make it work properly:
Code:

ruby -e 'a = Hash.new(0); b = Array.new; c = 0; while gets(); k = $_.chomp; a[k] += 1; if a[k] == 1; b[c] = k; c += 1; end; end; b.each {|k| puts "#{a[k]} #{k}"}' file
It's also odd, since starting with 1.9 hash keys are meant to be kept in order, although that doesn't necessarily apply to the elements themselves. Still, I wonder if there has been a fix or change of behaviour for it somewhere.

grail 08-02-2013 11:27 PM

hmmm ... I am running 2.0, so are you saying that in 1.8 when using my script it sorts the data so 'other' appears first?


ahhh ... just looked this up:

1.8
Code:

The order in which you traverse a hash by either key or value may seem arbitrary, and will generally not be in the insertion order.
2.0
Code:

Hashes enumerate their values in the order that the corresponding keys were inserted.
That would explain it :)
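
The same caveat applies to awk, where the order of "for (k in a)" is unspecified. Besides keeping a second array of keys, another option is to not rely on traversal order at all: print the line number of each key's first occurrence and let sort restore the order afterwards. A minimal sketch of that alternative, assuming a POSIX shell with awk, sort and cut (the function name count_by_first_seen is hypothetical):

```shell
# Hypothetical helper: same counts as before, but first-occurrence order
# is restored by sort rather than by a second awk array.
count_by_first_seen() {
    awk '
        !($0 in count) { first[$0] = NR }   # line number of first occurrence
        { count[$0]++ }
        END {
            # Traversal order here is unspecified; sort fixes it afterwards.
            for (k in count)
                printf "%d\t%d %s\n", first[k], count[k], k
        }
    ' "$@" | sort -n | cut -f2-
}

printf '%s\n' word word other word word | count_by_first_seen
```

The sort is numeric on the leading line number, and cut then drops that number again, so only the count and the value remain.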


All times are GMT -5.