LinuxQuestions.org
Old 08-02-2013, 02:11 AM   #1
upendra_35
LQ Newbie
 
Registered: Oct 2012
Posts: 21

Rep: Reputation: Disabled
uniq values in unsorted file


Hi, I am trying to count unique values in a file, but I'm having trouble because the same value occurs multiple times in non-consecutive positions.

For example for this file
Code:
word
word
other
word
word
What I am getting back from uniq -c is this
Code:
2 word
1 other
2 word
but what I want is this
Code:
4 word
1 other
I know I can sort the list and then run uniq, but I don't want to sort the list. Can anybody point me to a tool that does this?

Thanks
Upendra
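
For anyone reproducing the problem, here is a minimal sketch. It assumes a throwaway temp file holding the five sample lines above; the temp file and the variable name adjacent_counts are illustrative and not part of the original post. It shows that uniq -c only merges runs of adjacent identical lines, which is why "word" is reported twice.

```shell
# Hypothetical sample file reproducing the input from the post.
tmp=$(mktemp)
printf '%s\n' word word other word word > "$tmp"

# uniq -c only collapses adjacent duplicates, so the two separate
# runs of "word" are counted separately. The awk step just strips
# the leading padding that uniq -c adds before each count.
adjacent_counts=$(uniq -c "$tmp" | awk '{ print $1, $2 }')
printf '%s\n' "$adjacent_counts"

rm -f "$tmp"
```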
 
Old 08-02-2013, 02:15 AM   #2
astrogeek
Moderator
 
Registered: Oct 2008
Distribution: Slackware [64]-X.{0|1|2|37|-current} ::12<=X<=14, FreeBSD_10{.0|.1|.2}
Posts: 3,879
Blog Entries: 1

Rep: Reputation: 1998
Quote:
Originally Posted by upendra_35
I know I can sort the list and then run uniq, but I don't want to sort the list. Can anybody point me to a tool that does this?
I assume you mean that you do not want to pre-sort the file itself. Piping it through sort on the way to uniq is just as easy and does not change the file.

Code:
cat file.txt | sort | uniq -c

Last edited by astrogeek; 08-02-2013 at 02:16 AM. Reason: typo
 
Old 08-02-2013, 02:20 AM   #3
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Centos 6.8, Centos 5.10
Posts: 17,240

Rep: Reputation: 2324
Why not sort the list? You don't have to save the sorted output.
Something like
Code:
sort file | uniq -c
 
Old 08-02-2013, 02:27 AM   #4
upendra_35
LQ Newbie
 
Registered: Oct 2012
Posts: 21

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by astrogeek
I assume you mean that you do not want to pre-sort the file itself. Piping it through sort on the way to uniq is just as easy and does not change the file.

Code:
cat file.txt | sort | uniq -c
I don't want to sort the list because I want to preserve the order in which the values originally appear in the file. In other words, I am counting entries in a file of genes ordered by where they appear on the chromosomes, so sorting them would be useless for me.

Thanks
Upendra
 
Old 08-02-2013, 02:34 AM   #5
astrogeek
Moderator
 
Registered: Oct 2008
Distribution: Slackware [64]-X.{0|1|2|37|-current} ::12<=X<=14, FreeBSD_10{.0|.1|.2}
Posts: 3,879
Blog Entries: 1

Rep: Reputation: 1998
Quote:
Originally Posted by upendra_35
I don't want to sort the list because I want to preserve the order in which the values originally appear in the file. In other words, I am counting entries in a file of genes ordered by where they appear on the chromosomes, so sorting them would be useless for me.
But you said...

but what i want is this
Code:
4 word
1 other
Which IS a sort of sorts - the two "word"s AFTER "other" do not appear in their original order either.

Which is what the code I gave (using sort) would give you.

Just to be very clear, both code examples given do NOT modify the file - it remains unsorted.
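
A quick way to check that claim, as a sketch only: it assumes a temp file with the sample data and compares a checksum taken before and after the pipeline. The names tmp, before, after, sorted_counts and unchanged are illustrative, not from the thread.

```shell
# Hypothetical sample file with the data from the first post.
tmp=$(mktemp)
printf '%s\n' word word other word word > "$tmp"

# Checksum the file, run the sort | uniq -c pipeline, checksum again.
before=$(cksum < "$tmp")
sorted_counts=$(sort "$tmp" | uniq -c | awk '{ print $1, $2 }')
after=$(cksum < "$tmp")

printf '%s\n' "$sorted_counts"

# sort only reads the file; the sorted data exists only in the pipeline.
[ "$before" = "$after" ] && unchanged=yes || unchanged=no
echo "file unchanged: $unchanged"

rm -f "$tmp"
```

Note that the counts come out in sorted order here ("other" before "word"), not in first-occurrence order.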

Last edited by astrogeek; 08-02-2013 at 02:44 AM.
 
Old 08-02-2013, 02:41 AM   #6
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Centos 6.8, Centos 5.10
Posts: 17,240

Rep: Reputation: 2324
Me too; the code I supplied (no need for cat) sorts on the fly... it does NOT change the original file or save the output to disk.
 
Old 08-02-2013, 02:48 AM   #7
upendra_35
LQ Newbie
 
Registered: Oct 2012
Posts: 21

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by astrogeek
But you said...

but what i want is this
Code:
4 word
1 other
Which IS a sort of sorts - the two "word"s AFTER "other" do not appear in their original order either.

Which is what the code I gave (using sort) would give you.

Just to be very clear, the code I gave does NOT modify the file - it remains unsorted.

Your command actually sorts the list like this
Code:
cat linux_test.txt | sort | uniq -c
      1 other
      4 word
As I said before, I want it like this
Code:
 
      4 word
      1 other
Even though the two 'word' lines after 'other' are folded into the count, 'word' appeared before 'other' in the file, so this order is fine. But I don't know how to do this.

Thanks
Upendra
 
Old 08-02-2013, 02:50 AM   #8
konsolebox
Senior Member
 
Registered: Oct 2005
Distribution: Gentoo, Slackware, LFS
Posts: 2,248
Blog Entries: 8

Rep: Reputation: 235
Quote:
Originally Posted by upendra_35
I don't want to sort the list because I want to preserve the order in which the values originally appear in the file. In other words, I am counting entries in a file of genes ordered by where they appear on the chromosomes, so sorting them would be useless for me.
The original file would stay unsorted. Only the data streaming through the pipeline gets sorted, so that uniq can group identical lines; the file itself is untouched. How would the result differ for you?

Using gawk would also help. This prints the number of distinct lines, and it probably works with other awk implementations as well.
Code:
gawk -- '{ !a[$0]++ && ++c; } END { print c; }' file.txt
 
Old 08-02-2013, 02:51 AM   #9
astrogeek
Moderator
 
Registered: Oct 2008
Distribution: Slackware [64]-X.{0|1|2|37|-current} ::12<=X<=14, FreeBSD_10{.0|.1|.2}
Posts: 3,879
Blog Entries: 1

Rep: Reputation: 1998
[EDIT]You typed faster than I did... I defer to your post above[/EDIT]

So, what you really want is for the totals to appear in first-occurrence order.

Last edited by astrogeek; 08-02-2013 at 02:55 AM.
 
Old 08-02-2013, 02:55 AM   #10
konsolebox
Senior Member
 
Registered: Oct 2005
Distribution: Gentoo, Slackware, LFS
Posts: 2,248
Blog Entries: 8

Rep: Reputation: 235
Quote:
Originally Posted by upendra_35
I want it like this
Code:
 
      4 word
      1 other
Here's an update, sorry:
Code:
gawk -- '{ ++a[$0]; } END { for (i in a) { print a[i] " " i;} }' file.txt

Last edited by konsolebox; 08-02-2013 at 02:59 AM. Reason: c is no longer needed.
 
Old 08-02-2013, 03:39 AM   #11
konsolebox
Senior Member
 
Registered: Oct 2005
Distribution: Gentoo, Slackware, LFS
Posts: 2,248
Blog Entries: 8

Rep: Reputation: 235
Seems like gawk doesn't walk the keys in insertion order (the order of for (i in a) is unspecified in awk), so we have to record them in another array:
Code:
gawk -- '{ if (!a[$0]++) b[c++] = $0; } END { for (i = 0; i < c; ++i) { k = b[i]; print a[k] " " k;} }' file.txt
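That one-liner is rather condensed, so a script version of it could be
Code:
#!/usr/bin/gawk -f
# (Adjust the path above if gawk is installed elsewhere.)
# Count each distinct line and print the totals in first-occurrence order.
{
    # The first time a line is seen, record it in keys[] to remember its position.
    if (!counts[$0]++) {
        keys[k++] = $0
    }
}
END {
    # Walk keys[] by numeric index so the output follows first-occurrence order.
    for (i = 0; i < k; ++i) {
        key = keys[i]
        print counts[key] " " key
    }
}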
Code:
gawk -f script.awk -- file.txt
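
For use in a pipeline, the same idea can be wrapped in a shell function. This is only a sketch: the function name count_first_seen is made up for illustration, and it assumes plain awk is enough (nothing here is gawk-specific).

```shell
# count_first_seen is a hypothetical helper name, not from the thread.
# It reads the named files, or standard input when no file is given.
count_first_seen() {
    awk '
        # The first time a line is seen, remember its position in keys[].
        !counts[$0]++ { keys[n++] = $0 }
        END {
            # Print totals in the order the values first appeared.
            for (i = 0; i < n; i++)
                print counts[keys[i]], keys[i]
        }
    ' "$@"
}

# Sample input from the first post, fed through standard input.
result=$(printf '%s\n' word word other word word | count_first_seen)
printf '%s\n' "$result"
```

With the sample input this prints "4 word" followed by "1 other". The order does not depend on how awk iterates its arrays, because keys[] is walked by numeric index.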

Last edited by konsolebox; 08-02-2013 at 06:29 AM. Reason: '
 
1 members found this post helpful.
Old 08-02-2013, 12:00 PM   #12
upendra_35
LQ Newbie
 
Registered: Oct 2012
Posts: 21

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by konsolebox
Seems like gawk doesn't walk the keys in insertion order (the order of for (i in a) is unspecified in awk), so we have to record them in another array:
Code:
gawk -- '{ if (!a[$0]++) b[c++] = $0; } END { for (i = 0; i < c; ++i) { k = b[i]; print a[k] " " k;} }' file.txt
That one-liner is rather condensed, so a script version of it could be
Code:
#!/usr/bin/gawk -f
{
    if (!counts[$0]++) {
        keys[k++] = $0
    }
}
END {
    for (i = 0; i < k; ++i) {
        key = keys[i]
        print counts[key] " " key
    }
}
Code:
gawk -f script.awk -- file.txt
Really worked like a charm. Thanks for your help.
 
Old 08-02-2013, 02:13 PM   #13
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,251

Rep: Reputation: 2684
Here's a Ruby variation on the theme:
Code:
ruby -ne 'BEGIN{a=Hash.new(0)}; a[$_]+=1; END{ a.each{|k,v| puts "#{v} #{k}" } }' file
 
Old 08-02-2013, 04:11 PM   #14
konsolebox
Senior Member
 
Registered: Oct 2005
Distribution: Gentoo, Slackware, LFS
Posts: 2,248
Blog Entries: 8

Rep: Reputation: 235
Quote:
Originally Posted by grail
Code:
ruby -ne 'BEGIN{a=Hash.new(0)}; a[$_]+=1; END{ a.each{|k,v| puts "#{v} #{k}" } }' file
Tried it on Ruby 1.8 and it seems like you need a variation to make it work properly:
Code:
ruby -e 'a = Hash.new(0); b = Array.new; c = 0; while gets(); k = $_.chomp; a[k] += 1; if a[k] == 1; b[c] = k; c += 1; end; end; b.each {|k| puts "#{a[k]} #{k}"}'
Also, it's odd: starting with 1.9, hash keys are meant to keep their insertion order, although I'm not sure that carries over to iterating over the elements themselves. Still, I wonder whether that behaviour was fixed or changed somewhere.
 
Old 08-03-2013, 12:27 AM   #15
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,251

Rep: Reputation: 2684
hmmm ... I am running 2.0, so are you saying that in 1.8 when using my script it sorts the data so 'other' appears first?


ahhh ... just looked this up:

1.8
Code:
The order in which you traverse a hash by either key or value may seem arbitrary, and will generally not be in the insertion order.
2.0
Code:
Hashes enumerate their values in the order that the corresponding keys were inserted.
That would explain it
 
  

