Hash-sort like in PERL in C++

nyk · 06-11-2004, 04:02 AM

I have a nice algorithm in PERL that uses hashes and works fine. But it uses way too much memory. Thats why I want to do the same in C. Searching for an implementation of PERL hashes in C, the best I found was the C++ STL map container. I would have prefered pure C, but maybe hash problems are easier to solve in C++.
I want to base my program on the following example code from the web. This example just creates, fills and then prints a hash.
But my problem is now that I have to sort the hash by value (not by key!). I tried this with the sort command, but this doesn't work and just gives a mass of errors when compiling. Anybody know whats wrong about this sort command I inserted in the otherwise correctly functioning example?

#include <iostream>
#include <map>
#include <string>
using namespace std;

int main()
{
typedef map<string,float> StringFloatMap;
StringFloatMap coll;
// insert some elements into the collection
coll["VAT"] = 0.15;
coll["Pi"] = 3.1415;
coll["an arbitrary number"] = 4983.223;
coll["Null"] = 0;
StringFloatMap::iterator pos;

/* THIS IS THE COMMAND THAN DOES NOT WORK !!!!! */
sort(coll.begin(),coll.end());
/* ------*/

for (pos = coll.begin(); pos != coll.end(); ++pos) {
cout << "key: \"" << pos->first << "\" "
<< "value: " << pos->second << endl;
}
}

dakensta · 06-11-2004, 06:10 AM

You can't std::sort a map. It is by definition sorted by key.

In addition, if, for whatever reason, you want a hashed map, use hash_map.

I am not entirely sure what to suggest for you purposes -
maybe create another map, reversing the key and value,
or maybe just a std::set using the values,
or maybe create a class containing both your key and value and then use a std::vector and define two comparison functors to use with std::sort...

I am a lttle rusty on this stuff but if you describe your requirements a bit more someone else might come up with some suitable ideas.

nyk · 06-11-2004, 07:05 AM

thanks for your reply! ok, I could use hash_map, this looks also good, but I don't know the difference to the other map...

What I want to do is the following:
I have a protein sequence, that mean a string of characters. e.g. MAGRTGTGTGTGAAA
Now I want to find repeats in this string. To do this, I create hashes of all possible and reasonable substrings like e.g. "MA", "AG", "GR", "TG","GRTG","TGAAA". For each occurence of the same substring I increment the corresponding hash. So the "GT" hast would have the value 3 and so on. Because I'm interested mostly in the most frequent repeats, it would be the most efficient to sort the hash by value now, so I can interate and get "GT"=3 for this string as the first element. I did this in PERL and it works fine, but because it is now memory efficient, I want to do this in C. But how can I sort the hash_map by its value?

Thanks for any help!

nyk · 06-11-2004, 07:06 AM

I meant: It was NOT memory efficient in perl...

dakensta · 06-11-2004, 08:40 AM

You just have a trade off here between various features.

My first instinct is to say that in comparison to searching a DNA sequence, ordering the results is going to have a fairly small overhead, but that depends on the number of sequences you ae looking for - it may be 10, but it may be 10,000. I know of one friend who was doing something similar and they were only looking for a handful of sequences in sequences of hundrends of millions codon, so creating another map would carry almost no overhead in comparison. (Bear in mind it might, although unlikely need to be a multimap as you could have the same number of two sequences.)

Code:

map<string, int> found;
// search and increment
multimap<int,string> freq;
for (iter = found.begin() to found.end())
 freq[iter->second] = iter->first;

On the other hand, with large numbers of sequences you might just want to index each with a number in a std::vector (much as you are doing now but in a map) and create a vector of "sequence_counts" and a map for quick key lookup.

e.g. some rough pseudo-code

Code:

struct seq_count
{
  string name;
  int count;
  seq_count(string s) : name(s), count(0) {}

  // overload operator< to call sort with seq_count's
  bool operator< (const seq_count& rhs)
 {  return count < rhs.count; }
};

int main()
{
map<string,int> seq_table;
vector<seq_count> histogram;
// you might want to call reserve or vector ctor using the number of sequences

for (int i=0 to all sequences)
{
  seq_table["seq_str"] = i;
  histogram[i].name = "seq_str";
}

// search gene
histogram[seq_table["found_str"]].count++;

// sort the sequernces by value
sort(histogram.begin(), histogram.end());
}

Anyway, HTH