Equivalence classes, based on field values and multi-key hashtable

ProgrammingThis forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Notices

Welcome to LinuxQuestions.org, a friendly and active Linux Community.

You are currently viewing LQ as a guest. By joining our community you will have the ability to post topics, receive our newsletter, use the advanced search, subscribe to threads and access many other special features. Registration is quick, simple and absolutely free. Join our community today!

Note that registered members see fewer ads, and ContentLink is completely disabled once you log in.

Equivalence classes, based on field values and multi-key hashtable

Hello

I've got a set of objects (all of the same type). I'm trying to think of a good way to divide it into equivalence classes, with equivalence of two objects defined as meaning a specified set of attributes are equal for both objects.

More concretely, I've got:
- a Java class with around 50 fields
- a bunch of instances of the class

I want:
- to divide the instances into a few sets
- in each set, each instance has field 1 - field 5 equal to fields 1-5 of the other instances in the set.

The method I've come up with is to generate a hashcode for each instance based on the hashcodes of fields 1-5*, and map the hashcode to one of my sets. Ignoring problems with potential hashcode collisions (which I'm expecting to be too rare to worry about for now), does that sound reasonable? It seems simple enough, but I'm wondering if there's a simpler method I haven't thought of.

If anyone can think of a better method (quicker to implement/quicker to run/easier to maintain and extend), I'd like to know! Let me know if anything's unclear.

cheers,

OS

* I'll generate the hashcode using a method based on Eclipse's generic hashcode method, which looks like this:

Code:

public int hashCode() {
final int prime = 31;
int result = 1;
result = prime * result + ( ( f1 == null ) ? 0 : f1.hashCode() );
result = prime * result + ( ( f2 == null ) ? 0 : f2.hashCode() );
result = prime * result + ( ( f3 == null ) ? 0 : f3.hashCode() );
result = prime * result + ( ( f4 == null ) ? 0 : f4.hashCode() );
result = prime * result + ( ( f5 == null ) ? 0 : f5.hashCode() );
result = prime * result + ( ( f6 == null ) ? 0 : f6.hashCode() );
return result;
}

I've got a set of objects (all of the same type). I'm trying to think of a good way to divide it into equivalence classes, with equivalence of two objects defined as meaning a specified set of attributes are equal for both objects.

More concretely, I've got:
- a Java class with around 50 fields
- a bunch of instances of the class

I want:
- to divide the instances into a few sets
- in each set, each instance has field 1 - field 5 equal to fields 1-5 of the other instances in the set.

The method I've come up with is to generate a hashcode for each instance based on the hashcodes of fields 1-5*, and map the hashcode to one of my sets. Ignoring problems with potential hashcode collisions (which I'm expecting to be too rare to worry about for now), does that sound reasonable? It seems simple enough, but I'm wondering if there's a simpler method I haven't thought of.

If anyone can think of a better method (quicker to implement/quicker to run/easier to maintain and extend), I'd like to know! Let me know if anything's unclear.

cheers,

OS

* I'll generate the hashcode using a method based on Eclipse's generic hashcode method, which looks like this:

Code:

public int hashCode() {
final int prime = 31;
int result = 1;
result = prime * result + ( ( f1 == null ) ? 0 : f1.hashCode() );
result = prime * result + ( ( f2 == null ) ? 0 : f2.hashCode() );
result = prime * result + ( ( f3 == null ) ? 0 : f3.hashCode() );
result = prime * result + ( ( f4 == null ) ? 0 : f4.hashCode() );
result = prime * result + ( ( f5 == null ) ? 0 : f5.hashCode() );
result = prime * result + ( ( f6 == null ) ? 0 : f6.hashCode() );
return result;
}

Assuming the fields are strings why not simply concatenate them using a non-used in the character as the separator ? For example, an ASCII 1 (Control-A) character.

If the fields are not string and, say, they are numeric, they can first be trivially stringified.

I did wonder about using strings, and that's a good idea about using a control char as the separator, I'd then be able to guarantee there were no collisions, which I couldn't if I make a hashcode. I think it would be a bit slower, but with only 10,000 - 100,000 objects I might not notice the difference.

Are you performing clustering, or is the equivalence operator well-defined? In other words, do you know if two objects are equivalent without knowing about any of the other objects?
Kevin Barry

Hashing seems like a good idea, but you should consider an algorithm based on the limitations of the fields. For example, with normal text you're only using about 1/4 to 1/2 of the possible ASCII character values, so you might be able to compress the string before hashing (hashing is just an irreversible form of compression, anyway.) You also might consider making everything lower-case and removing punctuation before hashing. After all of this, you might actually be able to reduce your strings into something that can be directly compared without having to hash, thereby reducing the number of possible collisions.
Kevin Barry

LinuxQuestions.org is looking for people interested in writing
Editorials, Articles, Reviews, and more. If you'd like to contribute
content, let us know.