Equivalence classes, based on field values and multi-key hashtable

openSauce · 05-12-2010, 07:12 AM

Hello

I've got a set of objects (all of the same type). I'm trying to think of a good way to divide it into equivalence classes, with equivalence of two objects defined as meaning a specified set of attributes are equal for both objects.

More concretely, I've got:
- a Java class with around 50 fields
- a bunch of instances of the class

I want:
- to divide the instances into a few sets
- in each set, each instance has field 1 - field 5 equal to fields 1-5 of the other instances in the set.

The method I've come up with is to generate a hashcode for each instance based on the hashcodes of fields 1-5*, and map the hashcode to one of my sets. Ignoring problems with potential hashcode collisions (which I'm expecting to be too rare to worry about for now), does that sound reasonable? It seems simple enough, but I'm wondering if there's a simpler method I haven't thought of.

If anyone can think of a better method (quicker to implement/quicker to run/easier to maintain and extend), I'd like to know! Let me know if anything's unclear.

cheers,

OS

* I'll generate the hashcode using a method based on Eclipse's generic hashcode method, which looks like this:

Code:

  public int hashCode() {
    final int prime = 31;
    int result = 1;
    result = prime * result + ( ( f1 == null ) ? 0 : f1.hashCode() );
    result = prime * result + ( ( f2 == null ) ? 0 : f2.hashCode() );
    result = prime * result + ( ( f3 == null ) ? 0 : f3.hashCode() );
    result = prime * result + ( ( f4 == null ) ? 0 : f4.hashCode() );
    result = prime * result + ( ( f5 == null ) ? 0 : f5.hashCode() );
    result = prime * result + ( ( f6 == null ) ? 0 : f6.hashCode() );
    return result;
  }
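
For illustration, the grouping step described above could be sketched as follows. This is only a rough sketch under assumptions: `Item`, `keyHash` and `groupByHash` are made-up names, the fields are taken to be strings, and only f1 to f5 feed the hash. `Objects.hash` applies the same 31-based scheme as the method above. Two different field combinations that happen to share a hashcode would land in the same set here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

public class HashGrouping {
    // Hypothetical stand-in for the 50-field class; only f1..f5 define equivalence.
    static final class Item {
        final String f1, f2, f3, f4, f5;
        final String other; // represents the remaining fields, ignored for grouping

        Item(String f1, String f2, String f3, String f4, String f5, String other) {
            this.f1 = f1;
            this.f2 = f2;
            this.f3 = f3;
            this.f4 = f4;
            this.f5 = f5;
            this.other = other;
        }

        // Hash over f1..f5 only (null fields contribute 0).
        int keyHash() {
            return Objects.hash(f1, f2, f3, f4, f5);
        }
    }

    // Maps each hashcode to the list of items that produced it.
    // Caveat: a hash collision merges two distinct equivalence classes.
    static Map<Integer, List<Item>> groupByHash(List<Item> items) {
        Map<Integer, List<Item>> groups = new HashMap<>();
        for (Item item : items) {
            groups.computeIfAbsent(item.keyHash(), k -> new ArrayList<>()).add(item);
        }
        return groups;
    }
}
```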

Sergei Steshenko · 05-13-2010, 12:10 AM

Assuming the fields are strings, why not simply concatenate them, using a character that never occurs in the data as the separator? For example, an ASCII 1 (Control-A) character.

If the fields are not strings but, say, numeric, they can first be trivially stringified.
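
A minimal sketch of the idea, assuming Java and assuming Control-A really never appears in any field value. The names `StringKeyGrouping`, `key` and `group` are made up for illustration, and the caller supplies the function that picks fields 1-5 out of each object.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class StringKeyGrouping {
    // ASCII 1 (Control-A); assumed never to occur inside a field value.
    private static final char SEP = '\u0001';

    // Stringifies each field and joins them, appending the separator after each.
    // Note: a null field becomes the text "null", so it is not distinguished
    // from a field holding the string "null".
    static String key(Object... fields) {
        StringBuilder sb = new StringBuilder();
        for (Object f : fields) {
            sb.append(String.valueOf(f)).append(SEP);
        }
        return sb.toString();
    }

    // Groups items by the string key that keyOf builds for each of them.
    static <T> Map<String, List<T>> group(List<T> items, Function<T, String> keyOf) {
        Map<String, List<T>> groups = new HashMap<>();
        for (T item : items) {
            groups.computeIfAbsent(keyOf.apply(item), k -> new ArrayList<>()).add(item);
        }
        return groups;
    }
}
```

The separator is what keeps ("ab", "c") and ("a", "bc") apart; plain concatenation would turn both into "abc".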

openSauce · 05-13-2010, 02:24 AM

I did wonder about using strings, and using a control char as the separator is a good idea. I'd then be able to guarantee there were no collisions, which I can't if I use a hashcode. I think it would be a bit slower, but with only 10,000 - 100,000 objects I might not notice the difference.

ta0kira · 05-15-2010, 09:08 AM

Are you performing clustering, or is the equivalence operator well-defined? In other words, do you know if two objects are equivalent without knowing about any of the other objects?
Kevin Barry

openSauce · 05-15-2010, 02:34 PM

Yes, it's a well-defined equivalence operation: two objects are equivalent if they have equal values in fields 1-5.
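
Since equivalence is just field-by-field equality, one possible sketch (an assumption, not something settled in this thread) is to use the list of field values itself as the key of a `HashMap`. `List.equals` compares element by element, and `HashMap` falls back on `equals` when two keys share a hashcode, so colliding keys still end up in separate sets. `ListKeyGrouping`, `key` and `group` are made-up names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class ListKeyGrouping {
    // Wraps the chosen field values in an unmodifiable list, which serves as
    // a multi-field key: equal element by element, with a matching hashCode.
    static List<Object> key(Object... fields) {
        return Collections.unmodifiableList(new ArrayList<>(Arrays.asList(fields)));
    }

    // Groups items by the list key that keyOf builds for each of them.
    static <T> Map<List<Object>, List<T>> group(List<T> items,
                                                Function<T, List<Object>> keyOf) {
        Map<List<Object>, List<T>> groups = new HashMap<>();
        for (T item : items) {
            groups.computeIfAbsent(keyOf.apply(item), k -> new ArrayList<>()).add(item);
        }
        return groups;
    }
}
```

For example, the strings "Aa" and "BB" have the same `hashCode()`, so `key("Aa", "x")` and `key("BB", "x")` collide, yet `group` keeps them in two separate sets because the lists are not equal.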

ta0kira · 05-16-2010, 08:35 PM

Hashing seems like a good idea, but you should consider an algorithm based on the limitations of the fields. For example, with normal text you're only using about 1/4 to 1/2 of the possible ASCII character values, so you might be able to compress the string before hashing (hashing is just an irreversible form of compression, anyway). You might also consider making everything lower-case and removing punctuation before hashing. After all of this, you might actually be able to reduce your strings to something that can be compared directly without having to hash, thereby reducing the number of possible collisions.
Kevin Barry
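
The lower-casing and punctuation-stripping step might look something like the sketch below, assuming Java and string fields. `KeyNormalizer` and `normalize` are made-up names. Keep in mind that this changes the equivalence itself: values that differ only in case or punctuation become equal, which may or may not be what you want for your data.

```java
import java.util.Locale;

public class KeyNormalizer {
    // Lower-cases the value and removes ASCII punctuation (\p{Punct}).
    // A null value is mapped to the empty string (an arbitrary choice here).
    static String normalize(String s) {
        if (s == null) {
            return "";
        }
        return s.toLowerCase(Locale.ROOT).replaceAll("\\p{Punct}", "");
    }
}
```

Each field would be passed through `normalize` before it is hashed or compared.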