Equivalence classes, based on field values and multi-key hashtable

ProgrammingThis forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Notices

Welcome to LinuxQuestions.org, a friendly and active Linux Community.

You are currently viewing LQ as a guest. By joining our community you will have the ability to post topics, receive our newsletter, use the advanced search, subscribe to threads and access many other special features. Registration is quick, simple and absolutely free. Join our community today!

Note that registered members see fewer ads, and ContentLink is completely disabled once you log in.

If you have any problems with the registration process or your account login, please contact us. If you need to reset your password, click here.

Having a problem logging in? Please visit this page to clear all LQ-related cookies.

Introduction to Linux - A Hands on Guide

This guide was created as an overview of the Linux Operating System, geared toward new users as an exploration tour and getting started guide, with exercises at the end of each chapter.
For more advanced trainees it can be a desktop reference, and a collection of the base knowledge needed to proceed with system and network administration. This book contains many real life examples derived from the author's experience as a Linux system and network administrator, trainer and consultant. They hope these examples will help you to get a better understanding of the Linux system and that you feel encouraged to try out things on your own.

Click Here to receive this Complete Guide absolutely free.

Equivalence classes, based on field values and multi-key hashtable

Hello

I've got a set of objects (all of the same type). I'm trying to think of a good way to divide it into equivalence classes, with equivalence of two objects defined as meaning a specified set of attributes are equal for both objects.

More concretely, I've got:
- a Java class with around 50 fields
- a bunch of instances of the class

I want:
- to divide the instances into a few sets
- in each set, each instance has field 1 - field 5 equal to fields 1-5 of the other instances in the set.

The method I've come up with is to generate a hashcode for each instance based on the hashcodes of fields 1-5*, and map the hashcode to one of my sets. Ignoring problems with potential hashcode collisions (which I'm expecting to be too rare to worry about for now), does that sound reasonable? It seems simple enough, but I'm wondering if there's a simpler method I haven't thought of.

If anyone can think of a better method (quicker to implement/quicker to run/easier to maintain and extend), I'd like to know! Let me know if anything's unclear.

cheers,

OS

* I'll generate the hashcode using a method based on Eclipse's generic hashcode method, which looks like this:

Code:

public int hashCode() {
final int prime = 31;
int result = 1;
result = prime * result + ( ( f1 == null ) ? 0 : f1.hashCode() );
result = prime * result + ( ( f2 == null ) ? 0 : f2.hashCode() );
result = prime * result + ( ( f3 == null ) ? 0 : f3.hashCode() );
result = prime * result + ( ( f4 == null ) ? 0 : f4.hashCode() );
result = prime * result + ( ( f5 == null ) ? 0 : f5.hashCode() );
result = prime * result + ( ( f6 == null ) ? 0 : f6.hashCode() );
return result;
}

I've got a set of objects (all of the same type). I'm trying to think of a good way to divide it into equivalence classes, with equivalence of two objects defined as meaning a specified set of attributes are equal for both objects.

More concretely, I've got:
- a Java class with around 50 fields
- a bunch of instances of the class

I want:
- to divide the instances into a few sets
- in each set, each instance has field 1 - field 5 equal to fields 1-5 of the other instances in the set.

The method I've come up with is to generate a hashcode for each instance based on the hashcodes of fields 1-5*, and map the hashcode to one of my sets. Ignoring problems with potential hashcode collisions (which I'm expecting to be too rare to worry about for now), does that sound reasonable? It seems simple enough, but I'm wondering if there's a simpler method I haven't thought of.

If anyone can think of a better method (quicker to implement/quicker to run/easier to maintain and extend), I'd like to know! Let me know if anything's unclear.

cheers,

OS

* I'll generate the hashcode using a method based on Eclipse's generic hashcode method, which looks like this:

Code:

public int hashCode() {
final int prime = 31;
int result = 1;
result = prime * result + ( ( f1 == null ) ? 0 : f1.hashCode() );
result = prime * result + ( ( f2 == null ) ? 0 : f2.hashCode() );
result = prime * result + ( ( f3 == null ) ? 0 : f3.hashCode() );
result = prime * result + ( ( f4 == null ) ? 0 : f4.hashCode() );
result = prime * result + ( ( f5 == null ) ? 0 : f5.hashCode() );
result = prime * result + ( ( f6 == null ) ? 0 : f6.hashCode() );
return result;
}

Assuming the fields are strings why not simply concatenate them using a non-used in the character as the separator ? For example, an ASCII 1 (Control-A) character.

If the fields are not string and, say, they are numeric, they can first be trivially stringified.

I did wonder about using strings, and that's a good idea about using a control char as the separator, I'd then be able to guarantee there were no collisions, which I couldn't if I make a hashcode. I think it would be a bit slower, but with only 10,000 - 100,000 objects I might not notice the difference.

Are you performing clustering, or is the equivalence operator well-defined? In other words, do you know if two objects are equivalent without knowing about any of the other objects?
Kevin Barry

Hashing seems like a good idea, but you should consider an algorithm based on the limitations of the fields. For example, with normal text you're only using about 1/4 to 1/2 of the possible ASCII character values, so you might be able to compress the string before hashing (hashing is just an irreversible form of compression, anyway.) You also might consider making everything lower-case and removing punctuation before hashing. After all of this, you might actually be able to reduce your strings into something that can be directly compared without having to hash, thereby reducing the number of possible collisions.
Kevin Barry

LinuxQuestions.org is looking for people interested in writing
Editorials, Articles, Reviews, and more. If you'd like to contribute
content, let us know.