need a hash algorithm ignoring input order

Thinking · 01-02-2006, 02:52 AM

hiho@ll

think about this:
i have an ip (example: "192.168.0.1") and a name of a real existing person (example: "John Doe")

in my database i have names and ip addresses

now i need a algorithm which is very fast, which can calculate the difference between what is in the database and what input data i have

what is this all about?
i need a algorithm which can evaluate the difference in strings depending on its contents including typos
so it shouldn't matter if i get a string:
"192.168.0.1|John Doe" or "John Doe|192.168.0.1"
or this is a better example:
it shouldn't matter if i get this:
"John Doe"
"Doe John"

for example:
if the user enters "Jon Doe" then my algorithm should find "John Doe" if there is no "Jon Doe" (seems it was a typo by the user (so the algorithm needs fuzzy logic))

why i need a hashing algorithm?
1. i need some scoring stuff which calculates a number which describes how much nonsense the user sent to me (not only yes = nonsense, no = ok (0,1) i also need: a bit nonsense, seems most right) and
2. i think the fastest method doing such stuff will be try to compare hashes

so my questions:
1. anybody has an idea of an hash algorithm which can replace the input done by the user?
(cause i thought, a string doesn't really differ to a hash, both are unique represenations, and hashes can/should be a number so it should be easier and faster to compare)
2. any idea on how i can do a fuzzy logic algorithm on hashes, which can manage the typo described above?

thx@ll

sundialsvcs · 01-02-2006, 09:53 PM

I am not at all persuaded that "hashing" would be of any use to you.

Algorithms like Soundex might be helpful in dealing with some of the ambiguity but you'll probably be using a fairly intensive dictionary-based algorithm.

Thinking · 01-03-2006, 05:21 AM

soundex looks good
i'm reading about it

thx

but soundex only resolves the problem of misspelling
another problem is finding data which is'n in correct order for example ("John Doe" vs. "Doe John")
so i had the following idea

1. separate the string by ' ' character
2. generate a hash on every seperated string
3. now sum up every hash
this results in a unique hash for the whole string, which should speed up database searches for the person without the need of knowing if i have to search for "John Doe" or "Doe John"
and i would also know what will be the surname because of the format (forename,surname) safed in the DB

but this is a problem with the following scenario:

user sends his accommodation address to me
the first time he uses this format:
main street 12
another time the user uses this format (maybe he forgot his password and wants to register a second time ;-) )
mainstreet 12
or
mainstr 12
or
main str 12

soundex can only tell me some possibilities if the user misspelled the word
but how can i avoid abbreviations?
is the only possibility a DB of possible abbreviations?

thx@ll

[EDIT]
is it possible to know "how much the difference" is?
e.g. "John" and "Jon" result both in soundex "J500"
but "John" and "Peter" are very different ("P360")
would it be valid to generate a number like
int soundexhash_john='J'+500;
int soundexhash_peter='P'+360;
and just do "soundexhash_john-soundexhash_peter" to know how much difference is between the Strings?
Q1: would be soundex a good algorithm for validating how much the input differs from the database? (it could also be an IP or something different)
Q2: would be the subtraction of the two (very simple) hashes a good/valid method for this?

thx
[/EDIT]

arashpartow · 07-12-2006, 06:09 AM

lexical ordering is the key.

basically if you don't care if its john doe or doe john it means you want a rotation
invariant hashing to occur. basically you rotate the string you want until you discover
the lexically smallest (alphabetically smallest) version of said string (will be O(n))
and hash that then use the hash value for whatever it is you are doing.

In this case "john doe", "doe john" and "hn doejo" will all hash to the same value....

Also have a look at:

www_partow_net_programming/hashfunctions/index_html

Arash Partow
________________________________________________________
Be one who knows what they don't know,
Instead of being one who knows not what they don't know,
Thinking they know everything about all things.
www_partow_net

stupid site wont let me post urls until I've posted at least 3 times, pffft!.......