LinuxQuestions.org
Review your favorite Linux distribution.
Home Forums Tutorials Articles Register
Go Back   LinuxQuestions.org > Forums > Non-*NIX Forums > Programming
User Name
Password
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Notices


Reply
  Search this Thread
Old 01-02-2006, 02:52 AM   #1
Thinking
Member
 
Registered: Oct 2003
Posts: 249

Rep: Reputation: 30
need a hash algorithm ignoring input order


hiho@ll

think about this:
i have an ip (example: "192.168.0.1") and a name of a real existing person (example: "John Doe")

in my database i have names and ip addresses

now i need a algorithm which is very fast, which can calculate the difference between what is in the database and what input data i have

what is this all about?
i need a algorithm which can evaluate the difference in strings depending on its contents including typos
so it shouldn't matter if i get a string:
"192.168.0.1|John Doe" or "John Doe|192.168.0.1"
or this is a better example:
it shouldn't matter if i get this:
"John Doe"
"Doe John"

for example:
if the user enters "Jon Doe" then my algorithm should find "John Doe" if there is no "Jon Doe" (seems it was a typo by the user (so the algorithm needs fuzzy logic))

why i need a hashing algorithm?
1. i need some scoring stuff which calculates a number which describes how much nonsense the user sent to me (not only yes = nonsense, no = ok (0,1) i also need: a bit nonsense, seems most right) and
2. i think the fastest method doing such stuff will be try to compare hashes

so my questions:
1. anybody has an idea of an hash algorithm which can replace the input done by the user?
(cause i thought, a string doesn't really differ to a hash, both are unique represenations, and hashes can/should be a number so it should be easier and faster to compare)
2. any idea on how i can do a fuzzy logic algorithm on hashes, which can manage the typo described above?

thx@ll
 
Old 01-02-2006, 09:53 PM   #2
sundialsvcs
LQ Guru
 
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 10,659
Blog Entries: 4

Rep: Reputation: 3941Reputation: 3941Reputation: 3941Reputation: 3941Reputation: 3941Reputation: 3941Reputation: 3941Reputation: 3941Reputation: 3941Reputation: 3941Reputation: 3941
I am not at all persuaded that "hashing" would be of any use to you.

Algorithms like Soundex might be helpful in dealing with some of the ambiguity but you'll probably be using a fairly intensive dictionary-based algorithm.
 
Old 01-03-2006, 05:21 AM   #3
Thinking
Member
 
Registered: Oct 2003
Posts: 249

Original Poster
Rep: Reputation: 30
soundex looks good
i'm reading about it

thx

but soundex only resolves the problem of misspelling
another problem is finding data which is'n in correct order for example ("John Doe" vs. "Doe John")
so i had the following idea

1. separate the string by ' ' character
2. generate a hash on every seperated string
3. now sum up every hash
this results in a unique hash for the whole string, which should speed up database searches for the person without the need of knowing if i have to search for "John Doe" or "Doe John"
and i would also know what will be the surname because of the format (forename,surname) safed in the DB

but this is a problem with the following scenario:

user sends his accommodation address to me
the first time he uses this format:
main street 12
another time the user uses this format (maybe he forgot his password and wants to register a second time ;-) )
mainstreet 12
or
mainstr 12
or
main str 12

soundex can only tell me some possibilities if the user misspelled the word
but how can i avoid abbreviations?
is the only possibility a DB of possible abbreviations?

thx@ll

[EDIT]
is it possible to know "how much the difference" is?
e.g. "John" and "Jon" result both in soundex "J500"
but "John" and "Peter" are very different ("P360")
would it be valid to generate a number like
int soundexhash_john='J'+500;
int soundexhash_peter='P'+360;
and just do "soundexhash_john-soundexhash_peter" to know how much difference is between the Strings?
Q1: would be soundex a good algorithm for validating how much the input differs from the database? (it could also be an IP or something different)
Q2: would be the subtraction of the two (very simple) hashes a good/valid method for this?

thx
[/EDIT]

Last edited by Thinking; 01-03-2006 at 09:19 AM.
 
Old 07-12-2006, 06:09 AM   #4
arashpartow
LQ Newbie
 
Registered: Feb 2006
Posts: 5

Rep: Reputation: 0
lexical ordering is the key.

basically if you don't care if its john doe or doe john it means you want a rotation
invariant hashing to occur. basically you rotate the string you want until you discover
the lexically smallest (alphabetically smallest) version of said string (will be O(n))
and hash that then use the hash value for whatever it is you are doing.

In this case "john doe", "doe john" and "hn doejo" will all hash to the same value....

Also have a look at:


www_partow_net_programming/hashfunctions/index_html



Arash Partow
________________________________________________________
Be one who knows what they don't know,
Instead of being one who knows not what they don't know,
Thinking they know everything about all things.
www_partow_net


stupid site wont let me post urls until I've posted at least 3 times, pffft!.......
 
  


Reply



Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off



Similar Threads
Thread Thread Starter Forum Replies Last Post
need a hash algorithm ignoring input order Thinking Programming 1 01-02-2006 05:15 PM
Ctrl+Shift Unicode input gone, after installing Japanese Input Methodes polemon Linux - Newbie 1 09-20-2005 05:17 PM
Change Password Hash Algorithm Trano Linux - Security 1 08-23-2005 07:48 AM
Sendmail: timeout waiting for input from local during Draining Input andrewstr Linux - Software 0 07-14-2004 01:43 PM
my mouse input is takes as keyboard input in BASH e1000 Slackware 5 12-08-2003 03:00 PM

LinuxQuestions.org > Forums > Non-*NIX Forums > Programming

All times are GMT -5. The time now is 11:15 AM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Open Source Consulting | Domain Registration