ProgrammingThis forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.
Notices
Welcome to LinuxQuestions.org, a friendly and active Linux Community.
You are currently viewing LQ as a guest. By joining our community you will have the ability to post topics, receive our newsletter, use the advanced search, subscribe to threads and access many other special features. Registration is quick, simple and absolutely free. Join our community today!
Note that registered members see fewer ads, and ContentLink is completely disabled once you log in.
If you have any problems with the registration process or your account login, please contact us. If you need to reset your password, click here.
Having a problem logging in? Please visit this page to clear all LQ-related cookies.
Get a virtual cloud desktop with the Linux distro that you want in less than five minutes with Shells! With over 10 pre-installed distros to choose from, the worry-free installation life is here! Whether you are a digital nomad or just looking for flexibility, Shells can put your Linux machine on the device that you want to use.
Exclusive for LQ members, get up to 45% off per month. Click here for more info.
think about this:
i have an ip (example: "192.168.0.1") and a name of a real existing person (example: "John Doe")
in my database i have names and ip addresses
now i need a algorithm which is very fast, which can calculate the difference between what is in the database and what input data i have
what is this all about?
i need a algorithm which can evaluate the difference in strings depending on its contents including typos
so it shouldn't matter if i get a string:
"192.168.0.1|John Doe" or "John Doe|192.168.0.1"
or this is a better example:
it shouldn't matter if i get this:
"John Doe"
"Doe John"
for example:
if the user enters "Jon Doe" then my algorithm should find "John Doe" if there is no "Jon Doe" (seems it was a typo by the user (so the algorithm needs fuzzy logic))
why i need a hashing algorithm?
1. i need some scoring stuff which calculates a number which describes how much nonsense the user sent to me (not only yes = nonsense, no = ok (0,1) i also need: a bit nonsense, seems most right) and
2. i think the fastest method doing such stuff will be try to compare hashes
so my questions:
1. anybody has an idea of an hash algorithm which can replace the input done by the user?
(cause i thought, a string doesn't really differ to a hash, both are unique represenations, and hashes can/should be a number so it should be easier and faster to compare)
2. any idea on how i can do a fuzzy logic algorithm on hashes, which can manage the typo described above?
I am not at all persuaded that "hashing" would be of any use to you.
Algorithms like Soundex might be helpful in dealing with some of the ambiguity but you'll probably be using a fairly intensive dictionary-based algorithm.
but soundex only resolves the problem of misspelling
another problem is finding data which is'n in correct order for example ("John Doe" vs. "Doe John")
so i had the following idea
1. separate the string by ' ' character
2. generate a hash on every seperated string
3. now sum up every hash
this results in a unique hash for the whole string, which should speed up database searches for the person without the need of knowing if i have to search for "John Doe" or "Doe John"
and i would also know what will be the surname because of the format (forename,surname) safed in the DB
but this is a problem with the following scenario:
user sends his accommodation address to me
the first time he uses this format:
main street 12
another time the user uses this format (maybe he forgot his password and wants to register a second time ;-) )
mainstreet 12
or
mainstr 12
or
main str 12
soundex can only tell me some possibilities if the user misspelled the word
but how can i avoid abbreviations?
is the only possibility a DB of possible abbreviations?
thx@ll
[EDIT]
is it possible to know "how much the difference" is?
e.g. "John" and "Jon" result both in soundex "J500"
but "John" and "Peter" are very different ("P360")
would it be valid to generate a number like
int soundexhash_john='J'+500;
int soundexhash_peter='P'+360;
and just do "soundexhash_john-soundexhash_peter" to know how much difference is between the Strings?
Q1: would be soundex a good algorithm for validating how much the input differs from the database? (it could also be an IP or something different)
Q2: would be the subtraction of the two (very simple) hashes a good/valid method for this?
basically if you don't care if its john doe or doe john it means you want a rotation
invariant hashing to occur. basically you rotate the string you want until you discover
the lexically smallest (alphabetically smallest) version of said string (will be O(n))
and hash that then use the hash value for whatever it is you are doing.
In this case "john doe", "doe john" and "hn doejo" will all hash to the same value....
Arash Partow
________________________________________________________
Be one who knows what they don't know,
Instead of being one who knows not what they don't know,
Thinking they know everything about all things.
www_partow_net
stupid site wont let me post urls until I've posted at least 3 times, pffft!.......
LinuxQuestions.org is looking for people interested in writing
Editorials, Articles, Reviews, and more. If you'd like to contribute
content, let us know.