LinuxQuestions.org
Visit Jeremy's Blog.
Go Back   LinuxQuestions.org > Forums > Non-*NIX Forums > Programming
User Name
Password
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Notices


Reply
  Search this Thread
Old 01-09-2012, 08:34 AM   #1
resetreset
Senior Member
 
Registered: Mar 2008
Location: Cyberspace
Distribution: Dynebolic, Ubuntu 10.10
Posts: 1,340

Rep: Reputation: 62
How to do fuzzy matching on a MySQL field


Hi,
For my website, I want to implement Search,ie. someone searching for "pr1nce" should get stuff like "prince".
My 1st question is: how is this done? (I want a DISCUSSION, not merely links to various libs that do it already).,
and 2) to make things complicated, "prince" will be inside a MySQL db.
How do I do the searching? The only thing that occurred to me is to do a %LIKE% , but this won't match "pr1nce" with "prince". Is there any solution, or should I keep stuff inside a text file (hellacious).

thanks for your help.
 
Old 01-10-2012, 12:59 AM   #2
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 948Reputation: 948Reputation: 948Reputation: 948Reputation: 948Reputation: 948Reputation: 948Reputation: 948
Compute a suitable regular expression from the search strings, then use REGEXP pattern in the SQL query.

How to compute a suitable regular expression, then? Well, it depends on how complex searches you want to support. For example, I hate that Google no longer supports +"multiple words" -"multiple words" -type searches. Your users might not need that complex of a search.

For a start, you could go through each search term character by character, and convert each one to a pattern. For example, characters i l 1 I L could all convert to pattern [il1IL] and so on. For security, ignore (or handle as whitespace) all characters you don't know how to convert.

I do not use SQL backends for my websites -- you just need too much hardware to survive a slashdotting using one for my taste -- so I have zero idea whether this is feasible in practice or not. At minimum, you might wish to somehow limit the number of searches done per second/minute.
 
Old 01-12-2012, 08:52 AM   #3
resetreset
Senior Member
 
Registered: Mar 2008
Location: Cyberspace
Distribution: Dynebolic, Ubuntu 10.10
Posts: 1,340

Original Poster
Rep: Reputation: 62
This is EXACTLY the sort of post I was hoping for, thanks so much Nominal.

But... I DO have some questions, but my brain is just not working now, I'll post again in this thread, OK?
 
Old 01-12-2012, 05:37 PM   #4
kalleanka
Member
 
Registered: Aug 2003
Location: Mallorca, Spain
Distribution: xubuntu
Posts: 551

Rep: Reputation: 38
I will need this as well.

Correct me if im wrong. The idee is to subtract "pr1nce" from "prince" and if its only one character or less in the result it should be listed. How would this look like in a quiery?
 
Old 01-13-2012, 01:03 AM   #5
resetreset
Senior Member
 
Registered: Mar 2008
Location: Cyberspace
Distribution: Dynebolic, Ubuntu 10.10
Posts: 1,340

Original Poster
Rep: Reputation: 62
What do you mean by "subtract"?
 
Old 01-13-2012, 02:59 AM   #6
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 948Reputation: 948Reputation: 948Reputation: 948Reputation: 948Reputation: 948Reputation: 948Reputation: 948
Quote:
Originally Posted by kalleanka View Post
Correct me if im wrong. The idee is to subtract "pr1nce" from "prince"
No, the idea is to search for "character 'p' or character 'P', followed by character 'r' or character 'R', followed by character 'i' or 'I' or 'l' or '1', followed by character 'n' or character 'N', followed by character 'c' or character 'C', followed by character 'e' or character 'E'.".

In MySQL, this is done using for example
Code:
WHERE field REGEXP '[pP][rR][iIl1][nN][cC][eE]'
and in PostgreSQL, using for example
Code:
WHERE field SIMILAR TO '%[pP][rR][iIl1][nN][cC][eE]%'
Note that the % in the latter one are not pattern delimiters, but equivalent to "any string", .* . This is very peculiar, and specific to PostgreSQL. Unlike normal regular expressions, PostgreSQL requires the pattern to match the entire content, therefore the pattern must be prepended and appended with the "any string". For some strange reason, the developers also decided to add % as shorthand for .* .

You do this by transforming the search terms to regular expression patterns using simple rules: each known character in the search term maps to a specific fixed string, for example 'i' to '[iI1l]' and '9' to '[9gq]'. Every unknown pattern and all whitespace you usually transform to .+ to indicate "anything but not empty". (It does make the matching less accurate, but it's usually best when the content strings may contain non-ASCII characters in other charsets.)

If you use UTF-8, you can do things like map 'é' to '(E|e|É|é|È|è|Ê|ê|€)' and 'ä' to '(a|A|ä|Ä|ae|AE)' to account for transliteration and accent differences. Note that the alternate pattern format will work better than alternate characters (e.g. '[EeÉéÈèÊê€]') since each UTF-8 character may be longer than one byte; using the alternate patterns the SQL server does not need to have explicit UTF-8 support.
 
Old 01-13-2012, 05:58 AM   #7
Guttorm
Senior Member
 
Registered: Dec 2003
Location: Trondheim, Norway
Distribution: Debian and Ubuntu
Posts: 1,453

Rep: Reputation: 446Reputation: 446Reputation: 446Reputation: 446Reputation: 446
Hi

The function soundex can also be used for this. It works very well as long as the language is English, and will catch a lot of spelling errors. Some examples:

Code:
select soundex('prince') = soundex('pr1nce');
select soundex('loser') = soundex('looser');
select soundex('weird') = soundex('wierd');
 
Old 01-15-2012, 06:54 AM   #8
kalleanka
Member
 
Registered: Aug 2003
Location: Mallorca, Spain
Distribution: xubuntu
Posts: 551

Rep: Reputation: 38
with subtract i mean character by character so prince - pr1nce would be 008000. Since its only one none zero it would be listed. Maybe its not that good as since loser - looser would not be listed.
 
  


Reply


Thread Tools Search this Thread
Search this Thread:

Advanced Search

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off



Similar Threads
Thread Thread Starter Forum Replies Last Post
How to do fuzzy matching on a MySQL field? resetreset Programming 2 02-18-2011 02:42 AM
MYSQL enumerated field eodchop Linux - Enterprise 0 06-08-2009 08:34 AM
Update several field in MySQL h725 Linux - Server 1 09-14-2008 05:08 AM
set mysql default value to value of other field kpachopoulos Programming 1 10-11-2007 09:04 PM
fuzzy mysql with PHP marghorp Linux - Software 9 04-23-2004 03:18 PM

LinuxQuestions.org > Forums > Non-*NIX Forums > Programming

All times are GMT -5. The time now is 04:45 AM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Open Source Consulting | Domain Registration