How to do fuzzy matching on a MySQL field

resetreset · 01-09-2012, 08:34 AM

Hi,
For my website, I want to implement Search,ie. someone searching for "pr1nce" should get stuff like "prince".
My 1st question is: how is this done? (I want a DISCUSSION, not merely links to various libs that do it already).,
and 2) to make things complicated, "prince" will be inside a MySQL db.
How do I do the searching? The only thing that occurred to me is to do a %LIKE% , but this won't match "pr1nce" with "prince". Is there any solution, or should I keep stuff inside a text file (hellacious).

thanks for your help.

Nominal Animal · 01-10-2012, 12:59 AM

Compute a suitable regular expression from the search strings, then use REGEXP pattern in the SQL query.

How to compute a suitable regular expression, then? Well, it depends on how complex searches you want to support. For example, I hate that Google no longer supports +"multiple words" -"multiple words" -type searches. Your users might not need that complex of a search.

For a start, you could go through each search term character by character, and convert each one to a pattern. For example, characters i l 1 I L could all convert to pattern [il1IL] and so on. For security, ignore (or handle as whitespace) all characters you don't know how to convert.

I do not use SQL backends for my websites -- you just need too much hardware to survive a slashdotting using one for my taste -- so I have zero idea whether this is feasible in practice or not. At minimum, you might wish to somehow limit the number of searches done per second/minute.

resetreset · 01-12-2012, 08:52 AM

This is EXACTLY the sort of post I was hoping for, thanks so much Nominal.

But... I DO have some questions, but my brain is just not working now, I'll post again in this thread, OK?

kalleanka · 01-12-2012, 05:37 PM

I will need this as well.

Correct me if im wrong. The idee is to subtract "pr1nce" from "prince" and if its only one character or less in the result it should be listed. How would this look like in a quiery?

resetreset · 01-13-2012, 01:03 AM

What do you mean by "subtract"?

Nominal Animal · 01-13-2012, 02:59 AM

Quote:

Originally Posted by kalleanka

Correct me if im wrong. The idee is to subtract "pr1nce" from "prince"

No, the idea is to search for "character 'p' or character 'P', followed by character 'r' or character 'R', followed by character 'i' or 'I' or 'l' or '1', followed by character 'n' or character 'N', followed by character 'c' or character 'C', followed by character 'e' or character 'E'.".

In MySQL, this is done using for example

Code:

WHERE field REGEXP '[pP][rR][iIl1][nN][cC][eE]'

and in PostgreSQL, using for example

Code:

WHERE field SIMILAR TO '%[pP][rR][iIl1][nN][cC][eE]%'

Note that the % in the latter one are not pattern delimiters, but equivalent to "any string", .* . This is very peculiar, and specific to PostgreSQL. Unlike normal regular expressions, PostgreSQL requires the pattern to match the entire content, therefore the pattern must be prepended and appended with the "any string". For some strange reason, the developers also decided to add % as shorthand for .* .

You do this by transforming the search terms to regular expression patterns using simple rules: each known character in the search term maps to a specific fixed string, for example 'i' to '[iI1l]' and '9' to '[9gq]'. Every unknown pattern and all whitespace you usually transform to .+ to indicate "anything but not empty". (It does make the matching less accurate, but it's usually best when the content strings may contain non-ASCII characters in other charsets.)

If you use UTF-8, you can do things like map 'é' to '(E|e|É|é|È|è|Ê|ê|€)' and 'ä' to '(a|A|ä|Ä|ae|AE)' to account for transliteration and accent differences. Note that the alternate pattern format will work better than alternate characters (e.g. '[EeÉéÈèÊê€]') since each UTF-8 character may be longer than one byte; using the alternate patterns the SQL server does not need to have explicit UTF-8 support.

Guttorm · 01-13-2012, 05:58 AM

Hi

The function soundex can also be used for this. It works very well as long as the language is English, and will catch a lot of spelling errors. Some examples:

Code:

select soundex('prince') = soundex('pr1nce');
select soundex('loser') = soundex('looser');
select soundex('weird') = soundex('wierd');

kalleanka · 01-15-2012, 06:54 AM

with subtract i mean character by character so prince - pr1nce would be 008000. Since its only one none zero it would be listed. Maybe its not that good as since loser - looser would not be listed.