ProgrammingThis forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.
Notices
Welcome to LinuxQuestions.org, a friendly and active Linux Community.
You are currently viewing LQ as a guest. By joining our community you will have the ability to post topics, receive our newsletter, use the advanced search, subscribe to threads and access many other special features. Registration is quick, simple and absolutely free. Join our community today!
Note that registered members see fewer ads, and ContentLink is completely disabled once you log in.
If you have any problems with the registration process or your account login, please contact us. If you need to reset your password, click here.
Having a problem logging in? Please visit this page to clear all LQ-related cookies.
Get a virtual cloud desktop with the Linux distro that you want in less than five minutes with Shells! With over 10 pre-installed distros to choose from, the worry-free installation life is here! Whether you are a digital nomad or just looking for flexibility, Shells can put your Linux machine on the device that you want to use.
Exclusive for LQ members, get up to 45% off per month. Click here for more info.
Hi,
For my website, I want to implement Search,ie. someone searching for "pr1nce" should get stuff like "prince".
My 1st question is: how is this done? (I want a DISCUSSION, not merely links to various libs that do it already).,
and 2) to make things complicated, "prince" will be inside a MySQL db.
How do I do the searching? The only thing that occurred to me is to do a %LIKE% , but this won't match "pr1nce" with "prince". Is there any solution, or should I keep stuff inside a text file (hellacious).
Compute a suitable regular expression from the search strings, then use REGEXP pattern in the SQL query.
How to compute a suitable regular expression, then? Well, it depends on how complex searches you want to support. For example, I hate that Google no longer supports +"multiple words" -"multiple words" -type searches. Your users might not need that complex of a search.
For a start, you could go through each search term character by character, and convert each one to a pattern. For example, characters i l 1 I L could all convert to pattern [il1IL] and so on. For security, ignore (or handle as whitespace) all characters you don't know how to convert.
I do not use SQL backends for my websites -- you just need too much hardware to survive a slashdotting using one for my taste -- so I have zero idea whether this is feasible in practice or not. At minimum, you might wish to somehow limit the number of searches done per second/minute.
Correct me if im wrong. The idee is to subtract "pr1nce" from "prince" and if its only one character or less in the result it should be listed. How would this look like in a quiery?
Correct me if im wrong. The idee is to subtract "pr1nce" from "prince"
No, the idea is to search for "character 'p' or character 'P', followed by character 'r' or character 'R', followed by character 'i' or 'I' or 'l' or '1', followed by character 'n' or character 'N', followed by character 'c' or character 'C', followed by character 'e' or character 'E'.".
In MySQL, this is done using for example
Code:
WHERE field REGEXP '[pP][rR][iIl1][nN][cC][eE]'
and in PostgreSQL, using for example
Code:
WHERE field SIMILAR TO '%[pP][rR][iIl1][nN][cC][eE]%'
Note that the % in the latter one are not pattern delimiters, but equivalent to "any string", .* . This is very peculiar, and specific to PostgreSQL. Unlike normal regular expressions, PostgreSQL requires the pattern to match the entire content, therefore the pattern must be prepended and appended with the "any string". For some strange reason, the developers also decided to add % as shorthand for .* .
You do this by transforming the search terms to regular expression patterns using simple rules: each known character in the search term maps to a specific fixed string, for example 'i' to '[iI1l]' and '9' to '[9gq]'. Every unknown pattern and all whitespace you usually transform to .+ to indicate "anything but not empty". (It does make the matching less accurate, but it's usually best when the content strings may contain non-ASCII characters in other charsets.)
If you use UTF-8, you can do things like map 'é' to '(E|e|É|é|È|è|Ê|ê|€)' and 'ä' to '(a|A|ä|Ä|ae|AE)' to account for transliteration and accent differences. Note that the alternate pattern format will work better than alternate characters (e.g. '[EeÉéÈèÊê€]') since each UTF-8 character may be longer than one byte; using the alternate patterns the SQL server does not need to have explicit UTF-8 support.
The function soundex can also be used for this. It works very well as long as the language is English, and will catch a lot of spelling errors. Some examples:
with subtract i mean character by character so prince - pr1nce would be 008000. Since its only one none zero it would be listed. Maybe its not that good as since loser - looser would not be listed.
LinuxQuestions.org is looking for people interested in writing
Editorials, Articles, Reviews, and more. If you'd like to contribute
content, let us know.