This is my 3rd thread today. You guys are very helpful, better than paid support. I am editing a huge website now, a civic portal, but it is going to get over 5 million new entries a year, and I have to create a search engine for it. I was going to create a full-indexing service (I mean the keywords are recorded to a table and the search looks for data in that table). I thought it would be better to ask what techniques you would use to optimize the performance of the search engine. I think this solution can work, but it requires quite high system resources. (They gave me a server with two Opteron 244 processors and 8GB of RAM, so I cannot use anything but MySQL, since it is performance optimized. I hope that is going to be enough, because my employers expect millions of requests a day.)
And one more thing: I plan to use ADODB, as it supports hashing. Do you think that can improve performance too?
Well, my initial thought is that since ADODB is an abstraction layer it will add overhead. If pure speed is important to you and you won't be changing your database, then there is probably no need for it.
How are you doing your indexing? If you have a couple of tables, one of keywords and a second acting as a resolver between the keywords and the documents, then I'd say that is the right way to do it. If, on the other hand, you are indexing the documents and using SQL's LIKE '%...%' feature, then no, that will very quickly cause the computer to grind to a halt.
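Roughly like this (just a sketch to show the shape; the table and column names are made up, not from any real schema):

CREATE TABLE keyword (
    keyword_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    word       VARCHAR(64) NOT NULL,
    UNIQUE KEY (word)
);

-- resolver: which keyword appears in which document
CREATE TABLE keyword_document (
    keyword_id  INT UNSIGNED NOT NULL,
    document_id INT UNSIGNED NOT NULL,
    PRIMARY KEY (keyword_id, document_id)
);

-- find documents containing a word: two indexed lookups, no LIKE scan
SELECT kd.document_id
FROM keyword k
JOIN keyword_document kd ON kd.keyword_id = k.keyword_id
WHERE k.word = 'opteron';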
Since you said it's a website, another option would be to index the website itself rather than the MySQL data. I've tried htdig and found it fast, even on quite a lot of data. It needs some configuration, but it's a lot easier than writing your own search engine. Indexing with htdig requires quite a lot of disk space for the indexes, but disks are cheap.
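If you want to try it, the setup is roughly this (a sketch only; the URL and paths are placeholders, so check the htdig docs for your distro):

# /etc/htdig/htdig.conf (excerpt)
start_url:      http://www.example.com/
limit_urls_to:  ${start_url}
database_dir:   /var/lib/htdig/db

Then run the rundig script (it ships with htdig) from cron to build and refresh the index, and point the htsearch CGI at the same config file for the search form.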
If that is not an option, the basic approach would be to make a table of all words, then a join table with the IDs of the words and the IDs of the documents where they occur. Remember, there aren't that many words in a vocabulary, and you can also "blacklist" a lot of common words like "a", "it", "are" and so on; see the sketch below.
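The blacklist can just be another table that the indexer checks before inserting. A sketch (staging_words stands for wherever your parser dumps the raw words, and INSERT IGNORE relies on a UNIQUE key on keyword.word as in the schema above):

CREATE TABLE stopword (
    word VARCHAR(64) NOT NULL PRIMARY KEY
);
INSERT INTO stopword (word) VALUES ('a'), ('it'), ('are'), ('the'), ('and');

-- at index time, only record words that are not blacklisted
INSERT IGNORE INTO keyword (word)
SELECT w.word
FROM staging_words w
LEFT JOIN stopword s ON s.word = w.word
WHERE s.word IS NULL;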
There are some limitations to using full-text search, but they're not show-stoppers. You can search multiple full-text indexed fields at once, but the columns you search together must be covered by the same FULLTEXT index. If they are, then it's still fast.
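For example (a sketch against a made-up article table; note that in MySQL 5.0 FULLTEXT indexes are only available on MyISAM tables):

CREATE TABLE article (
    article_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    title      VARCHAR(255) NOT NULL,
    body       TEXT NOT NULL,
    FULLTEXT KEY ft_title_body (title, body)
) ENGINE=MyISAM;

-- search both columns at once, ranked by relevance
SELECT article_id, title,
       MATCH (title, body) AGAINST ('search terms') AS score
FROM article
WHERE MATCH (title, body) AGAINST ('search terms')
ORDER BY score DESC;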
Wow... OK... I am not that good at MySQL, so I need some further help. What is this normalization stuff? I guess I need to learn how to design the DB so that I know what those things are.
I have used full-text search on smaller DBs, but I don't think it was quick enough. This DB is going to hold GBs of data after a while. Do you still think it can be fast enough?
Quote: If you have a couple of tables, one of keywords and a second acting as a resolver between the keywords and the documents, then I'd say that is the right way to do it.
Graemef, can you give me some details or an example of your solution?
OK, think of it like this. What you want to do is create separate table(s) for indexing. MySQL does the same thing with its fulltext indexing, only it will do it a lot faster than you can with some complicated set of SELECT statements. If MySQL's full-text search isn't fast enough with the amount of data you have, your solution won't be any better.

Check out the user comments on the MySQL 5.0 Fulltext Documentation, http://dev.mysql.com/doc/refman/5.0/...xt-search.html . Some people are happy with its performance at the sizes you're talking about, some are not, and many have tips on how to get the most out of it.

I would consider doing two things. 1) Test it out with a LOT of test data (real-world if you have it) and see if its performance meets your needs. 2) Have a backup plan handy should MySQL suddenly not suit your needs anymore. That could mean many things, like culling older rows, throwing more hardware at it, or switching to a new DB. From what I see on the net, 2x Opteron 244 processors and 8GB of RAM should handle your data, but there is no way I could know that!
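One more tip for the testing: MySQL 5.0 has a couple of server variables that affect MyISAM fulltext indexes and are worth experimenting with (the variables are real; the values below are only examples):

# my.cnf (excerpt)
ft_min_word_len = 3      # default is 4; lower it to index shorter words
ft_stopword_file = ''    # empty string disables the built-in stopword list

After changing either one you have to rebuild the fulltext indexes, e.g. with REPAIR TABLE yourtable QUICK.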