PHP: text search in StarOffice and OpenOffice documents, how to do it fast?
I added a text search capability, via Apache and PHP, to one of our Unix servers, to search text in StarOffice (sdw) and OpenOffice (sxw) documents.
It is a must, since Windows XP does not seem to search text in those files (Win98 could search sdw files!).
Searching text in sdw files under Unix turned out to be as easy as calling grep and building the result page with PHP. It rocks: way faster than WinXP's own search utility on .doc files.
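A minimal sketch of that grep-based search, assuming the documents sit under one directory. The function name search_sdw and the use of -F (treat the word as a fixed string) are assumptions for illustration, not part of the original setup:

```shell
# Hypothetical helper: list the .sdw files under a directory that contain a word.
# grep -l prints only the names of matching files; -F treats the word as a
# fixed string (an assumption; drop it to allow regular expressions).
search_sdw() {
    dir=$1
    word=$2
    find "$dir" -type f -name '*.sdw' -exec grep -lF -e "$word" {} +
}
```

PHP would then only need to turn the returned file names into the result page.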
I have a problem with OpenOffice documents, which are zipped XML files. The present search method seems to be very slow.
Here is how I do it now:
1. Run find from PHP to apply some search conditions on the filename (this step is fast, no need for tweaking)
2. For each file found, PHP passes the filename and the search pattern ('words') to a shell script (via shell_exec), which
- calls unzip to extract 'content.xml' from the sxw file and
- pipes it to sed to remove the xml tags from the text, then
- calls grep once for each search word (this is fast)
- returns the filename if all search words were found, and returns nothing if some or all of the search words were missing.
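The per-file script in step 2 can be sketched roughly as below. This is only an illustration under stated assumptions, not the exact script: the function names are hypothetical, the tag stripping is a crude regular expression, and the words are matched as fixed strings.

```shell
#!/bin/sh
# Hypothetical sketch of the per-file script: FILE WORD [WORD...]

# Replace XML tags with a space so that only the document text remains.
# (A crude regex; it does not handle tags that span several lines.)
strip_tags() {
    sed -e 's/<[^>]*>/ /g'
}

# Succeed only if every word given as an argument occurs in the text on stdin.
# One grep call per word, as in the list above; -F is an assumption.
all_words_match() {
    text=$(cat)
    for word in "$@"; do
        printf '%s\n' "$text" | grep -qF -e "$word" || return 1
    done
    return 0
}

# Print the file name when all words are found in content.xml, else nothing.
search_sxw() {
    file=$1
    shift
    # unzip -p writes the named archive member to standard output.
    if unzip -p "$file" content.xml 2>/dev/null | strip_tags | all_words_match "$@"; then
        printf '%s\n' "$file"
    fi
}

# Run only when called with a file name and at least one search word.
if [ "$#" -ge 2 ]; then search_sxw "$@"; fi
```

Note that in this arrangement every file found costs one shell_exec call plus an unzip, a sed and several grep processes.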
Could you give me an idea of how to tweak the above method to make it fast, or tell me whether there is a command-line tool that can 'grep' OpenOffice documents in one step? (zgrep does not seem to do it.)
Last edited by J_Szucs; 11-22-2003 at 06:35 AM.