PHP: text search in StarOffice and OpenOffice documents, how to do it fast?
I added a text search feature, built on Apache and PHP, to one of our Unix servers. It searches the text of StarOffice (sdw) and OpenOffice (sxw) documents.
We need it because Windows XP does not seem to search text in those files (Win98 did for sdw!).
Searching sdw files under Unix turned out to be as easy as calling grep and building the result page in PHP. It rocks: way faster than WinXP's own search utility on .doc files.
The problem is OpenOffice documents, which are zipped XML files. My current search method is very slow.
Here is how I do it now:
1. Run find from PHP to apply some search conditions on the filename (this step is fast, no need for tweaking)
2. For each file found, PHP passes the filename and the search pattern (the 'words') to a shell script (via shell_exec) that
- calls unzip to extract 'content.xml' from the sxw file and
- pipes it to sed to remove the XML tags from the text, then
- calls grep once for each search word (this is fast)
- returns the filename if all search words were found, and returns nothing otherwise.
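
To make the steps above concrete, here is a rough shell sketch of the per-file part. It is not my actual script, only a minimal illustration: the function names `strip_tags`, `has_all_words` and `sxw_match` are made up for this example, and it assumes `unzip`, `sed` and `grep` are on the PATH and that matching should be case-insensitive on plain substrings.

```shell
#!/bin/sh
# Hypothetical helper illustrating the per-file steps described above.

# Remove XML tags from the text on stdin (entities such as &amp; are left as-is).
strip_tags() {
    sed -e 's/<[^>]*>//g'
}

# Read text on stdin; succeed only if every argument occurs in it.
# Each word is matched as a fixed string, ignoring case.
has_all_words() {
    text=$(cat)
    for word in "$@"; do
        printf '%s\n' "$text" | grep -q -i -F -e "$word" || return 1
    done
    return 0
}

# Print the file name if its content.xml contains all search words,
# and print nothing otherwise.
sxw_match() {
    file=$1
    shift
    if unzip -p "$file" content.xml 2>/dev/null | strip_tags | has_all_words "$@"; then
        printf '%s\n' "$file"
    fi
}
```

With the functions loaded, a call such as `sxw_match report.sxw budget invoice` (file name and words are placeholders) would print `report.sxw` only when both words occur in the document. Note that this still starts one unzip and one sed process per file.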
Could you give me an idea of how to speed up the above method, or is there a command-line tool that can 'grep' OpenOffice documents in one step? (zgrep does not seem to do it.)