htdig problem
Hi everyone, I don't know if this thread would be better placed in the networking section, but never mind, here's my problem:
I have Apache installed with htdig to parse a large number of HTML, doc, PDF files and so on (a few tens of thousands of files).
Every night the whole thing gets updated via crontab, but every morning I find hundreds of errors in my log files.
Mostly these are PDF files, which get scanned with an external parser (pdftotext), or doc and ppt files.
What puzzles me is that when I run pdftotext by hand on the PDFs that were not merged, sometimes there does not seem to be any problem.
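The manual check described above could be scripted roughly like this. It is only a sketch under assumptions: the function name `check_files`, the 60-second timeout, and the idea of treating a non-zero exit code as a failure are illustrative choices, not anything htdig does itself. It relies on pdftotext accepting `-` as the output name to write the text to stdout.

```python
import subprocess
import sys


def check_files(paths, command=("pdftotext",), timeout=60):
    """Run the converter on each file and return (path, reason) for every failure.

    `command` and `timeout` are illustrative defaults, not htdig settings.
    """
    failures = []
    for path in paths:
        try:
            # "-" as the output name asks pdftotext to write the text to stdout,
            # which is discarded here; only success or failure is of interest.
            result = subprocess.run(
                [*command, path, "-"],
                stdout=subprocess.DEVNULL,
                stderr=subprocess.PIPE,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            failures.append((path, "timeout"))
            continue
        except OSError as exc:
            # For example, the converter binary is not installed or not on PATH.
            failures.append((path, f"could not start converter: {exc}"))
            continue
        if result.returncode != 0:
            failures.append((path, f"exit code {result.returncode}"))
    return failures
```

Calling `check_files(["a.pdf", "b.pdf"])` on the files named in the morning's log would show whether the converter itself fails on them outside the nightly run, or only when htdig calls it.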
Why does htdig have problems with them, then? Could this be due to a lack of memory? Is htdig overloaded by this much data?
What also puzzles me is that very often files that did not get parsed at the last merge now get merged, while some that were parsed last time now fail.
Maybe one solution would be to keep two databases: after every merge, copy the problem files somewhere else, rescan them, and afterwards merge the two databases together?
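The "copy the problem files somewhere else" step could look roughly like the sketch below. The function name `collect_for_rescan` and both directory arguments are hypothetical. The second htdig run over the copied files and the merge of the two databases are not shown, and whether that merge actually helps is still the open question.

```python
import shutil
from pathlib import Path


def collect_for_rescan(failed_paths, source_root, rescan_root):
    """Copy each failed file into rescan_root, keeping its path relative to source_root."""
    source_root = Path(source_root).resolve()
    rescan_root = Path(rescan_root)
    copied = []
    for path in failed_paths:
        src = Path(path).resolve()
        # Raises ValueError if the file is not located under source_root.
        relative = src.relative_to(source_root)
        dest = rescan_root / relative
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)  # copy2 also preserves modification times
        copied.append(dest)
    return copied
```

Keeping the relative paths intact means a rescanned document can still be traced back to its original location.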
Does anybody have experience with this, or could you give me a hint on how to solve the problem some other way?
Last edited by merlin23; 12-02-2004 at 05:13 AM.