12-02-2004, 05:09 AM   #1
merlin23
htdig problem


Hi everyone, I don't know whether this thread would be better placed in the networking section, but never mind, here's my problem:

I have Apache installed with htdig to index a large number of HTML, DOC, PDF and other files (a few tens of thousands of files in total).
Every night the whole index gets updated via crontab, but every morning I find hundreds of errors in my log files.
Mostly these concern PDF files, which get scanned with an external parser (pdftotext), or DOC and PPT files.

What puzzles me is that when I run pdftotext by hand on the PDFs that did not get merged, there sometimes does not seem to be any problem.
Why did htdig have problems with them? Could this be due to a lack of memory? Is htdig overloaded by so much data?
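
One way to check whether these failures are reproducible outside htdig is to rerun the converter on each failing file with a time limit and record how each attempt ends. Below is a minimal Python sketch of that idea. The function name `retry_conversion`, the 60-second timeout, and the assumption that the list of failing paths has already been pulled out of the log are all made up for illustration; the only thing taken from pdftotext itself is that `pdftotext FILE -` writes the text to standard output.

```python
import subprocess


def retry_conversion(paths, command=("pdftotext",), timeout=60):
    """Run the converter once per file and report how each attempt ended.

    `paths` is assumed to be a list of files that failed during the
    nightly run (how they are extracted from the log is not shown here).
    """
    results = {}
    for path in paths:
        try:
            # "<converter> FILE -" : the trailing "-" asks pdftotext to
            # write to stdout, which is discarded; only the outcome matters.
            proc = subprocess.run(
                [*command, path, "-"],
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            results[path] = "timeout"
        except FileNotFoundError:
            results[path] = "converter not found"
        else:
            if proc.returncode == 0:
                results[path] = "ok"
            else:
                results[path] = f"exit {proc.returncode}"
    return results
```

If most files come back as "ok" when converted one at a time, the documents themselves are probably fine and the nightly failures are more likely related to load, memory or timing than to broken PDFs.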

What also puzzles me is that very often files that did not get parsed during the last merge now get merged, while some that were parsed last time now fail.
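
To put numbers on this, the sets of failed files from two consecutive nights can be compared. The sketch below is only an illustration: `compare_runs` is a hypothetical helper, and it assumes the failing file names have already been collected from each night's log into two lists.

```python
def compare_runs(failed_before, failed_now):
    """Split failures into persistent, recovered and new ones.

    Both arguments are iterables of file names that failed in the
    previous run and in the current run, respectively.
    """
    before, now = set(failed_before), set(failed_now)
    return {
        "persistent": sorted(before & now),  # failed both nights
        "recovered": sorted(before - now),   # failed before, fine now
        "new": sorted(now - before),         # fine before, failed now
    }
```

A small "persistent" group together with large "recovered" and "new" groups would suggest that the failures do not depend on the individual files.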

Maybe one solution would be to keep two databases: on every merge, copy the problem files somewhere else, rescan them into the second database, and afterwards merge the two databases together?

Does anybody have experience with this, or could you give me a hint on how to solve the problem in another way?

Last edited by merlin23; 12-02-2004 at 05:13 AM.