I have a fairly massive pdf (24M) of images which is poorly laid out and clunky to view on an intel GPU. I want to
1. Split the pdf into images
2. Batch process them through tesseract OCR, which I have working here. The image quality is good enough that I can expect good OCR.
How do I split that pdf into images? Has anybody done this with gs, or something else? Gimp is not an option speed-wise.
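For reference, a minimal sketch of the gs route (the output device, resolution and file names here are just placeholder choices, adjust to taste):

# one greyscale png per page: page-001.png, page-002.png, ...
gs -dNOPAUSE -dBATCH -sDEVICE=pnggray -r300 -sOutputFile=page-%03d.png book.pdf
# or, with poppler installed, pdftoppm renders pages the same way (pgm output here)
pdftoppm -r 300 -gray book.pdf page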
That seemed to do it all right. They came out in some funny format. The book has 362 pages (=24 Megs), but I was up to image number 1050 & counting when I called a halt with Ctrl+C >:-/. I have over 6 gigs there!
Thanks Didier - never knew it was there. It will be funny to see what it did.
I did! It requires the images in the pdf to be DCT-encoded, which they weren't.
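Assuming the tool in question is pdfimages from poppler (which would match the .pbm/.ppm output described below), the relevant calls look like this; -j only writes .jpg for images that are DCT (JPEG) encoded inside the pdf, everything else comes out as pbm/ppm:

# extract every embedded image as-is: page-000.pbm, page-001.ppm, ...
pdfimages book.pdf page
# -j keeps DCT-encoded images as jpegs instead of converting them to ppm
pdfimages -j book.pdf page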
For each page I ended up with three files: the inverse text (.pbm, white text on black, ~1 Meg each), some background colour for the page (.ppm, ~2.7 Megs each), and some rendered-looking apparition based loosely on the text, like the text melting and falling down (.ppm, ~24 Megs each). I could only keep my sanity in the home dir by running
rm *.ppm
which left the pbms. Now to find some way of mass processing the .pbm images (I feel a script using imagemagick is probably my only installed shot at that), but I'm not in a scripting humour quite yet. Coffee is acting slowly this morning.
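If the white-on-black pbms give tesseract any trouble, an ImageMagick pass can flip them to dark text on a light background first; mogrify overwrites files in place, so this sketch works on a copy (the path is the one used in the script further down):

cd /home/dec/historical_books/temp
mkdir -p inverted && cp *.pbm inverted/
# -negate swaps black and white in every copied page image
mogrify -negate inverted/*.pbm

If you go this route, point the script's FILES glob at the inverted/ directory instead.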
Well, I let tesseract loose in a mini script to do a batch job, and it converted pretty faultlessly, except that it recognizes 'fi' wrongly and I get a white square wherever that letter combination occurs in a word.
I currently have 350+ txt files, which I intend to remove; then I'll find a way of fixing the 'fi' thing in tesseract and alter my small script slightly to send everything to the end of one big file.
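In case those white squares are just the Unicode 'fi' ligature (U+FB01) that the font can't display, rather than a genuine misread, a post-processing pass over the combined output may be all that's needed; this is a guess at the cause, not a confirmed fix:

# replace the fi ligature (U+FB01) with the plain letters f and i
sed -i 's/ﬁ/fi/g' Book.txt

The script in question: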
#!/bin/bash
# OCR every page image and append the text to one big file.
FILES=/home/dec/historical_books/temp/*.pbm
for f in $FILES
do
    echo "Processing $f ..."
    # tesseract's second argument is an output base name, so this writes book.txt.
    # It refuses to pipe even to stdout or write to a fifo.
    tesseract "$f" book
    cat book.txt >> Book.txt
    rm book.txt   # not strictly necessary
done
I get the book. The one refinement I need now is a new page marker between each page. Any ideas?
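One way, sketched against the loop above: append a marker after each page's text, either a visible line or a form feed. The exact marker text is arbitrary; anything that won't occur in the OCR output will do.

    tesseract "$f" book
    cat book.txt >> Book.txt
    # page boundary marker; printf '\f' >> Book.txt would give a form feed instead
    printf '\n----- end of %s -----\n' "$f" >> Book.txt
    rm book.txt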