LinuxQuestions.org
Slackware This Forum is for the discussion of Slackware Linux.

Old 06-11-2014, 11:37 AM   #1
business_kid
LQ Guru
 
Registered: Jan 2006
Location: Ireland
Distribution: Slackware, Slarm64 & Android
Posts: 16,835

Rep: Reputation: 2424
Neat image handling wheeze required.


I have a fairly massive PDF (24 MB) made up of images, which is poorly laid out and clunky to view on an Intel GPU. I want to:

1. Split the PDF into images.
2. Batch process them through tesseract OCR, which I have working here. The image quality is good enough that I can expect good OCR.

How do I split that PDF into images? Has anybody done this with gs, or something else? Gimp is not an option speedwise.
 
Old 06-11-2014, 11:46 AM   #2
Didier Spaier
LQ Addict
 
Registered: Nov 2008
Location: Paris, France
Distribution: Slint64-15.0
Posts: 11,146

Rep: Reputation: Disabled
I'd use pdfimages, shipped in the poppler package.

Last edited by Didier Spaier; 06-11-2014 at 11:51 AM. Reason: Sentence was missing a verb.
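
A minimal sketch of how that might look, assuming the PDF is called book.pdf (a made-up name) and that a trial run on a few pages is wanted first. The -f and -l options of pdfimages select the first and last page to scan, and the last argument is the prefix used for the output file names. The wrapper function extract_images is hypothetical, not part of poppler.

```shell
#!/bin/bash
# Sketch only: extract the embedded images from a page range of a PDF.
# extract_images is a hypothetical helper; book.pdf and the "page" prefix
# are example names, not taken from the thread.
extract_images() {
  local pdf=$1 first=$2 last=$3 prefix=$4
  # -f/-l limit the page range; files are written as PREFIX-nnn.ppm or .pbm
  pdfimages -f "$first" -l "$last" "$pdf" "$prefix"
}

# Example: try pages 1 to 5 before committing to the whole book.
# extract_images book.pdf 1 5 page
```

Running it on a handful of pages first shows how many images each page produces and how large they are.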
 
Old 06-11-2014, 01:35 PM   #3
business_kid
LQ Guru
 
Registered: Jan 2006
Location: Ireland
Distribution: Slackware, Slarm64 & Android
Posts: 16,835

Original Poster
Rep: Reputation: 2424
That seemed to do it all right. They came out in some funny format. The book has 362 pages (= 24 MB), but I was up to image number 1050 and counting when I called a halt with Ctrl-C >:-/. I have over 6 gigs there!

Thanks Didier - never knew it was there. It will be funny to see what it did.
 
Old 06-11-2014, 01:43 PM   #4
Didier Spaier
LQ Addict
 
Registered: Nov 2008
Location: Paris, France
Distribution: Slint64-15.0
Posts: 11,146

Rep: Reputation: Disabled
Quote:
Originally Posted by business_kid
I have over 6 gigs there!
Use the -j option: JPEG files are *much* lighter than PPM or PBM ones.
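
Since -j only writes JPEG files for images that are already stored as DCT (JPEG) data in the PDF, it may be worth checking the encodings first. The sketch below assumes a poppler version whose pdfimages supports -list and prints the encoding in the ninth column ("enc"). The function name count_jpeg_images and the file name book.pdf are made up for illustration.

```shell
#!/bin/bash
# Sketch only: count how many embedded images are DCT (JPEG) encoded.
# Assumes pdfimages -list is available and that the 9th column of its
# output is the encoding ("enc"). count_jpeg_images is a hypothetical name.
count_jpeg_images() {
  pdfimages -list "$1" | awk '$9 == "jpeg" { n++ } END { print n + 0 }'
}

# Example (book.pdf is a made-up name):
# count_jpeg_images book.pdf
```

If the count is zero, -j will not make the output any smaller, and the images come out as PPM or PBM as before.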
 
Old 06-12-2014, 03:49 AM   #5
business_kid
LQ Guru
 
Registered: Jan 2006
Location: Ireland
Distribution: Slackware, Slarm64 & Android
Posts: 16,835

Original Poster
Rep: Reputation: 2424
I did! It only writes JPEGs for images stored in DCT format inside the PDF, which these weren't.

For each page I ended up with three images: inverse text (white text on black) as a .pbm at ~1 MB; some background colour for the page as a .ppm at ~2.7 MB; and some rendered-looking apparition based loosely on the text (like the text melting and falling down) as a .ppm at 24 MB. I could only keep sanity in the home dir by running

rm *.ppm

which left the pbms. Now to find some way of mass processing .pbm images (I feel a script using imagemagick is probably my only installed shot at that), but I'm not in scripting humour quite yet. Coffee is acting slowly this morning.
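
One possible shape for such an ImageMagick script, as a rough sketch: it inverts each white-on-black .pbm into black-on-white and writes the result to a separate directory. It assumes the convert command from ImageMagick is installed. The function name invert_pages and the directory names are made up.

```shell
#!/bin/bash
# Sketch only: invert every .pbm in a directory with ImageMagick's convert.
# invert_pages and the directory arguments are hypothetical names.
invert_pages() {
  local src=$1 dest=$2 f
  mkdir -p "$dest"
  for f in "$src"/*.pbm; do
    [ -e "$f" ] || continue            # the glob matched nothing
    # -negate swaps black and white; the originals are left untouched
    convert "$f" -negate "$dest/$(basename "$f")"
  done
}

# Example: invert_pages . ./inverted
```

Writing to a second directory means the script can be re-run without processing its own output.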
 
Old 06-12-2014, 03:58 AM   #6
Didier Spaier
LQ Addict
 
Registered: Nov 2008
Location: Paris, France
Distribution: Slint64-15.0
Posts: 11,146

Rep: Reputation: Disabled
Quote:
Originally Posted by business_kid
Coffee is acting slowly this morning.
I will have coffee in a few minutes, so I can understand your feeling.

Good luck.
 
Old 06-13-2014, 03:14 AM   #7
business_kid
LQ Guru
 
Registered: Jan 2006
Location: Ireland
Distribution: Slackware, Slarm64 & Android
Posts: 16,835

Original Poster
Rep: Reputation: 2424
Well, I let tesseract loose in a mini script to do a batch job, and it converted pretty faultlessly, except that it recognizes 'fi' wrongly and I get a white square wherever that letter combination occurs in a word.

I currently have 350+ txt files. I intend to remove them, find a way of fixing the 'fi' thing in tesseract, and alter my small script slightly to append them all to one big file.
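
A possible workaround, offered as a guess rather than a diagnosis: if the white square is really the Unicode "fi" ligature (U+FB01), which many fonts cannot display, the text can be cleaned up after OCR instead of changing tesseract. The sketch assumes UTF-8 text files. The function name fix_ligatures is made up.

```shell
#!/bin/bash
# Sketch only: replace Unicode ligatures with plain letters.
# Assumes the "white square" is U+FB01 (fi) or U+FB02 (fl) in UTF-8 text.
# fix_ligatures is a hypothetical name.
fix_ligatures() {
  sed -e 's/ﬁ/fi/g' -e 's/ﬂ/fl/g'
}

# Example: fix_ligatures < book.txt > book-fixed.txt
```

The function reads standard input and writes standard output, so it can be dropped into a pipeline or run over the finished book in one pass.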
 
Old 06-13-2014, 07:43 AM   #8
business_kid
LQ Guru
 
Registered: Jan 2006
Location: Ireland
Distribution: Slackware, Slarm64 & Android
Posts: 16,835

Original Poster
Rep: Reputation: 2424
This nearly does it:

Code:
#!/bin/bash
# OCR every .pbm page image and append the text to one big file.
for f in /home/dec/historical_books/temp/*.pbm
do
  echo "Processing $f file..."
  # The second argument is the output base name, so tesseract writes
  # book.txt. It refuses to pipe even to stdout or write to a fifo.
  # "$f" is quoted so file names containing spaces survive.
  tesseract "$f" book
  cat book.txt >> Book.txt
  rm book.txt # not strictly necessary
done
I get the book. The one refinement I need now is a page marker between pages. Any ideas?
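
One idea, as a sketch: write a marker straight after each page's text inside the loop. A form feed character (\f) is a common choice of page break, and a visible line of text would work just as well. The helper name append_page is made up.

```shell
#!/bin/bash
# Sketch only: append one page of OCR text plus a page marker to the book.
# append_page is a hypothetical helper; the form feed (\f) is one possible
# marker, and a visible line such as "---- page ----" would also do.
append_page() {
  local page_txt=$1 book=$2
  cat "$page_txt" >> "$book"
  printf '\f\n' >> "$book"
}

# In the loop, this would replace the plain cat:
# append_page book.txt Book.txt
```

Many pagers and printers treat the form feed as a page break, so the marker stays out of the way when reading the text.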
 
  

