LinuxQuestions.org > Forums > Linux Forums > Linux - Software
Old 04-12-2016, 11:44 AM   #1
business_kid
LQ Guru
 
Registered: Jan 2006
Location: Ireland
Distribution: Slackware, Slarm64 & Android
Posts: 16,251

Rep: 2321
Modifying Antialiasing


Normally, antialiasing is a good thing for the human eye. It's not as good for OCR. In the attached OCR.png, if you zoom to 400% or more the letters lose all form, and not surprisingly, the results (Results.txt) are crap.

The text was extracted from a PDF at 400 dpi using pdftoppm with TIFF output. I tried all the various combinations of antialiasing, aaVector, & freetype (on or off), but it made no difference. Pdfimages is worse. No 'ImageMagick convert' option permutation worked for me, and gs is its usual intransigent self.

Has anyone a good image processing treatment for getting better formed characters to give the OCR a fighting chance?
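For reference, the extraction described above corresponds roughly to an invocation like this (book.pdf and the output prefix are placeholder names; -aa and -aaVector are pdftoppm's antialiasing switches):

```shell
# Render the PDF to 400 dpi TIFF page images with antialiasing off.
pdftoppm -r 400 -aa no -aaVector no -tiff book.pdf page
```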
Attached: OCR.png (33.3 KB)
Attached: Results.txt (157 bytes)
 
Old 04-13-2016, 05:38 AM   #2
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,789

Rep: 7304
I would use a much higher resolution (instead of 400 dpi).
 
Old 04-13-2016, 10:19 AM   #3
business_kid
Original Poster
Thanks. How high?

Disk space is an issue. That book is 601 pages. ImageMagick wants to process all 601 pages together, and the original is only 200-300 dpi at most.
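As a back-of-envelope check of the disk-space concern (a 6x9 inch page size is an assumption), raw 8-bit grayscale at 600 dpi works out to:

```shell
# Bytes per page = (width_in * dpi) * (height_in * dpi) for 8-bit grayscale.
echo $(( 6*600 * 9*600 ))                        # 19440000 bytes, ~18.5 MiB per page
echo $(( 6*600 * 9*600 * 601 / 1024 / 1024 ))    # ~11142 MiB (~11 GiB) for 601 pages, uncompressed
```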
 
Old 04-13-2016, 11:17 AM   #4
pan64
I have no idea. How much disk space do you have, and how much was used at 400 dpi? You can also try black & white (or grayscale) conversion, which will lower the required disk space.
Try 600 dpi, and if that is not good enough, increase it further.

pdftoppm can handle page ranges, so you can try -f and -l to check a few pages first.
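A sketch of the page-range idea (placeholder file names): -f and -l select the first and last page, and -gray drops color to cut disk usage.

```shell
# Render only pages 10-12 at 600 dpi as grayscale PNGs to test
# quality and disk usage before committing to all 601 pages.
pdftoppm -f 10 -l 12 -r 600 -gray -png book.pdf sample
```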
 
Old 04-14-2016, 05:47 AM   #5
business_kid
Original Poster
I don't think that's the answer, or that there is one. So I've marked this solved for the moment. If I get a good idea, I'll come up with a better question.

I did some of this as work in 2008/9 using ABBYY FineReader and Adobe Acrobat. You can see some of my work, now deeply buried, on www.eneclann.ie. 300 dpi was usually OK for the original scans, 400 dpi if the print was small. 200 dpi was a no-go. For ancient handwritten stuff, 400-600 dpi was normal. I never used 800 dpi.

Scans went to JPEGs, which were OCR'ed and then PDF'ed. The OCR text was put under the image, so that the PDF was searchable but the text layer was not visible. The OCR text was not edited to correct even obvious mistakes. Now here's the killer: Adobe anything is a Windows memory hog, so the PDFs had to be kept small by reducing the image resolution so Windows wouldn't crash. 200 dpi was usually the max you could put out on any large book.

In the attached images, I used 800 dpi in pdftoppm and thresholded at various levels in GIMP. It looks like characters from a Sinclair ZX81. A pdftoppm at higher resolution will just produce bigger crap.

So the magic routine to add back resolution in a fashion that suits OCR has yet to be written. Antialiasing, which adds shades of grey in odd places for our eyes to merge, doesn't help. When it's in the original, you can't rescue it without embedding the exact font in a rescue routine.

The immediate solution for me is to buy paper, and a scanner to OCR it if I can't read the paper. There are some reference books I will need to have on hand in coming years.

EDIT: Forgot to mention: all the 800 dpi output (original & all thresholds) was pure crap.
Attached: Threshold_100.png (11.7 KB), Thresholded_120.png (11.8 KB), Thresholded_140.png (11.9 KB)

Last edited by business_kid; 04-14-2016 at 05:53 AM.
 
Old 04-14-2016, 06:07 AM   #6
pan64
Looks like the original letters are too small. You probably need to enlarge them (optically).
 
Old 04-15-2016, 02:28 AM   #7
business_kid
Original Poster
As someone who has done a fair bit of OCR, I can say that's fairly standard-sized book print. You often get smaller. It's the low resolution of the original scan (150 dpi) that's the killer.

I also know that to get readable books you need the best OCR followed by a massive spell correction with the images on hand. If I buy another PC, I will equip it for OCR: a minimum of 4 cores, masses of RAM, 2 large hard disks, and an edge scanner designed for scanning books.

EDIT: To make matters worse, that PDF is 150 dpi in JPEG, which is a lossy compression format. That loses more info. Zoom any JPEG to 400% or 800% and you will see big squares.

Last edited by business_kid; 04-15-2016 at 11:29 AM.
 
Old 04-15-2016, 11:45 AM   #8
business_kid
Original Poster
I came across one possible avenue if someone is desperate to do this. I'm not; I just had the time, because I'm currently recuperating.
  1. Install Inkscape and its plethora of dependencies.
  2. Because of my original scans, I needed another stage before importing into Inkscape: I had to tweak contrast +25% in GIMP to get rid of antialiasing noise.
  3. Use pdftoppm on the PDF without changing image size, and export as PNG or TIFF.
  4. Import into Inkscape and resize up. There's a YouTube video showing how.
  5. Export as PNG and run tesseract (with Leptonica) on that.
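The steps above can be sketched as a batch loop. This is an untested sketch with placeholder file names; the GUI work in GIMP and Inkscape (steps 2 and 4) is approximated here with ImageMagick's convert, which is a substitution, not what was actually used:

```shell
# Step 3: render PDF pages to PNG without resizing
pdftoppm -r 400 -png book.pdf page

for f in page-*.png; do
  # Step 2: +25% contrast to kill antialiasing noise (ImageMagick
  # stand-in for the GIMP step)
  convert "$f" -brightness-contrast 0x25 "c-$f"
  # Step 4: scale up (stand-in for the Inkscape resize)
  convert "c-$f" -resize 200% "big-$f"
  # Step 5: OCR; tesseract writes big-page-NN.txt
  tesseract "big-$f" "big-${f%.png}"
done
```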

The results were certainly the best I achieved, although not good enough for me on that PDF. I'm not sure if these steps are batchable either. ImageMagick seems to have some bug in its SVG implementation.
 
Old 04-15-2016, 11:52 AM   #9
pan64
Thanks. I remember I had similar difficulties too. I found only one OCR program that worked, but unfortunately I've forgotten which one. I had a lot of papers (printed documents) to scan/OCR. I also remember I had to specify the language and charset to make it work.
 
Old 04-16-2016, 11:52 AM   #10
business_kid
Original Poster
I'm continuing to fiddle occasionally.

  • Importing TIFF is not great; there are too many versions of the format because everyone tweaked it. Use PNG.
  • Sizing up to 600 dpi in Inkscape gave the best results, but I don't know if I can batch that in Inkscape.
  • From tesseract, I got the aberration '0xef, 0xac, 0x81' instead of 'fi' every time (that is the UTF-8 encoding of the 'fi' ligature, U+FB01).

sed refuses to remove the carriage returns and occasional extra line feeds, and dhex refuses to display them :-(.

Code:
sed -e 's/\xef\xac\x81/fi/g' infile >outfile
removes the aberration, replacing every occurrence with 'fi', but all of the following (suggested online) do nothing about the carriage return problem.

Code:
sed -e 's/\r//g'  infile >outfile 
sed -e 's/\x0a//g'  infile >outfile 
sed -e 's/\x0d//g'  infile >outfile 
sed -e 's/\^m//g'  infile >outfile 
sed -e 's/\^$//g'  infile >outfile
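For what it's worth, the likely reason these fail is that sed delivers each line to the script with its trailing newline already stripped, so there is no \x0a in the pattern space to match (GNU sed assumed here); tr works on the raw byte stream instead:

```shell
# sed never sees the line-ending \x0a, so this prints the input unchanged:
printf 'abc\ndef\n' | sed -e 's/\x0a//g'
# tr operates on the raw stream and actually removes the newlines:
printf 'abc\ndef\n' | tr -d '\n'
# (DOS carriage returns CAN be removed by sed, since \r sits inside
# the line: sed -e 's/\r//' -- or tr -d '\r')
```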
 
Old 04-17-2016, 02:10 AM   #11
pan64
Use perl instead of sed; it is much more flexible. Use od to check the content of the file, and you can also use vi as a hex editor: http://www.kevssite.com/2009/04/21/u...-a-hex-editor/
But I don't really understand what your problem is.
 
Old 04-17-2016, 10:36 AM   #12
business_kid
Original Poster
I don't know what the problem is either.
I checked the file with xxd (found on my system via your link) and that shows the return as '0x0a', but sed doesn't shift it. In these
Code:
sed -e 's/\x0a/\x20/g' -e 's/\xef\xac\x81/fi/g' inkscape_600.txt > inkscape2.txt
sed -e 's/\0x0a/\0x20/g' -e 's/\xef\xac\x81/fi/g' inkscape_600.txt > inkscape2.txt
the second expression is applied but the first is ignored. I might use perl if I knew perl, but I don't, and I'm not going to start learning with my current low productivity.
 
Old 04-17-2016, 11:19 AM   #13
pan64
Can you post an example file to work with?
 
Old 04-18-2016, 06:40 AM   #14
business_kid
Original Poster
I had deleted a lot of temp files, but attached is tesseract text output from a 300 dpi version of my test page with contrast enhanced by hand (+25%) in GIMP. Going to 600 dpi via SVG doesn't get any better - the information isn't there in my original.


This is interesting. It seems ABBYY, the leader in Windows OCR software, does a Linux CLI version:
http://www.ocr4linux.com

There's a free trial of 100 pages, enough to test it. It's not free, or even cheap, but it may do the job. I have downloaded it, but won't report unless it is exceptionally good. My Windows experience with ABBYY (version 5.?) rated it as better than open source, but nothing to write home about.
Attached: Inkscape-003+25.txt (2.4 KB)
 
Old 04-20-2016, 12:15 PM   #15
business_kid
Original Poster
I grabbed and tried the ABBYY OCR CLI version for Linux. They give a 100-page license. There's a great variety of options & output formats, but that means a long learning curve.

It's a bitch to install on Slackware.

I made up a test page of various fonts from 14 pt down to 7 pt, and made an image of it at 300 dpi. It's good. Anything normal reads fine - down even to 7 pt. Script-style fonts do not come through at all.

My rather atrocious PDF came through fairly well with _no_ image processing. It also has an option which strips carriage returns, and another for page breaks. It does not seem to handle wildcards, which is a pain. I have a query on their Google group. If I have a justification for it, I may buy, but at $150 for 10k pages, it's steep.
 
  

