
business_kid 04-12-2016 11:44 AM

Modifying Antialiasing
 
2 Attachment(s)
Normally, antialiasing is a good thing for the human eye, but not as good for OCR. In the attached OCR.png, if you zoom to 400% or more the letters lose all form and, not surprisingly, the results (Results.txt) are crap.

The text was extracted from a PDF at 400 dpi using pdftoppm with TIFF output. I tried all the various combinations of antialiasing, aaVector, & freetype (on or off), but it made no difference. Pdfimages is worse. No 'ImageMagick convert' option permutation came good for me, and gs is its usual intransigent self.
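
For reference, the extraction was along these lines (a sketch; book.pdf stands in for the actual file name):

Code:

# render the PDF to 400 dpi TIFF pages, one file per page
pdftoppm -r 400 -tiff book.pdf page
# the antialiasing switches were toggled in various combinations with no visible effect
pdftoppm -r 400 -tiff -aa no -aaVector no -freetype no book.pdf page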

Has anyone a good image processing treatment for getting better formed characters to give the OCR a fighting chance?

pan64 04-13-2016 05:38 AM

I would use a much higher resolution (instead of 400 dpi).

business_kid 04-13-2016 10:19 AM

Thanks. How high?

Disk space is an issue. That book is 601 pages. Imagemagick wants to process all 601 pages together, and the original is only 200-300 dpi at most.

pan64 04-13-2016 11:17 AM

I have no idea. How much disk space do you have, how much was used with 400 dpi? You can also try black&white (or grayscale) conversion, which will lower the required disk space.
Try 600 dpi, and if that is not good enough, increase it further.

pdftoppm can handle page ranges, so you can use -f and -l to test on just a few pages first.
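
Something like this (file names are placeholders) would render just a few test pages in grayscale at 600 dpi without eating much disk:

Code:

# only pages 10-12, grayscale, 600 dpi - enough to judge the OCR quality
pdftoppm -f 10 -l 12 -r 600 -gray -tiff book.pdf test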

business_kid 04-14-2016 05:47 AM

3 Attachment(s)
I don't think that's the answer, or that there is one. So I've marked this solved for the moment. If I get a good idea, I'll come up with a better question.

I did some of this as work in 2008/9 using ABBYY FineReader & Adobe Acrobat. You can see some of my work now deeply buried on www.eneclann.ie. 300 dpi was usually OK for the original scans, 400 dpi if the print was small; 200 dpi was a no-go. For ancient handwritten stuff, 400-600 dpi was normal. I never used 800 dpi.

Scans went to JPEGs, which were OCR'ed and then PDF'ed. The OCR text was put under the image, so that the PDF was searchable but the text layer wasn't visible. The OCR text was not edited to correct even obvious mistakes. Now here's the killer: Adobe anything is a Windows memory hog, so the PDFs had to be kept small by reducing the image resolution so Windows wouldn't crash. 200 dpi was usually the max you could put out on any large book.
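
As an aside, tesseract (3.03 or later) can reproduce that text-under-image trick directly; a minimal sketch, assuming a page image called page.png:

Code:

# writes page.pdf: the original image with an invisible, searchable text layer under it
tesseract page.png page -l eng pdf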

In the attached images, I used 800 dpi in pdftoppm and thresholded at various levels in GIMP. It looks like characters from a Sinclair ZX81. Running pdftoppm at a higher resolution will just produce bigger crap.
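
For anyone who wants to script the thresholding instead of doing it by hand in GIMP, ImageMagick can do roughly the same thing; the 60% below is only an illustration and needs tuning per scan:

Code:

# convert to grayscale and hard-threshold to black & white
convert page-01.tif -colorspace Gray -threshold 60% page-01-bw.png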

So the magic routine to add back resolution in a fashion that suits OCR has yet to be written. Antialiasing, which adds shades of grey in odd places for our eyes to merge, doesn't help. When it's in the original, you can't rescue it without embedding the exact font in a rescue routine.

The immediate solution for me is to buy paper, and a scanner to OCR it if I can't read the paper. There are some reference books I will need to have on hand in the coming years.

EDIT: Forgot to mention: all the 800 dpi output (original & all thresholds) was pure crap.

pan64 04-14-2016 06:07 AM

Looks like the original letters are too small. You need to enlarge them (optically). Probably.

business_kid 04-15-2016 02:28 AM

As someone who has done a fair bit of OCR, that's fairly standard-sized book print. You often get smaller. It's the low resolution of the original scan (150 dpi) that's the killer.

I also know that to get readable books you need the best OCR followed by a massive spell-correction pass with the images on hand. If I buy another PC, I will equip it for OCR: a minimum of 4 cores, masses of RAM, 2 large hard disks, and an edge scanner designed for scanning books.

EDIT: To make matters worse, that PDF is 150 dpi in JPEG, which is a lossy compression format, so that loses more information. Zoom any JPEG to 400% or 800% and you will see big squares.

business_kid 04-15-2016 11:45 AM

I came across one possible avenue if someone is desperate to do this. I'm not. I just had the time, because I'm currently recuperating.
  1. Install Inkscape and its plethora of dependencies.
  2. Because of my original scans, I needed another stage before importing into Inkscape: I had to tweak contrast +25% in GIMP to get rid of antialiasing noise.
  3. Use pdftoppm on the PDF without changing the image size, and export as PNG or TIFF.
  4. Import into Inkscape, & resize up. There's a YouTube video showing how.
  5. Export as PNG and use tesseract (with Leptonica) on that.

Results were certainly the best I achieved, although not good enough for me on that PDF. I'm not sure if they're batchable either. ImageMagick seems to have some bug in its SVG implementation.
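
If anyone wants to try batching it, something along these lines should work; note the Inkscape resize was done by hand, so ImageMagick's -resize stands in for it here (a substitution, not the same operation):

Code:

# rough per-page batch sketch (file names are placeholders)
pdftoppm -r 300 -png book.pdf page
for f in page-*.png; do
    # +25% contrast, then scale up; a crude stand-in for the GIMP + Inkscape steps
    convert "$f" -brightness-contrast 0x25 -resize 200% "${f%.png}-big.png"
    tesseract "${f%.png}-big.png" "${f%.png}" -l eng
done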

pan64 04-15-2016 11:52 AM

Thanks. I remember I had similar difficulties too. I found only one OCR program that worked, but unfortunately I have forgotten which one. I had a lot of papers (printed documents) to scan/OCR. I also remember I had to specify the language and charset to make it work.

business_kid 04-16-2016 11:52 AM

I'm continuing to fiddle occasionally.

  • Importing TIFF is not great, because there are too many versions of the format (everyone tweaked it). Use PNG.
  • Sizing up to 600 dpi in Inkscape gave the best results, but I don't know if I can batch that in Inkscape.
  • From tesseract, I got the aberration '0xef, 0xac, 0x81' (the UTF-8 bytes for U+FB01, the 'fi' ligature) instead of 'fi' every time.

sed refuses to remove carriage returns and the occasional extra line feed, and dhex refuses to display them. :-(

Code:

sed -e 's/\xef\xac\x81/fi/g' infile >outfile
loses the aberration and replaces all of them with 'fi', but all of the following (suggested online) do nothing about the carriage return problem.

Code:

sed -e 's/\r//g'  infile >outfile
sed -e 's/\x0a//g'  infile >outfile
sed -e 's/\x0d//g'  infile >outfile
sed -e 's/\^m//g'  infile >outfile
sed -e 's/\^$//g'  infile >outfile


pan64 04-17-2016 02:10 AM

Use perl instead of sed; it is much more flexible. Use od to check the content of the file, and you may also try vi as a hex editor: http://www.kevssite.com/2009/04/21/u...-a-hex-editor/
But I do not really understand what your problem is.
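
By "use perl" I mean something like slurping the whole file so the newlines are actually visible to the substitution; a sketch mirroring the sed attempts above:

Code:

# -0777 reads the whole file at once, so \r and \n can be matched and removed
perl -0777 -pe 's/\xef\xac\x81/fi/g; s/\r//g' infile > outfile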

business_kid 04-17-2016 10:36 AM

I don't know what the problem is either.
I checked the file with xxd (found on my system from your link) and that shows the return as '0x0a', but sed doesn't shift it. In these
Code:

sed -e 's/\x0a/\x20/g' -e 's/\xef\xac\x81/fi/g' inkscape_600.txt > inkscape2.txt
sed -e 's/\0x0a/\0x20/g' -e 's/\xef\xac\x81/fi/g' inkscape_600.txt > inkscape2.txt

the second expression takes effect but the first is ignored. I might use perl if I knew perl, but I don't. And I'm not going to start learning with my current low productivity.
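
For what it's worth, the likely reason the \x0a expression does nothing is that sed reads its input a line at a time and strips the trailing newline before running the script, so there is never a newline in the pattern space to match. Something that sees the raw byte stream (tr, or GNU sed's -z mode) can handle it; a sketch:

Code:

# tr works on the byte stream, so it can delete or replace line endings directly
tr '\n' ' ' < infile > outfile     # replace line feeds with spaces
tr -d '\r' < infile > outfile      # or: drop carriage returns, if any exist
# GNU sed can do it too if told to split records on NUL instead of newline
sed -z -e 's/\n/ /g' infile > outfile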

pan64 04-17-2016 11:19 AM

Can you post an example file to work with?

business_kid 04-18-2016 06:40 AM

1 Attachment(s)
I had deleted a lot of temp files, but attached is a tesseract text output from a 300 dpi version of my test page, with contrast enhanced by hand (+25%) in GIMP. Going to 600 dpi via SVG doesn't get any better - the information isn't there in my original.


This is interesting. It seems ABBYY, the leader in Windows OCR software, does a Linux CLI version:
http://www.ocr4linux.com

There's a free trial of 100 pages, enough to test it. It's not free, or even cheap, but it may do the job. I have downloaded it, but won't report unless it is exceptionally good. My Windows experience with ABBYY (version 5.?) rated it as better than open source, but nothing to write home about.

business_kid 04-20-2016 12:15 PM

I grabbed & tried the ABBYY OCR CLI version for Linux. They give a 100-page license. Great variety of options & output formats, but it means a long learning curve.

It's a bitch to install on Slackware.

I made up a test page of various fonts from 14 pt down to 7 pt, and made an image of it at 300 dpi. It's good. Anything normal reads fine - down even to 7 pt. Script-style fonts do not come through at all.

My rather atrocious PDF came through fairly well with _no_ image processing. It also has an option that strips carriage returns, and another for page breaks. It does not seem to handle wildcards, which is a pain; I have a query in on their Google group. If I have a justification for it, I may buy, but at $150 for 10k pages it's steep.

