LinuxQuestions.org
Old 11-21-2018, 07:08 AM   #1
business_kid
LQ Guru
 
Registered: Jan 2006
Location: Ireland
Distribution: Slackware, Slarm64 & Android
Posts: 16,144

Rep: Reputation: 2308
PDF Re-mangling Query


I have been digging some PDFs out of the Smithsonian Archives, and the PDFs are sad. They have a plain background, kind of a pink/beige hybrid, which was evidently inserted at 5000 dpi. Let's take an example:
Code:
-rw-r--r-- 1 dec users 5.6M Nov 20 20:22 SMC_131_Wetmore_1956_5_1-105.pdf
-rw-r--r-- 1 dec users 197K Nov 21 12:41 SMC_131_Wetmore_1956_5_1-105.txt
When the PDF is viewed, xpdf (a ghostscript thing) takes 3-10 seconds per page on my sad CPU and sadder graphics, depending on zoom. I ran it through pdftotext, but

the

resulting

(71k)

file comes

out

like this.

And some chunks are clearly missing. Is there a way of stripping the background? Pictures don't appear to be an issue.
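
For the one-word-per-line output described above, a rough cleanup pass can at least make the text readable again. This is only a sketch: the function name `reflow` and the rule that a run of three or more newlines marks a real paragraph break are assumptions for illustration, not something pdftotext documents. It cannot recover chunks that the OCR never captured.

```python
import re

def reflow(text: str, para_gap: int = 3) -> str:
    """Join fragmented lines into paragraphs.

    Assumption: a real paragraph break shows up as at least `para_gap`
    consecutive newlines; anything shorter is just a broken line.
    """
    # Strip trailing spaces so "blank" lines containing spaces count as blank.
    text = "\n".join(line.rstrip() for line in text.splitlines())
    # Split into paragraphs on long runs of newlines.
    paragraphs = re.split(r"\n{%d,}" % para_gap, text.strip())
    # Inside a paragraph, collapse every whitespace run into a single space.
    return "\n\n".join(" ".join(p.split()) for p in paragraphs if p.strip())
```

Typical use would be to read the .txt file, pass its contents through `reflow`, and write the result to a new file, then adjust `para_gap` if paragraphs end up merged or split wrongly.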
 
Old 11-21-2018, 08:52 AM   #2
hydrurga
LQ Guru
 
Registered: Nov 2008
Location: Pictland
Distribution: Linux Mint 21 MATE
Posts: 8,048
Blog Entries: 5

Rep: Reputation: 2925
Can you link to a particular example pdf from the Smithsonian so that I can test it?
 
Old 11-21-2018, 12:12 PM   #3
business_kid
LQ Guru
 
Registered: Jan 2006
Location: Ireland
Distribution: Slackware, Slarm64 & Android
Posts: 16,144

Original Poster
Rep: Reputation: 2308
https://library.si.edu/digital-libra...bo18551856smit
 
Old 11-21-2018, 12:44 PM   #4
hydrurga
LQ Guru
 
Registered: Nov 2008
Location: Pictland
Distribution: Linux Mint 21 MATE
Posts: 8,048
Blog Entries: 5

Rep: Reputation: 2925
That background is the actual paper of the document that was scanned in order to create the PDF, at the given scan resolution. It wasn't inserted; it's necessarily part of the same layer as the visible text on the scanned document, so it will be impossible to remove without specialised processing. I assume that a separate OCR-generated hidden text layer has been added, which is what you see when you use pdftotext.

You would be better off finding a more effective PDF->Text translator (with good OCR) or downloading it in one of the other formats offered by the Smithsonian, e.g. Plain Text (although this just provides an OCR-generated document which evidently hasn't been checked for accuracy).
 
Old 11-21-2018, 01:27 PM   #5
ondoho
LQ Addict
 
Registered: Dec 2013
Posts: 19,872
Blog Entries: 12

Rep: Reputation: 6053
i told you before, those downloads are available in various formats.
have you tried the epub format instead?
or even plain text?
 
Old 11-22-2018, 09:15 AM   #6
business_kid
LQ Guru
 
Registered: Jan 2006
Location: Ireland
Distribution: Slackware, Slarm64 & Android
Posts: 16,144

Original Poster
Rep: Reputation: 2308
I avoid epub when I can, because calibre really sucks, imho. Text, as someone pointed out, is unedited OCR.

I did OCR work for a time, and getting it distortion-free like that, and perfectly vertical/horizontal, is basically impossible with any equipment I had access to. I was involved in the '1641 Depositions' project http://www.1641.tcd.ie/
You can't split OCR layers without very fancy work (as in the NASA/Planck CMB). I had hoped otherwise, because on my box I get the right half of the screen as background, then the left half as background, then the type. So I thought it might be separate.

I'll just minimize my use of that stuff, and suffer when I have to.

EDIT: I took a look at the Plain Text. It's just like the pdftotext output I was bellyaching about in post #1

Last edited by business_kid; 11-22-2018 at 09:22 AM.
 
Old 11-22-2018, 09:51 AM   #7
hydrurga
LQ Guru
 
Registered: Nov 2008
Location: Pictland
Distribution: Linux Mint 21 MATE
Posts: 8,048
Blog Entries: 5

Rep: Reputation: 2925
Quote:
Originally Posted by business_kid
I avoid epub when I can, because calibre really sucks, imho. Text, as someone pointed out, is unedited OCR.

I did OCR work for a time, and getting it distortion-free like that, and perfectly vertical/horizontal, is basically impossible with any equipment I had access to. I was involved in the '1641 Depositions' project http://www.1641.tcd.ie/
You can't split OCR layers without very fancy work (as in the NASA/Planck CMB). I had hoped otherwise, because on my box I get the right half of the screen as background, then the left half as background, then the type. So I thought it might be separate.

I'll just minimize my use of that stuff, and suffer when I have to.

EDIT: I took a look at the Plain Text. It's just like the pdftotext output I was bellyaching about in post #1
Indeed, but strangely enough, the text that they embedded into the PDF and the text that they provide for the Plain Text version are different versions of OCR output. Go figure.
 
Old 11-22-2018, 02:06 PM   #8
ondoho
LQ Addict
 
Registered: Dec 2013
Posts: 19,872
Blog Entries: 12

Rep: Reputation: 6053
Quote:
Originally Posted by business_kid
I avoid epub when I can, because calibre really sucks
i can look at .epub with e.g. mupdf.
 
Old 11-23-2018, 05:27 AM   #9
business_kid
LQ Guru
 
Registered: Jan 2006
Location: Ireland
Distribution: Slackware, Slarm64 & Android
Posts: 16,144

Original Poster
Rep: Reputation: 2308
I actually use that ancient stuff very little. Mainly I am checking references on an obscure subject, that's all. There were surprising advances in understanding ancient dead languages in the 1850s and much was done, also on mammoth discoveries in Beringia (<70 degrees North). But they're only sideline references. I won't be repeating that every week!

I had no idea mupdf handled epub, and I shoehorned that into my system a few months back. It is pretty bad, but what it does, it does quickly. If it does epub, great. I'll try it next time.
 
Old 11-23-2018, 05:47 AM   #10
business_kid
LQ Guru
 
Registered: Jan 2006
Location: Ireland
Distribution: Slackware, Slarm64 & Android
Posts: 16,144

Original Poster
Rep: Reputation: 2308
Quote:
Originally Posted by hydrurga
Indeed, but strangely enough, the text that they embedded into the PDF and the text that they provide for the Plain Text version are different versions of OCR output. Go figure.
That's what intrigues me. I wanted the black text off the colour, but pdftotext using ghostscript can't get it. I think they typed the text from OCR, editing as necessary, added a background, and never thought of the consequent load of displaying it. They may have done it on Macs, which are good in this area if no other. In short, I think the text is under the background, not over it. OCR is never that straight page after page. It must be typed. They should have put that type in as text, but they didn't, and apparently never checked their text files.
 
Old 11-23-2018, 08:56 AM   #11
hydrurga
LQ Guru
 
Registered: Nov 2008
Location: Pictland
Distribution: Linux Mint 21 MATE
Posts: 8,048
Blog Entries: 5

Rep: Reputation: 2925
Quote:
Originally Posted by business_kid
That's what intrigues me. I wanted the black text off the colour, but pdftotext using ghostscript can't get it. I think they typed the text from OCR, editing as necessary, added a background, and never thought of the consequent load of displaying it. They may have done it on Macs, which are good in this area if no other. In short, I think the text is under the background, not over it. OCR is never that straight page after page. It must be typed. They should have put that type in as text, but they didn't, and apparently never checked their text files.
No, I've seen enough scans of old documents in my time: the images you see in the PDF are the scans of the original documents. They didn't add any backgrounds at all; that is the paper belonging to the original document. However, they also appear to have passed the scans through an OCR program and embedded the results into the PDF as an invisible layer.

What I was referring to is that they have used the results of different OCR scans for the text that is embedded and the text supplied in the Plain Text format.
 
Old 11-24-2018, 05:40 AM   #12
business_kid
LQ Guru
 
Registered: Jan 2006
Location: Ireland
Distribution: Slackware, Slarm64 & Android
Posts: 16,144

Original Poster
Rep: Reputation: 2308
If what you say is correct (and I've no reason not to believe you) I would think
  1. You have to stand back and admire their OCR work. I'd expect to spot an angle, but they're all 100% straight. And the binding interferes hugely with that over 924 pages. Further, the margins left and right are perfect and identically sized. That's unique. No yellowing, no dirt, no creases in over 150 years?? And no shading in the scan?
  2. Is there a way of accessing the invisible layer?
 
Old 11-24-2018, 07:36 AM   #13
hydrurga
LQ Guru
 
Registered: Nov 2008
Location: Pictland
Distribution: Linux Mint 21 MATE
Posts: 8,048
Blog Entries: 5

Rep: Reputation: 2925
Quote:
Originally Posted by business_kid
If what you say is correct (and I've no reason not to believe you) I would think
Is there a way of accessing the invisible layer?
Yes. pdftotext extracts it for you. On further reading, it's not a layer as such, it's just text that is embedded during the OCR scan process using a text rendering mode that leaves it effectively invisible.
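
To illustrate the "effectively invisible" part: in the PDF format, the `Tr` operator sets the text rendering mode, and mode 3 means the glyphs are neither filled nor stroked, so the text is there but never painted. A minimal sketch for spotting that in an uncompressed page content stream follows. The function names are made up for this example, and the naive regex is an assumption that would give false positives if the sequence "3 Tr" appeared inside a string literal.

```python
import re

# Matches the text-rendering-mode operator, e.g. b"3 Tr".
# The lookbehind stops "13 Tr" or "1.3 Tr" from matching on the final digit.
TR_OPERATOR = re.compile(rb"(?<![\w.])([0-7])\s+Tr\b")

def rendering_modes(content_stream: bytes) -> list:
    """Return every text rendering mode set in a decoded content stream."""
    return [int(m.group(1)) for m in TR_OPERATOR.finditer(content_stream)]

def has_invisible_text(content_stream: bytes) -> bool:
    """True if the stream switches to mode 3 (neither fill nor stroke)."""
    return 3 in rendering_modes(content_stream)
```

Real files normally have compressed streams, so the bytes would have to be decompressed first, for example by rewriting the file with a tool such as `qpdf --qdf` before reading it.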

Last edited by hydrurga; 11-24-2018 at 07:37 AM.
 
Old 11-24-2018, 07:39 AM   #14
hydrurga
LQ Guru
 
Registered: Nov 2008
Location: Pictland
Distribution: Linux Mint 21 MATE
Posts: 8,048
Blog Entries: 5

Rep: Reputation: 2925
Quote:
Originally Posted by business_kid
You have to stand back and admire their OCR work. I'd expect to spot an angle, but they're all 100% straight. And the binding interferes hugely with that over 924 pages. Further, the margins left and right are perfect and identically sized. That's unique. No yellowing, no dirt, no creases in over 150 years?? And no shading in the scan?
Why don't you write to the Smithsonian and ask them? I for one would be interested in the answer.
 
Old 11-25-2018, 02:53 AM   #15
ondoho
LQ Addict
 
Registered: Dec 2013
Posts: 19,872
Blog Entries: 12

Rep: Reputation: 6053
Quote:
Originally Posted by hydrurga
Yes. pdftotext extracts it for you. On further reading, it's not a layer as such, it's just text that is embedded during the OCR scan process using a text rendering mode that leaves it effectively invisible.
the weird thing is that in the PDF, i can select the text and it gets copied to the clipboard.
so the embedded text also contains information about its position; i would call that a hidden layer.
i am currently trying to convert just 10 pages to HTML, see what it consists of in the end. it's taking a long time. failed.
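
A lighter way to look at the positions than a full HTML conversion might be the `-bbox` option of poppler's pdftotext, which writes one `<word>` element per word with its bounding box. The sketch below parses that kind of output. The attribute order (xMin, yMin, xMax, yMax) and the sample values are assumptions based on typical poppler output, and `Word` and `parse_bbox` are names invented for this example.

```python
import html
import re
from typing import NamedTuple

class Word(NamedTuple):
    text: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

# One <word> element per word, with its bounding box in page coordinates.
WORD_RE = re.compile(
    r'<word\s+xMin="(-?[\d.]+)"\s+yMin="(-?[\d.]+)"\s+'
    r'xMax="(-?[\d.]+)"\s+yMax="(-?[\d.]+)">(.*?)</word>'
)

def parse_bbox(xhtml: str) -> list:
    """Extract words and their positions from pdftotext -bbox style output."""
    words = []
    for x0, y0, x1, y1, text in WORD_RE.findall(xhtml):
        # Entities such as &amp; are escaped in the XHTML, so undo that.
        words.append(Word(html.unescape(text),
                          float(x0), float(y0), float(x1), float(y1)))
    return words
```

Limiting the run to a few pages with the `-f` and `-l` options of pdftotext should keep it quick; the resulting coordinates show where each hidden word sits relative to the scanned image.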

Last edited by ondoho; 11-25-2018 at 03:12 AM.
 
  


All times are GMT -5.