Linux - Desktop: This forum is for the discussion of all Linux software used in a desktop context.
I have been digging up some PDFs from the Smithsonian Archives, and the PDFs are sad. They have a plain background, kind of a pink/beige hybrid, which was evidently inserted at 5000 dpi. Let's take an example:
Code:
-rw-r--r-- 1 dec users 5.6M Nov 20 20:22 SMC_131_Wetmore_1956_5_1-105.pdf
-rw-r--r-- 1 dec users 197K Nov 21 12:41 SMC_131_Wetmore_1956_5_1-105.txt
When the PDF is viewed, xpdf takes 3-10 seconds per page on my sad CPU and sadder graphics, depending on zoom. I ran it through pdftotext, but the resulting (71k) file comes out like this, and some chunks are clearly missing. Is there a way of stripping the background? Pictures don't appear to be an issue.
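For what it's worth, the usual way to "strip" a uniform paper background from a scanned page is simple thresholding on the pixel values: dark pixels are ink, light pixels are paper. A minimal sketch in Python (the function name and threshold value here are my own illustration, not taken from any tool mentioned in this thread):

```python
def strip_background(pixels, threshold=128):
    """Map light (paper) pixels to pure white; keep dark (ink) pixels as-is.

    `pixels` is a flat list of 8-bit grayscale values (0 = black, 255 = white).
    The threshold of 128 is an arbitrary midpoint; a beige scan would need
    tuning, or an adaptive method such as Otsu's.
    """
    return [255 if p > threshold else p for p in pixels]

# One scanline of a beige page with some black type on it:
row = [236, 241, 12, 8, 230, 0, 245]
print(strip_background(row))  # -> [255, 255, 12, 8, 255, 0, 255]
```

On a real PDF you would first have to rasterise the pages (e.g. with pdftoppm) before any per-pixel processing, which is exactly the "specialised processing" the replies below allude to.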
That background is the actual paper of the document that was scanned in order to create the PDF, at the given scan resolution. It wasn't inserted; it's necessarily part of the same layer as the visible text on the scanned document, so it will be impossible to remove without specialised processing. I assume that a separate OCR-generated hidden text layer has been added, which is what you see when you use pdftotext.
You would be better off finding a more effective PDF-to-text converter (with good OCR), or downloading it in one of the other formats offered by the Smithsonian, e.g. Plain Text (although this just provides an OCR-generated document which evidently hasn't been checked for accuracy).
I avoid epub when I can, because Calibre really sucks, imho. The text, as someone pointed out, is unedited OCR.
I did OCR work for a time, and getting it distortion-free like that, both vertically and horizontally, is basically impossible with any equipment I had access to. I was involved in the '1641 Depositions' project: http://www.1641.tcd.ie/
You can't split OCR layers without very fancy work (as in the NASA/Planck CMB separation). I had hoped otherwise, because on my box I get the right half of the screen as background, then the left half as background, then the type. So I thought it might be separate.
I'll just minimize my use of that stuff, and suffer when I have to.
EDIT: I took a look at the Plain Text. It's just like the pdftotext output I was bellyaching about in post #1.
Last edited by business_kid; 11-22-2018 at 09:22 AM.
Indeed, but strangely enough, the text that they embedded into the PDF and the text that they provide for the Plain Text version are different versions of OCR output. Go figure.
I actually use that ancient stuff very little. Mainly I am checking references on an obscure subject, that's all. There were surprising advances in understanding ancient dead languages in the 1850s, and much was done, also on mammoth discoveries in Beringia (<70 degrees North). But they're only sideline references. I won't be repeating that every week!
I had no idea mupdf handled epub; I shoehorned that into my system a few months back. It is pretty bad, but what it does, it does quickly. If it does epub, great; I'll try it next time.
That's what intrigues me. I wanted the black text off the colour, but pdftotext can't get it. I think they typed up the text from OCR, editing as necessary, added a background, and never thought of the consequent load of displaying it. They may have done it on Macs, which are good in this area if no other. In short, I think the text is under the background, not over it. OCR is never that straight page after page; it must be typed. They should have put that type in as text, but they didn't, and apparently never checked their text files.
No, I've seen enough scans of old documents in my time: the images you see in the PDF are the scans of the original documents. They didn't add any backgrounds at all; that is the paper of the original document. However, they also appear to have passed the scans through an OCR program and embedded the results into the PDF as an invisible layer.
What I was referring to is that they have used the results of different OCR scans for the text that is embedded and the text supplied in the Plain Text format.
If what you say is correct (and I've no reason not to believe you), I would think you have to stand back and admire their OCR work. I can't spot an angle anywhere; they're all 100% straight. And the binding interferes hugely with that over 924 pages. Further, the left and right margins are perfect and identically sized. That's unique. No yellowing, no dirt, no creases in over 150 years? And no shading in the scan?
Is there a way of accessing the invisible layer?
Yes: pdftotext extracts it for you. On further reading, it's not a layer as such; it's just text that is embedded during the OCR process using a text rendering mode that leaves it effectively invisible.
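For the curious: in a PDF content stream the operator `Tr` selects the text rendering mode, and mode 3 means "neither fill nor stroke", i.e. invisible. OCR tools emit the recognised words this way, positioned over the page image, which is why selection and pdftotext work while nothing extra is drawn. A toy sketch of what one such word looks like in the stream (the helper function is hypothetical; the operators are real PDF ones):

```python
def invisible_word(x, y, word):
    """Build a minimal PDF text object that places `word` invisibly at (x, y).

    BT/ET begin and end a text object, "3 Tr" sets text rendering mode 3
    (invisible), Td moves the text position, and Tj shows the string.
    """
    return f"BT 3 Tr {x} {y} Td ({word}) Tj ET"

print(invisible_word(72, 700, "Wetmore"))
# -> BT 3 Tr 72 700 Td (Wetmore) Tj ET
```

A real content stream would also need a Tf font selection and proper string escaping; this only shows the shape of the trick.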
Why don't you write to the Smithsonian and ask them? I for one would be interested in the answer.
The weird thing is that in the PDF, I can select the text and it gets copied to the clipboard.
So the embedded text also contains information about its position; I would call that a hidden layer. I am currently trying to convert just 10 pages to HTML to see what it consists of in the end. It's taking a long time. Failed.
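If the goal is to see those positions, poppler's `pdftotext -bbox file.pdf` writes an XHTML file in which every word carries its bounding box. A quick sketch of pulling the words back out (the sample string below is invented, but real `-bbox` output uses the same `<word>` element shape):

```python
import re

# Invented one-word sample in the shape of pdftotext -bbox output:
sample = ('<word xMin="72.0" yMin="700.1" xMax="110.2" yMax="712.0">'
          'Wetmore</word>')

# Each word element carries its page coordinates -- the "hidden layer".
words = re.findall(r'<word\b[^>]*>([^<]*)</word>', sample)
print(words)  # -> ['Wetmore']
```

A proper XML parser would be the robust choice on a full page of output; the regex is just for a quick look.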