LinuxQuestions.org
Linux - Newbie This Linux forum is for members that are new to Linux.
Old 06-16-2020, 04:40 AM   #1
fdq09eca
LQ Newbie
 
Registered: Jun 2020
Posts: 2

Rep: Reputation: Disabled
pdfgrep vertical text


Hi, I am very new to Linux. Please bear with me if I am asking a stupid question.

I have a bunch of PDFs that may not follow a consistent format (randomly downloaded from Google Scholar). I am trying to check their data sources, so I used pdfgrep for the task.

It was successful most of the time, until I found out that some tables are placed vertically. I attempted to rotate those pages before grepping them:
Code:
pdftk pdfs/25770596.pdf cat 8east output pdfs/25770596_r.pdf
but it makes no difference.

Then I tried

Code:
pdftotext -f 1 -l 8 pdfs/25770596_r.pdf pdfs/25770596r_txt.txt
which actually served my purpose. It extracted some caption lines from the vertical page, but the vertical table is messed up: the numbers are scrambled and some of them are missing.

Is there a more elegant way to complete this task?

The .pdf is here.

Thank you.
 
Old 06-16-2020, 09:19 AM   #2
remmilou
Member
 
Registered: Mar 2010
Location: Amsterdam
Distribution: MX Linux (21)/ XFCE
Posts: 212

Rep: Reputation: 69
Hi fdq09eca,

The link to your PDF gives a preview only and requires a login for the full PDF.
If you can (legally) give me access to the full PDF, I'm willing to do some testing for you with a completely different tool.
I'm talking about Apache Tika, which is open source under the Apache license. I used it heavily in forensics.
Of course you can download it (https://tika.apache.org/download.html) and try it yourself.
It was developed to extract text and/or metadata from many file types, including PDF. It can output txt, html, xml and json.
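
For anyone who wants to try this, here is a minimal sketch of searching a PDF through the Tika command-line app. It assumes Java is installed and that the downloaded jar is saved as tika-app.jar (the real file name includes a version number); TIKA_JAR and tika_grep are names made up for this example. Whether Tika copes with the rotated table any better than pdftotext is untested here.

```shell
#!/bin/sh
# Assumed location of the Tika app jar; override with: TIKA_JAR=/path/to/jar
TIKA_JAR=${TIKA_JAR:-tika-app.jar}

# tika_grep PATTERN FILE
# Extract plain text with Tika and search it case-insensitively.
# Matches are printed with the line number of the extracted text.
tika_grep() {
    pattern=$1
    file=$2
    # --text makes the Tika app print the plain-text content to stdout
    java -jar "$TIKA_JAR" --text "$file" | grep -n -i -e "$pattern"
}
```

Example call: tika_grep 'data' pdfs/25770596.pdf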
 
Old 06-16-2020, 10:19 AM   #3
shruggy
Senior Member
 
Registered: Mar 2020
Posts: 3,670

Rep: Reputation: Disabled
This is an article from 1995. For example, here the text layer has only the title. If the version at JSTOR was scanned from paper and then OCRed, the text you get from it is only as good as the OCR layer saved in the PDF. Rotating the pages with pdftk, mutool, etc. won't change anything if the tables weren't OCRed properly in the first place.
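
A quick way to see how much text the embedded layer actually holds is to count the words pdftotext returns for each page. The sketch below assumes pdftotext from poppler-utils is installed; page_words is a helper name made up for this example. A page that reports only a handful of words is probably a scanned image with little or no OCR text behind it.

```shell
#!/bin/sh
# page_words FILE FIRST LAST
# Print the number of words in the text layer of each page FIRST..LAST.
page_words() {
    file=$1
    page=$2
    last=$3
    while [ "$page" -le "$last" ]; do
        # "-" sends the extracted text to stdout instead of a file;
        # tr strips the padding some wc implementations add
        words=$(pdftotext -f "$page" -l "$page" "$file" - 2>/dev/null | wc -w | tr -d ' ')
        printf 'page %d: %d words\n' "$page" "$words"
        page=$((page + 1))
    done
}
```

Example call: page_words pdfs/25770596.pdf 1 8
If the table page comes back nearly empty, no rotation or grep option will help; the page would have to be OCRed again, for instance with a tool such as OCRmyPDF.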
 
All times are GMT -5.
