LinuxQuestions.org
Old 08-29-2013, 07:57 PM   #1
theobald7
LQ Newbie
 
Registered: Aug 2013
Posts: 1

Rep: Reputation: Disabled
pdftotext formatting issue


Greetings Linux Folk, long time lurker and answer-seeker am I, here as a first-time poster.

I've got a number of PDF files in which the information is laid out in four columns: two field names, each followed by its data field:

Code:
JOB TYPE         Primary    JOB STATUS      Active
JOB START DATE   6/8/11     JOB END DATE
The text is very well ordered throughout these files and well formatted when viewed as a PDF. I would like to parse this data into a database, so I convert the files to their text equivalents with "pdftotext -layout <filename>". Some files convert as expected, with the data following the field name on the same line. Others, however, convert to the form:

Code:
JOB TYPE                    JOB STATUS
                 Primary                    Active
JOB START DATE              JOB END DATE
                 6/8/11
Moreover, this is not consistent within a file. Some lines have the data on the same line as the field name; others are offset as shown above.

I think this may be a property of the PDF file itself, but I would like to know how to correct the conversion behavior. That would make parsing the resulting text files much easier.

Any suggestions?
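
One possible workaround, if the conversion itself can't be corrected, is to post-process the "-layout" output and re-attach each value to its field name by column position. The following is only a minimal sketch under stated assumptions: the label set is copied from the sample above and would need extending, fields are assumed to be separated by at least two spaces, an offset value is assumed to start at or to the right of the column where its label starts, and each file is assumed to hold a single record. The names LABELS, FIELD and parse_layout are made up for illustration.

```python
import re

# Assumed field names, copied from the sample in the post; extend as needed.
LABELS = {"JOB TYPE", "JOB STATUS", "JOB START DATE", "JOB END DATE"}

# A field is a run of text that contains no gap of two or more spaces.
FIELD = re.compile(r"\S+(?: \S+)*")


def parse_layout(text):
    """Map each known label to its value, whether the value sits on the
    same line as the label or is offset onto a following line."""
    record = {}
    row_labels = []  # (start column, label) pairs from the latest label line
    for line in text.splitlines():
        fields = [(m.start(), m.group()) for m in FIELD.finditer(line)]
        labels = [(col, f) for col, f in fields if f in LABELS]
        if labels:
            # A new row of labels replaces the previous one.
            row_labels = labels
            for _, name in labels:
                record.setdefault(name, "")
        for col, value in fields:
            if value in LABELS:
                continue
            # The owner is the right-most label starting at or before the value.
            owners = [name for start, name in row_labels if start <= col]
            if owners:
                record[owners[-1]] = value
    return record
```

The text to parse could come from "pdftotext -layout <filename> -", which writes to standard output. Because values are matched to labels by column rather than by line, the same function should handle both the well-formed and the offset lines, as long as the assumptions above hold for the files in question.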
 
Old 08-30-2013, 01:21 AM   #2
catkin
LQ 5k Club
 
Registered: Dec 2008
Location: Tamil Nadu, India
Distribution: Debian
Posts: 8,578
Blog Entries: 31

Rep: Reputation: 1198
You could try Apache Tika, currently at version 1.4. According to http://tika.apache.org/1.4/formats.html it supports Portable Document Format (PDF).

The easiest way to install it is to use the pre-built .jar, downloadable from http://repo2.maven.org/maven2/org/ap.../tika-app/1.4/
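
As a rough sketch of how the pre-built jar might be driven from a script: the tika-app command line accepts a "--text" option that prints plain text to standard output. The jar file name and PDF path below are placeholders, the helper names are made up for illustration, and whether Tika keeps the column layout any better than pdftotext is something you would have to check on your own files.

```python
import subprocess


def tika_text_command(jar_path, pdf_path):
    """Build the command line that asks tika-app for plain-text output."""
    return ["java", "-jar", jar_path, "--text", pdf_path]


def extract_text(jar_path, pdf_path):
    """Run tika-app and return the extracted text.

    Requires a Java runtime on the PATH; raises CalledProcessError
    if the java process exits with a non-zero status.
    """
    result = subprocess.run(
        tika_text_command(jar_path, pdf_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

For example, extract_text("tika-app-1.4.jar", "jobs.pdf") would return the text of a hypothetical jobs.pdf, assuming the jar was saved under that name in the current directory.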
 
Old 09-04-2013, 01:26 AM   #3
eklavya
Member
 
Registered: Mar 2013
Posts: 633

Rep: Reputation: 141
Have you tried online converters?
http://www.convertmypdf.net/
http://document.online-convert.com/convert-to-txt
 
  


All times are GMT -5. The time now is 10:59 PM.
