LinuxQuestions.org
11-14-2012, 06:15 PM   #1
RandomTroll
Senior Member
 
Registered: Mar 2010
Distribution: Slackware
Posts: 1,970

What's a good way to parse a PDF?


I do the books for my condominium association. The city/county combines billing for water, sewage, trash hauling, and recycling, and makes the bills available only as PDFs. For years I converted them to text files, extracted the data, and inserted it into a spreadsheet with a script. A year ago they started writing their PDFs in new formats that aren't predictable. The PDFs look the same, but the amounts for the billed items aren't in the same spot every month; sometimes the last item on the bill has its name on the last page of the PDF (and of the text conversion) but its amount on the first page. I assume they've gone to a columnar format that the conversion doesn't get 'right'. I've tried conversions from Ghostscript, Xpdf, OpenOffice, and Acrobat. Does anyone have another suggestion?
 
11-15-2012, 04:52 AM   #2
Guttorm
Senior Member
 
Registered: Dec 2003
Location: Trondheim, Norway
Distribution: Debian and Ubuntu
Posts: 1,453

Hi

The utility "pdftotext" has two options you could try: -raw and -layout (in addition to the default mode, which usually works best). Try these options and see whether the output is more predictable. Or maybe you can find a pattern, so you can use tools like grep to find the things you need.

On Debian systems it's in the package "poppler-utils". I don't know about Slackware; maybe you'll find it if you search for poppler.
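
A minimal sketch of the two suggestions above (try a different pdftotext mode, then match a pattern), assuming Python 3.7+ and that pdftotext is on the PATH. The line pattern (a label followed by an amount with two decimals) and the names pdf_to_text, find_amounts, and AMOUNT_LINE are hypothetical; a real bill will almost certainly need its own pattern.

```python
import re
import subprocess

def pdf_to_text(pdf_path, mode="-raw"):
    """Run pdftotext and return the extracted text.

    mode is "-raw", "-layout", or None for the default mode.
    """
    cmd = ["pdftotext"]
    if mode:
        cmd.append(mode)
    cmd += [pdf_path, "-"]  # "-" as the output file sends the text to stdout
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

# Assumed line shape: "<label> <amount>", e.g. "Water 45.10" or "Trash Hauling $1,234.56"
AMOUNT_LINE = re.compile(r"^(?P<label>.+?)\s+\$?(?P<amount>\d[\d,]*\.\d{2})\s*$")

def find_amounts(text):
    """Return {label: amount} for every line that matches AMOUNT_LINE."""
    items = {}
    for line in text.splitlines():
        m = AMOUNT_LINE.match(line.strip())
        if m:
            items[m.group("label")] = float(m.group("amount").replace(",", ""))
    return items
```

Something like find_amounts(pdf_to_text("bill.pdf")) would then give the matched items, where bill.pdf is a placeholder name. Comparing the output of the three modes on the same bill is probably the quickest way to see which one keeps each name next to its amount.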
 
11-16-2012, 12:37 PM   #3
RandomTroll
Senior Member
 
Registered: Mar 2010
Distribution: Slackware
Posts: 1,970

Original Poster
I was using pdftotext. I hadn't ever read its man page. -raw does the job.

Thanks.
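
For the spreadsheet step mentioned in the first post, a rough sketch of writing one month's amounts as a CSV row that a spreadsheet can import. The original script isn't shown in the thread, so the function name append_bill_row, the column names, and the amounts below are made up for illustration.

```python
import csv
import io

def append_bill_row(out, month, items, columns):
    """Write one CSV row: the month, then the amount for each column.

    items maps item name -> amount; a column missing from items is left blank.
    """
    writer = csv.writer(out)
    writer.writerow([month] + [items.get(name, "") for name in columns])

# Usage with an in-memory buffer; a real script would open a file in append mode.
buf = io.StringIO()
append_bill_row(buf, "2012-11", {"Water": 45.10, "Sewage": 30.00},
                ["Water", "Sewage", "Trash"])
```

Leaving a missing item blank instead of writing 0 makes it easier to spot a month where the extraction missed an amount.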
 
  

