LinuxQuestions.org
11-14-2012, 06:15 PM   #1
RandomTroll
Member
 
Registered: Mar 2010
Distribution: Slackware
Posts: 268

What's a good way to parse a PDF?


I do the books for my condominium association. The city and county bill jointly for water, sewage, trash hauling, and recycling, and they make the bills available only as PDFs. For years I converted them to text files, extracted the data, and inserted it into a spreadsheet with a script. A year ago they started using new formats to write their PDFs, ones that aren't predictable. The PDFs look the same, but the amounts associated with the billed items aren't in the same spot every month; in fact, sometimes the last item on the bill has its name on the last page of the PDF (and of the text conversion) but its amount on the first page. I assume they've gone to a columnar format that the conversion doesn't get 'right'. I've tried conversions with Ghostscript, xpdf, OpenOffice, and Acrobat. Does anyone have another suggestion?
 
11-15-2012, 04:52 AM   #2
Guttorm
Senior Member
 
Registered: Dec 2003
Location: Trondheim, Norway
Distribution: Debian and Ubuntu
Posts: 1,135

Hi

The utility "pdftotext" has two options you could try: -raw (keep the text in content stream order) and -layout (keep the original physical layout as closely as possible). These are in addition to the default mode, which usually works best. Maybe try these options and see if the output is more predictable? Or maybe you can find a pattern so you can use tools like grep to find the things you need?

On Debian systems it's in the package "poppler-utils". I don't know about Slackware; maybe you can find it if you search for poppler?
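
A minimal Python sketch of the comparison suggested above: run pdftotext in its default, -raw, and -layout modes and keep only the lines that match a pattern, grep-style. The helper names (`pdftotext_command`, `extract_text`, `grep_lines`) and the amount pattern are illustrative assumptions, not part of poppler; only the `-raw` and `-layout` flags and the `-` output argument (write the text to stdout) come from pdftotext itself.

```python
import re
import subprocess

# Flags for the three extraction modes mentioned in this post.
MODE_FLAGS = {"default": [], "raw": ["-raw"], "layout": ["-layout"]}


def pdftotext_command(pdf_path, mode="default"):
    """Build the pdftotext command line for one extraction mode."""
    # A text-file argument of "-" makes pdftotext write to stdout.
    return ["pdftotext", *MODE_FLAGS[mode], pdf_path, "-"]


def extract_text(pdf_path, mode="default"):
    """Run pdftotext (must be installed) and return the extracted text."""
    result = subprocess.run(
        pdftotext_command(pdf_path, mode),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def grep_lines(text, pattern):
    """Return the lines of text that match a regular expression."""
    rx = re.compile(pattern)
    return [line for line in text.splitlines() if rx.search(line)]
```

With a bill saved as `bill.pdf` (a hypothetical file name), `grep_lines(extract_text("bill.pdf", "raw"), r"\d+\.\d{2}")` would list the lines containing something that looks like a dollar amount, so the output of the three modes can be compared side by side.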
 
11-16-2012, 12:37 PM   #3
RandomTroll
Member
 
Registered: Mar 2010
Distribution: Slackware
Posts: 268

Original Poster
I was using pdftotext. I hadn't ever read its man page. -raw does the job.

Thanks.
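
For anyone repeating this, here is a rough Python sketch of the extract-and-insert step once `pdftotext -raw` gives stable output. It assumes each charge appears on one line as a label followed by an amount; the label strings in `ITEMS` are hypothetical (the thread does not show the bill's actual wording), and `parse_charges` and `to_csv_row` are illustrative names.

```python
import csv
import io
import re
from decimal import Decimal

# Hypothetical labels; adjust to the wording on the real bill.
ITEMS = ["Water", "Sewage", "Trash", "Recycling"]

# An amount at the end of a line, e.g. "45.10" or "$1,234.50".
AMOUNT = re.compile(r"\$?\s*(\d{1,3}(?:,\d{3})*\.\d{2})\s*$")


def parse_charges(raw_text, items=ITEMS):
    """Map each item label to the amount at the end of its line."""
    charges = {}
    for line in raw_text.splitlines():
        stripped = line.strip()
        for item in items:
            if stripped.lower().startswith(item.lower()):
                match = AMOUNT.search(stripped)
                if match:
                    charges[item] = Decimal(match.group(1).replace(",", ""))
    return charges


def to_csv_row(charges, items=ITEMS):
    """Format the charges as one CSV line, leaving missing items blank."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow([str(charges[i]) if i in charges else "" for i in items])
    return buf.getvalue()
```

The CSV line can then be appended to a file that the spreadsheet imports. Amounts are kept as `Decimal` rather than `float` so that cents are not subject to rounding error.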
 
  

