Old 11-14-2012, 06:15 PM   #1
Registered: Mar 2010
Distribution: Slackware
Posts: 562

Rep: Reputation: 66
What's a good way to parse a PDF?

I do the books for my condominium association. The city and county combine their billing for water, sewage, trash hauling, and recycling, and they make the bills available only as PDFs. For years I converted them to text files, then used a script to extract the data and insert it into a spreadsheet. A year ago they started writing their PDFs in new formats, ones that aren't predictable. The PDFs look the same, but the amounts for the billed items aren't in the same spot every month; in fact, sometimes the last item on the bill has its name on the last page of the PDF (and of the text conversion) but its amount on the first page. I assume they've gone to a columnar format that the conversion doesn't get 'right'. I've tried conversions with Ghostscript, Xpdf, OpenOffice, and Acrobat. Does anyone have another suggestion?
Old 11-15-2012, 04:52 AM   #2
Senior Member
Registered: Dec 2003
Location: Trondheim, Norway
Distribution: Debian and Ubuntu
Posts: 1,293

Rep: Reputation: 335Reputation: 335Reputation: 335Reputation: 335

The utility "pdftotext" has two options you could try, -raw and -layout, in addition to the default mode, which usually works best. Maybe try these options and see if the output is more predictable? Or maybe you can find a pattern, so you can use tools like grep to find the things you need?

On Debian systems it's in the package "poppler-utils". I don't know about Slackware; maybe you'll find it if you search for poppler?
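
To compare the three modes side by side, something like the following Python sketch might help. It only assumes that pdftotext is on the PATH and that passing "-" as the output file sends the text to stdout; the function names and the file name bill.pdf are made up for illustration.

```python
import shutil
import subprocess

# Mode name -> extra pdftotext flags. The empty list is the default mode.
MODES = {"default": [], "raw": ["-raw"], "layout": ["-layout"]}


def build_command(pdf_path, mode="default"):
    """Return the pdftotext argument list for one mode.

    The trailing "-" asks pdftotext to write the text to stdout.
    """
    return ["pdftotext", *MODES[mode], pdf_path, "-"]


def convert_all_modes(pdf_path):
    """Run pdftotext once per mode and return {mode: extracted text}."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found (it ships with poppler-utils on Debian)")
    results = {}
    for mode in MODES:
        proc = subprocess.run(
            build_command(pdf_path, mode),
            capture_output=True,
            text=True,
            check=True,  # raise if pdftotext exits with an error
        )
        results[mode] = proc.stdout
    return results
```

Calling convert_all_modes("bill.pdf") on two different months' bills and diffing the results per mode should show quickly which mode keeps each item name next to its amount.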
Old 11-16-2012, 12:37 PM   #3
Registered: Mar 2010
Distribution: Slackware
Posts: 562

Original Poster
Rep: Reputation: 66
I was already using pdftotext, but I had never read its man page. The -raw option does the job.
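
For anyone who finds this thread later, here is a rough Python sketch of the extract-and-load step that follows the conversion. It is not the script from this thread. It assumes that in the -raw output each billed item sits on one line as a label followed by a dollar amount, which may not hold for every bill, and the item names in the example are invented.

```python
import csv
import io
import re
from decimal import Decimal

# Assumed line shape: a label, whitespace, then a dollar amount ending the line.
# The real bill layout is unknown, so adjust this pattern to the actual text.
LINE_ITEM = re.compile(r"^(?P<label>.+?)\s+\$?(?P<amount>\d[\d,]*\.\d{2})\s*$")


def extract_items(text):
    """Return (label, Decimal amount) pairs for lines that look like billed items."""
    items = []
    for line in text.splitlines():
        match = LINE_ITEM.match(line.strip())
        if match:
            amount = Decimal(match.group("amount").replace(",", ""))
            items.append((match.group("label"), amount))
    return items


def to_csv(items):
    """Format the pairs as CSV text that a spreadsheet can import."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["item", "amount"])
    writer.writerows(items)
    return buf.getvalue()
```

Lines without a trailing amount (headings, addresses, and so on) are simply skipped, so it is worth checking the item count against the bill before trusting the totals.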