[SOLVED] What's a good way to parse a PDF?

RandomTroll · 11-14-2012, 06:15 PM

I do the books for my condominium association. The city/county combined billing for water, sewage, trash hauling, and recycling. They make the bills available as PDFs. For years I converted them to text files, extracted the data, and inserted it into a spreadsheet with a script (they don't make the bill available in any format other than PDF). A year ago they started using new formats to write their PDFs, ones that aren't predictable. The PDFs look the same, but the amounts associated with what they bill for aren't in the same spot every month; in fact, sometimes the last item on the bill has its name on the last page of the PDF and of the text conversion but its amount on the first page. I assume they've gone to a columnar format that the conversion doesn't get 'right'. I've tried conversions from ghostscript, xpdf, Open Office, and acrobat. Does anyone have another suggestion?

Guttorm · 11-15-2012, 04:52 AM

Hi

The utility "pdftotext" has two options you could try: -raw and -layout. (In addition to "default" which usually works best.) Maybe try these options and see if it's more predictable? Or maybe you can find a pattern so you can use tools like grep to find the things you need?

On Debian systems it's in the package "poppler-utils". I don't know about Slackware; maybe you can find it if you search for poppler?
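
For anyone scripting this, here is a rough Python sketch of the idea: run pdftotext in each of the three modes and look for a line item by pattern, roughly what grep would do. The file name "bill.pdf", the labels, and the amount format are made up for illustration; the real bill will need its own pattern. It assumes pdftotext is on the PATH.

```python
import re
import subprocess

# The three pdftotext modes mentioned above, mapped to their flags.
MODES = {"default": [], "raw": ["-raw"], "layout": ["-layout"]}


def pdftotext_cmd(pdf_path, mode="default"):
    """Build the pdftotext command line; '-' as the output file means stdout."""
    return ["pdftotext", *MODES[mode], pdf_path, "-"]


def extract_text(pdf_path, mode="default"):
    """Run pdftotext and return the converted text (pdftotext must be installed)."""
    result = subprocess.run(
        pdftotext_cmd(pdf_path, mode),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def find_amount(text, label):
    """Return the first amount like 12.34 that follows label on the same line.

    The label and amount format are hypothetical; returns None if not found.
    """
    pattern = re.compile(
        re.escape(label) + r"[^\n]*?\$?(\d[\d,]*\.\d{2})",
        re.IGNORECASE,
    )
    match = pattern.search(text)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Compare what each mode yields for the same (hypothetical) line item.
    for mode in MODES:
        text = extract_text("bill.pdf", mode)
        print(mode, find_amount(text, "Water"))
```

If one mode keeps each name on the same line as its amount month after month, that is the one to build the script around.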

RandomTroll · 11-16-2012, 12:37 PM

I was using pdftotext. I hadn't ever read its man page. -raw does the job.

Thanks.