combine text files with a regex

itzfritz · 06-18-2009, 09:47 AM

I have split several hundred pdf files into individual pages, for OCRing.
The OCR result will be text files, one per page.
They will be named <PDF file>_<page-NNNN>.txt.

I need to combine these text files back into files that represent the OCR result of the multi-page PDF originals.

Ex:
one.pdf -(split)->
one_page-0001.pdf, one_page-0002.pdf -(ocr)->
one_page-0001.txt, one_page-0002.txt -(recombine)->
one.txt

How might I process the recombining using the shell? I would imagine some sort of recursive `find`, with exec'ing.
I could just hack up a perl script, but this could probably be a one-liner....

Thanks!

druuna · 06-18-2009, 10:12 AM

Hi,

This should work:

cat one_page-*.txt > one.txt
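
A note on ordering: the shell expands a glob in sorted order, so this keeps the pages in sequence as long as the page numbers are zero-padded to the same width (as in the NNNN naming above). A minimal sketch of that behaviour, using made-up file contents in a scratch directory:

```shell
# Scratch directory so the demo does not touch real files.
demo_dir=$(mktemp -d)

# Three zero-padded pages, created out of order on purpose.
printf 'page 10\n' > "$demo_dir/one_page-0010.txt"
printf 'page 2\n'  > "$demo_dir/one_page-0002.txt"
printf 'page 1\n'  > "$demo_dir/one_page-0001.txt"

# The glob expands as 0001, 0002, 0010, so cat sees the pages in order.
cat "$demo_dir"/one_page-*.txt > "$demo_dir/one.txt"
cat "$demo_dir/one.txt"
# page 1
# page 2
# page 10
```

Without the zero padding (page-2 versus page-10), the sorted order would no longer be the page order.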

itzfritz · 06-18-2009, 10:27 AM

Quote:

Originally Posted by druuna

Hi,

This should work:

cat one_page-*.txt > one.txt

Thanks. I need to do this recursively for thousands of files, which are the result of splitting about 500 PDFs into individual pages.
I had imagined that `cat` would be involved, as in
find / -regex '(.*)[0-9]{4}.txt' -type f | xargs cat >> $1.txt
but AFAIK you can't get $1 from the regex grouping to use in `cat`.

I need to put humpty-dumpty back together again, about 500 times...
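
One way around the missing capture group is to let `sed` strip the page suffix and then loop over the unique base names. The following is only a sketch: the function name `combine_tree` is made up, and it assumes the `<base>_page-NNNN.txt` naming from the first post and that no path contains a newline.

```shell
# combine_tree DIR: rebuild one .txt per original PDF under DIR (recursive).
# Hypothetical helper; assumes <base>_page-NNNN.txt naming and no newlines
# in any path.
combine_tree() {
    # 1. find lists every page file below DIR.
    # 2. sed strips the "_page-NNNN.txt" suffix, leaving the base path.
    # 3. sort -u keeps one line per original PDF.
    find "$1" -type f -name '*_page-[0-9][0-9][0-9][0-9].txt' |
        sed 's/_page-[0-9]\{4\}\.txt$//' |
        sort -u |
        while IFS= read -r base; do
            # The glob expands in sorted order, so pages stay in sequence.
            cat "$base"_page-[0-9][0-9][0-9][0-9].txt > "$base.txt"
        done
}
```

It would be called as `combine_tree .` from the top of the directory tree that holds the OCR output.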

pwc101 · 06-18-2009, 10:43 AM

Perhaps a for loop instead?

Code:

for FILE in original_pdfs_*.pdf; do cat "${FILE%.pdf}"_*.txt > "${FILE%.pdf}.txt"; done

This assumes the original PDF file names share their base name with the OCR output, and that you can tell the originals apart from the split pages. The loop strips the .pdf extension from the FILE variable, cats every file matching that stripped name plus a wildcard (either * or a series of ?), and writes the result to a file with the original PDF's name and a .txt extension instead of .pdf.

It's difficult to explain the concept I have in my head, but I hope this helps a little!
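
If the originals cannot easily be told apart from the split pages by name, one option is to loop over every PDF and skip those that have no page files. This is a sketch rather than a tested production script: `combine_here` is a made-up name, and it assumes the `<base>_page-NNNN.txt` naming from the first post.

```shell
# combine_here: run inside the directory holding the PDFs and page .txt files.
# Hypothetical helper; assumes page files are named <base>_page-NNNN.txt.
combine_here() {
    for pdf in *.pdf; do
        base=${pdf%.pdf}              # strip the .pdf extension
        set -- "$base"_page-*.txt     # expand the page glob into $1, $2, ...
        [ -e "$1" ] || continue       # glob matched nothing: not an original, skip
        cat "$@" > "$base.txt"        # pages arrive in sorted (page) order
    done
}
```

A split page such as one_page-0001.pdf has no files named one_page-0001_page-*.txt, so it is skipped and no empty output file is created for it.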

PTrenholme · 06-18-2009, 10:43 AM

Have a look at the info file for find. It describes how to pass the arguments to the exec.

druuna · 06-18-2009, 10:46 AM

Hi,

Something like this:

Code:

for ORGFILE in *_page-0001.pdf; do cat "${ORGFILE%%_page-*.pdf}"_page-*.txt > "${ORGFILE%%_page-*.pdf}.txt"; done

I am assuming that all files are in one directory (the original PDFs, the single-page PDFs, and the single-page text files).

Should work, tried it.

itzfritz · 06-18-2009, 10:58 AM

Quote:

Originally Posted by PTrenholme

Have a look at the info file for find. It describes how to pass the arguments to the exec.

lol googling 'man find' gives some interesting results

So, given that '{}' is used to represent the current 'found' filename, I could use something like:
ex. pdf: pdffile_0001.txt

find . -regex '.*_[0-9]{4}.txt' -type f -exec \ // find a matching file
cat {} >> $( \
sed -e 's/_[0-9]{4}//' {} | \ // extract the prefix-part of filename
)

This means that I need to use '{}' twice: once as an argument to cat, and again in sed to get the part of the filename that I want, which should then be interpreted by the shell as another argument to cat?

Thanks for your help.
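
A caution on the attempt above: `{}` is substituted by `find`, while `$( ... )` and `>>` are evaluated once by the invoking shell before `find` even starts, so the redirection target cannot depend on the found file. Wrapping the commands in `sh -c` and passing the file names as arguments avoids this. The sketch below uses a made-up function name and assumes the `<base>_page-NNNN.txt` naming, with every document having a page 0001. It searches for first pages only, because `find` does not promise any particular order and appending page by page could scramble the result.

```shell
# combine_with_find DIR: locate each first page under DIR and rebuild the
# full text next to it. Hypothetical helper; assumes <base>_page-NNNN.txt
# naming and that every document has a page 0001.
combine_with_find() {
    find "$1" -type f -name '*_page-0001.txt' -exec sh -c '
        for first in "$@"; do
            # Parameter expansion does the job of the sed call.
            base=${first%_page-0001.txt}
            # The glob is sorted, so pages are concatenated in order.
            cat "$base"_page-*.txt > "$base.txt"
        done
    ' sh {} +
}
```

The `sh` after the quoted script fills `$0`; the found file names then arrive as `$1`, `$2`, and so on.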

itzfritz · 06-18-2009, 10:59 AM

Quote:

Originally Posted by druuna

Hi,

Something like this:

Code:

for ORGFILE in *_page-0001.pdf; do cat "${ORGFILE%%_page-*.pdf}"_page-*.txt > "${ORGFILE%%_page-*.pdf}.txt"; done

I am assuming that all files are in one directory (the original PDFs, the single-page PDFs, and the single-page text files).

Should work, tried it.

Interesting way to go about it. I will try it, thank you very much.
The last time I used 'for' on the command line was in the Windows shell, where 'for' is a freaking nightmare.

itzfritz · 06-18-2009, 01:02 PM

Thanks druuna.

For the record, this magic is accomplished using bash parameter expansion/substring removal (the %% stuff) (which I just learned, thankyouverymuch!).
Ref: http://bash-hackers.org/wiki/doku.php/syntax/pe

This works perfectly.

I wish Google didn't ignore non-alphanumeric characters in searches; this would have taken much less time to figure out...
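
For anyone else searching for the `%%` operator mentioned above, here is a small illustration of the suffix and prefix removal forms. The file name is invented, and chosen so that `%` and `%%` give different results.

```shell
# Invented file name whose base itself contains "_page-".
f='my_page-notes_page-0001.pdf'

echo "${f%.pdf}"           # my_page-notes_page-0001    shortest matching suffix removed
echo "${f%_page-*.pdf}"    # my_page-notes              shortest match of _page-*.pdf
echo "${f%%_page-*.pdf}"   # my                         longest match of _page-*.pdf
echo "${f#*_}"             # page-notes_page-0001.pdf   shortest matching prefix removed
```

With `%%` the longest match wins, so a base name that already contains `_page-` gets cut short; `%` is the safer choice in that case.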

colucix · 06-18-2009, 01:09 PM

As a reference you can take a look at the Advanced Bash Scripting Guide: http://tldp.org/LDP/abs/html/index.html.

druuna · 06-18-2009, 01:26 PM

Quote:

Thanks druuna.

You're welcome