Linux - General
This Linux forum is for general Linux questions and discussion. If it is Linux-related and doesn't seem to fit in any other forum, then this is the place.
I have split several hundred pdf files into individual pages, for OCRing.
The OCR result will be text files, one per page.
They will be named <PDF file>_<page-NNNN>.txt.
I need to combine these text files back into files that represent the OCR result of the multi-page PDF originals.
How might I process the recombining using the shell? I would imagine some sort of recursive `find`, with exec'ing.
I could just hack up a Perl script, but this could probably be a one-liner....
Thanks. I need to do this recursively for thousands of files, the result of splitting about 500 PDFs into individual pages.
I had imagined that `cat` would be involved, as in
find / -regex '(.*)[0-9]{4}.txt' -type f | xargs cat >> $1.txt
but AFAIK you can't get `$1` from the regex grouping to use with `cat`.
I need to put humpty-dumpty back together again, about 500 times...
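For what it's worth, one way around the missing "capture group" (a sketch, not from the thread, assuming page files named like `mydoc_page-0001.txt`) is to derive each output name from the page file itself with parameter expansion rather than a regex group:

```shell
# Sketch: no regex capture needed; strip the page suffix from each
# found file to get the base name, and append pages in sorted order.
find . -type f -name '*_page-[0-9][0-9][0-9][0-9].txt' | sort |
while IFS= read -r page; do
    base=${page%_page-*.txt}      # strip the _page-NNNN.txt suffix
    cat "$page" >> "${base}.txt"  # append this page to the combined file
done
```

The `sort` matters: `find` does not guarantee traversal order, but the zero-padded page numbers sort lexically into page order. (This breaks on file names containing newlines, which is unlikely here.)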
for FILE in original_pdfs_*.pdf; do cat "${FILE%.pdf}"_*.txt > "${FILE%.pdf}.txt"; done
Assuming the original PDF file names have the same base as the OCR output, and you can distinguish the two, you just strip the .pdf extension off the FILE variable, then cat that stripped name plus a wildcard (either * or a series of ?) into a file named after the original PDF but with a .txt extension instead of .pdf.
It's difficult to explain the concept I have in my head, but I hope this helps a little!
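To make the expansion concrete (illustrative file name, not from the thread):

```shell
# What the parameter expansions in the loop above produce.
FILE=original_pdfs_report.pdf
echo "${FILE%.pdf}"       # -> original_pdfs_report      (.pdf suffix removed)
echo "${FILE%.pdf}.txt"   # -> original_pdfs_report.txt  (combined output name)
```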
Have a look at the info file for find. It describes how to pass the arguments to the exec.
lol googling 'man find' gives some interesting results
So, given that '{}' is used to represent the current 'found' filename, I could use something like:
ex. page file: pdffile_0001.txt

find . -regex '.*_[0-9]{4}.txt' -type f -exec \   # find a matching file
    cat {} >> $( \
        sed -e 's/_[0-9]{4}//' {} \               # extract the prefix part of the filename
    )
This means I need to use '{}' twice: once as an argument to cat, and again in sed to get the part of the filename that I want, whose output should then be interpreted by the shell as the redirection target for cat?
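The usual way to reuse a found name more than once (a sketch, not spelled out in the thread) is to hand `{}` to a small `sh -c` script exactly once, where it becomes `$1` and can be used freely:

```shell
# Sketch: find passes each file name into sh -c as $1; sed then derives
# the prefix, and the page is appended to <prefix>.txt.
# Caveat: find's traversal order is not guaranteed, so if page order
# matters (it does here), sort the names first as in the loop variants.
find . -type f -name '*_[0-9][0-9][0-9][0-9].txt' -exec sh -c '
    prefix=$(printf "%s\n" "$1" | sed -e "s/_[0-9]\{4\}\.txt$//")
    cat "$1" >> "${prefix}.txt"
' _ {} \;
```

The `_` fills `$0` inside the `sh -c` script so that `{}` lands in `$1`.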
for ORGFILE in *page-001.pdf; do cat "${ORGFILE%%_page-*.pdf}"_*.txt > "${ORGFILE%%_page-*.pdf}.txt"; done
I am assuming that all files (original PDFs, single-page PDFs, and single-page text files) are in one directory.
Should work, tried it.
interesting way to go about it. I will try it, thank you very much.
The last time I used 'for' on the command line was in the Windows shell, where 'for' is a freaking nightmare.
Thanks druuna.
For the record, this magic is accomplished using bash parameter expansion / substring removal (the %% stuff, which I just learned, thankyouverymuch!).
Ref: http://bash-hackers.org/wiki/doku.php/syntax/pe
This works perfectly.
I wish Google didn't ignore non-alphanumeric characters in search; this would have taken much less time to figure out...
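A quick sketch of the substring-removal operators from that reference, on hypothetical names (not from the thread):

```shell
# % strips the shortest matching suffix, %% the longest.
ORGFILE=report_page-001.pdf
echo "${ORGFILE%%_page-*.pdf}"   # -> report
V=a_page-1_page-2.pdf
echo "${V%_page-*.pdf}"          # -> a_page-1  (shortest suffix match)
echo "${V%%_page-*.pdf}"         # -> a         (longest suffix match)
```

The difference only shows when the pattern can match more than one suffix, as in the second name above.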