LinuxQuestions.org
Old 06-18-2009, 09:47 AM   #1
itzfritz
Member
 
Registered: Oct 2004
Location: Babylon, New York
Distribution: debian lenny, ubuntu intrepid
Posts: 70

Rep: Reputation: 15
combine text files with a regex


I have split several hundred pdf files into individual pages, for OCRing.
The OCR result will be text files, one per page.
They will be named <PDF file>_page-<NNNN>.txt.

I need to combine these text files back into files that represent the OCR result of the multi-page PDF originals.

Ex:
one.pdf -(split)->
one_page-0001.pdf, one_page-0002.pdf -(ocr)->
one_page-0001.txt, one_page-0002.txt -(recombine)->
one.txt

How might I process the recombining using the shell? I would imagine some sort of recursive `find`, with exec'ing.
I could just hack up a perl script, but this could probably be a one-liner....



Thanks!
 
Old 06-18-2009, 10:12 AM   #2
druuna
LQ Veteran
 
Registered: Sep 2003
Posts: 10,532
Blog Entries: 7

Rep: Reputation: 2371
Hi,

This should work:

cat one_page-*.txt > one.txt
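
A quick sketch of why this keeps the pages in order: the shell expands the glob in sorted order, so zero-padded page numbers come out as 1, 2, 3. The scratch directory, file names and contents below are made up for illustration.

```shell
#!/usr/bin/env bash
set -eu

# Scratch directory with three hypothetical OCR pages, created out of order.
workdir=$(mktemp -d)
cd "$workdir"
printf 'page three\n' > one_page-0003.txt
printf 'page one\n'   > one_page-0001.txt
printf 'page two\n'   > one_page-0002.txt

# The glob expands in sorted order, so the pages are concatenated 1, 2, 3.
cat one_page-*.txt > one.txt
cat one.txt
```

Without the zero padding (page-2 vs. page-10) the sorted order would no longer be the page order, so this relies on the fixed-width numbering described in the first post.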
 
Old 06-18-2009, 10:27 AM   #3
itzfritz
Member
 
Registered: Oct 2004
Location: Babylon, New York
Distribution: debian lenny, ubuntu intrepid
Posts: 70

Original Poster
Rep: Reputation: 15
Quote:
Originally Posted by druuna
Hi,

This should work:

cat one_page-*.txt > one.txt
Thanks. I need to do this recursively for thousands of files, which are the result of splitting about 500 PDFs into individual pages.
I had imagined that `cat` would be involved, as in
find / -regex '(.*)[0-9]{4}.txt' -type f | xargs cat >> $1.txt
but AFAIK you can't get $1 from the regex grouping to use with `cat`.

I need to put humpty-dumpty back together again, about 500 times...
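
One way to get at the "grouping" part without find is to strip the page suffix with sed and keep each prefix once. This is only a sketch: the file names are hypothetical and it assumes the `_page-NNNN.txt` naming from the first post.

```shell
#!/usr/bin/env bash
set -eu

# Scratch directory with pages from two hypothetical documents.
workdir=$(mktemp -d)
cd "$workdir"
touch one_page-0001.txt one_page-0002.txt two_page-0001.txt

# Strip the "_page-NNNN.txt" suffix with a regex and keep each prefix once.
prefixes=$(printf '%s\n' *_page-*.txt | sed -E 's/_page-[0-9]{4}\.txt$//' | sort -u)
echo "$prefixes"
```

Each line of the result is the base name of one original PDF, which could then drive a loop that cats the matching pages together.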
 
Old 06-18-2009, 10:43 AM   #4
pwc101
Senior Member
 
Registered: Oct 2005
Location: UK
Distribution: Slackware
Posts: 1,846

Rep: Reputation: 128
Perhaps a for loop instead?
Code:
for FILE in original_pdfs_*.pdf; do cat "${FILE%.pdf}"_*.txt > "${FILE%.pdf}.txt"; done
This assumes the original PDF file names share a base with the OCR output, and that you can tell the originals apart from the split pages. The loop strips .pdf from the FILE variable, cats every file matching that stripped name plus a wildcard (either * or a series of ?), and writes the result to the original PDF name with .txt as the extension instead of .pdf.

It's difficult to explain the concept I have in my head, but I hope this helps a little!
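
To make the idea concrete, here is how the pieces of that loop expand for a single file. The name report.pdf is hypothetical.

```shell
#!/usr/bin/env bash
FILE=report.pdf            # hypothetical original PDF name
base=${FILE%.pdf}          # strip the trailing ".pdf"
pattern="${base}_*.txt"    # what the cat glob would look for
output="${base}.txt"       # where the combined text would go

echo "$base"
echo "$pattern"
echo "$output"
```

The output name has no underscore after the base, so it never matches the input pattern and a rerun does not feed the combined file back into itself.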
 
Old 06-18-2009, 10:43 AM   #5
PTrenholme
Senior Member
 
Registered: Dec 2004
Location: Olympia, WA, USA
Distribution: Fedora, (K)Ubuntu
Posts: 4,147

Rep: Reputation: 330
Have a look at the info file for find. It describes how to pass the arguments to the exec.
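
For reference, a small sketch of the two standard forms of -exec. With `\;` the command runs once per found file and `{}` is replaced by that file's path; with `+` find appends as many paths as fit to a single invocation. The scratch directory and file names are made up.

```shell
#!/usr/bin/env bash
set -eu

workdir=$(mktemp -d)
touch "$workdir/one_page-0001.txt" "$workdir/one_page-0002.txt"

# \; form: basename runs once for each file found.
per_file=$(find "$workdir" -type f -name '*.txt' -exec basename {} \; | sort)
echo "$per_file"

# + form: both paths go to one echo, so only one line comes out.
batched=$(find "$workdir" -type f -name '*.txt' -exec echo {} + | wc -l | tr -d ' ')
echo "$batched"
```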
 
Old 06-18-2009, 10:46 AM   #6
druuna
LQ Veteran
 
Registered: Sep 2003
Posts: 10,532
Blog Entries: 7

Rep: Reputation: 2371
Hi,

Something like this:
Code:
for ORGFILE in *_page-0001.pdf; do cat "${ORGFILE%%_page-*.pdf}"_*.txt > "${ORGFILE%%_page-*.pdf}.txt" ; done
I am assuming that all files are in one directory (originals, single-page PDFs and single-page text files).

Should work, tried it.
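
A self-contained way to try this out without touching real data. It is only a sketch: the documents "one" and "two" are hypothetical, and it assumes the four-digit page numbers from the first post.

```shell
#!/usr/bin/env bash
set -eu

# Scratch directory with two hypothetical documents, two pages each.
workdir=$(mktemp -d)
cd "$workdir"
for doc in one two; do
  for n in 0001 0002; do
    : > "${doc}_page-${n}.pdf"
    printf '%s page %s\n' "$doc" "$n" > "${doc}_page-${n}.txt"
  done
done

# Page 0001 exists exactly once per original, so the loop runs once per document.
for first in *_page-0001.pdf; do
  base=${first%_page-*.pdf}
  cat "${base}"_page-*.txt > "${base}.txt"
done

cat one.txt two.txt
```

Looping over the first page only is what keeps each document from being processed once per page.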
 
Old 06-18-2009, 10:58 AM   #7
itzfritz
Member
 
Registered: Oct 2004
Location: Babylon, New York
Distribution: debian lenny, ubuntu intrepid
Posts: 70

Original Poster
Rep: Reputation: 15
Quote:
Originally Posted by PTrenholme
Have a look at the info file for find. It describes how to pass the arguments to the exec.
lol googling 'man find' gives some interesting results

So, given that '{}' is used to represent the current 'found' filename, I could use something like:
ex. pdf: pdffile_0001.txt

find . -regex '.*_[0-9]{4}.txt' -type f -exec \ // find a matching file
cat {} >> $( \
sed -e 's/_[0-9]{4}//' {} | \ // extract the prefix-part of filename
)

This means that I need to use '{}' twice: once as an argument to cat, and again in sed to get the part of the filename that I want, which should then be interpreted by the shell as another argument to cat?

Thanks for your help.
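
Two things would stop the sketch above from working as written: the `$( ... )` is expanded by the calling shell before find even starts, and sed given a file name as an operand edits that file's contents, not its name. A possible recursive variant hands the found paths to an inline `sh -c` script and derives the prefix there with parameter expansion. The directory layout and file names below are made up, and the `_page-NNNN.txt` naming is assumed from the first post.

```shell
#!/usr/bin/env bash
set -eu

# Scratch tree with pages in two different directories.
workdir=$(mktemp -d)
mkdir -p "$workdir/x/y"
printf 'p1\n' > "$workdir/x/doc_page-0001.txt"
printf 'p2\n' > "$workdir/x/doc_page-0002.txt"
printf 'q1\n' > "$workdir/x/y/other_page-0001.txt"

# For every first page found below $workdir, derive the prefix from the
# path and concatenate all pages that share it. The extra "sh" fills $0.
find "$workdir" -type f -name '*_page-0001.txt' -exec sh -c '
  for first in "$@"; do
    base=${first%_page-0001.txt}
    cat "$base"_page-[0-9][0-9][0-9][0-9].txt > "$base.txt"
  done
' sh {} +

cat "$workdir/x/doc.txt" "$workdir/x/y/other.txt"
```

Each combined file lands next to its pages, and since its name lacks the `_page-NNNN` part it is not picked up again on a second run.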
 
Old 06-18-2009, 10:59 AM   #8
itzfritz
Member
 
Registered: Oct 2004
Location: Babylon, New York
Distribution: debian lenny, ubuntu intrepid
Posts: 70

Original Poster
Rep: Reputation: 15
Quote:
Originally Posted by druuna
Hi,

Something like this:
Code:
for ORGFILE in *_page-0001.pdf; do cat "${ORGFILE%%_page-*.pdf}"_*.txt > "${ORGFILE%%_page-*.pdf}.txt" ; done
I am assuming that all files are in one directory (originals, single-page PDFs and single-page text files).

Should work, tried it.
Interesting way to go about it. I will try it, thank you very much.
The last time I used 'for' on the command line was in the Windows shell, where 'for' is a freaking nightmare.
 
Old 06-18-2009, 01:02 PM   #9
itzfritz
Member
 
Registered: Oct 2004
Location: Babylon, New York
Distribution: debian lenny, ubuntu intrepid
Posts: 70

Original Poster
Rep: Reputation: 15
Quote:
Originally Posted by druuna
Hi,

Something like this:
Code:
for ORGFILE in *_page-0001.pdf; do cat "${ORGFILE%%_page-*.pdf}"_*.txt > "${ORGFILE%%_page-*.pdf}.txt" ; done
I am assuming that all files are in one directory (originals, single-page PDFs and single-page text files).

Should work, tried it.

Thanks druuna.

For the record, this magic is accomplished using bash parameter expansion/substring removal (the %% stuff) (which I just learned, thankyouverymuch!).
Ref: http://bash-hackers.org/wiki/doku.php/syntax/pe
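
For anyone else landing here, a short sketch of the four substring-removal forms. The file name is hypothetical and deliberately contains "_page-" twice so the shortest and longest matches differ.

```shell
#!/usr/bin/env bash
name=a_page-1_page-0002.pdf   # hypothetical name containing "_page-" twice

short_suffix=${name%_page-*}    # shortest trailing match removed
long_suffix=${name%%_page-*}    # longest trailing match removed
short_prefix=${name#*_}         # shortest leading match removed
long_prefix=${name##*_}         # longest leading match removed

echo "$short_suffix"   # a_page-1
echo "$long_suffix"    # a
echo "$short_prefix"   # page-1_page-0002.pdf
echo "$long_prefix"    # page-0002.pdf
```

If a PDF name could itself contain "_page-", the single `%` form would be the safer choice for stripping the page suffix.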

This works perfectly.

I wish Google didn't ignore non-alphanumeric characters in searches; this would have taken much less time to figure out...
 
Old 06-18-2009, 01:09 PM   #10
colucix
Moderator
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,458

Rep: Reputation: 1941
As a reference you can take a look at the Advanced Bash Scripting Guide: http://tldp.org/LDP/abs/html/index.html.
 
Old 06-18-2009, 01:26 PM   #11
druuna
LQ Veteran
 
Registered: Sep 2003
Posts: 10,532
Blog Entries: 7

Rep: Reputation: 2371
Quote:
Thanks druuna.
You're welcome
 
  


All times are GMT -5. The time now is 12:24 AM.

Main Menu
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
identi.ca: @linuxquestions
Facebook: linuxquestions Google+: linuxquestions
Open Source Consulting | Domain Registration