LinuxQuestions.org
Old 05-29-2008, 02:34 PM   #1
lixbie
Batch Text Extract Multiple Files


Hi All & thanks for reading this thread

I have 1000+ .pdf files that were wrongly renamed after I undeleted them. I looked around and found a .pdf to .txt batch converter, so now I also have 1000+ .txt files, each with the original file name in the first five characters of its second line. I would like to batch rename the .pdf files after the first five characters of the second line of the matching .txt file.

Example:

1. BATCH INPUT: a folder containing both 1000+ pdf and 1000+ .txt files like:
Undeleted001.pdf & Undeleted001.txt
Undeleted002.pdf & Undeleted002.txt
2. BATCH NEEDS TO EXTRACT either from .pdf or .txt files the first five characters of the second line for each file like:
UNDELETED001.PDF (.TXT) 1st five characters, 2nd Line:"12345"
UNDELETED002.PDF (.TXT) 1st five characters, 2nd Line:"ABCDE"
3. BATCH WILL RENAME ALL PDF FILES:
Undeleted001.pdf ---------->12345.pdf
Undeleted002.pdf ---------->ABCDE.pdf

I've seen some text extraction script examples using the cat and sed commands, but they apply to only one file and you have to type in the file name.

I have some Basic, VBA and Matlab programming knowledge, I am currently reading up on C/C++ and bash, and I am willing to learn Perl or Python or Lisp, or all of them. I started traveling a one-way road that has forked tremendously, but I am determined to get there. Some people do crosswords, I like programming.

Can you please: provide a road map of the commands to do this in the programming language of choice. Like:

Using Language "Language"
Step1: Batch Input: commands "this/that" options "this/that"
Step2: Text Extraction: commands "this/that" options "this/that"
Step3: File Rename: commands "this/that" options "this/that"


Feel free to provide the complete code, but if you do so I won't have to read on how to use those commands and what those options do.

Thanks in Advance for your help

Linux Newbie
 
Old 05-29-2008, 03:45 PM   #2
bigearsbilly
try this...
Code:
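#!/usr/bin/perl
use strict;
use warnings;

while (<ARGV>) {
    if ($. == 2) {                              # line nr 2
        my ($nm) = m/(.{5})/;                   # rip the name up: first five characters
        (my $pdf = $ARGV) =~ s/\.txt$/.pdf/;    # the pdf that belongs to this txt file
        print STDOUT "cp $pdf $nm.pdf\n" if defined $nm;   # make a command
        close ARGV;     # resets $. and effectively opens the next file
    }
}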
it prints on stdout so you can check it first,
Code:
$ ./prog.pl *.txt
cp 1.pdf fruit.pdf
cp 2.pdf fibre.pdf
cp 3.pdf ringo.pdf
NOTE: i've left it as cp rather than mv

this is a technique i use a lot, do it like this:

Code:
$ ./prog.pl *.txt > 1.sh   # save to a text file and check it's ok
$ cat 1.sh
cp 1.pdf fruit.pdf
cp 2.pdf fibre.pdf
cp 3.pdf ringo.pdf

$ sh 1.sh           # run it as a script, bonus is it leaves a record too
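
The same generate, review, run technique can be sketched in plain shell. This is only a rough sketch under stated assumptions: the function name make_rename_script and the file name rename.sh are made up for this example, every NAME.txt is expected to sit next to a NAME.pdf, and file names and prefixes are assumed to contain no whitespace or shell metacharacters.

```shell
#!/bin/sh
# Sketch: print one "cp OLD.pdf NEW.pdf" command per text file instead of
# copying right away, so the commands can be reviewed before they are run.
# make_rename_script is a hypothetical helper name, not from the thread.
# Assumes file names and prefixes contain no whitespace or metacharacters.
make_rename_script() {
    for txt in "$@"; do
        pdf="${txt%.txt}.pdf"
        [ -e "$pdf" ] || continue            # no matching pdf: skip
        # first five characters of the second line
        prefix=$(sed -n '2{p;q;}' "$txt" | cut -c1-5)
        [ -n "$prefix" ] || continue         # short file or empty line: skip
        printf 'cp %s %s.pdf\n' "$pdf" "$prefix"
    done
}

# Possible usage:
#   make_rename_script *.txt > rename.sh   # generate
#   cat rename.sh                          # review
#   sh rename.sh                           # run, and keep rename.sh as a record
```

Like the Perl version, it prints cp rather than mv, so nothing is lost if a generated line turns out to be wrong.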
 
Old 05-30-2008, 08:05 PM   #3
lixbie
Dear bigearsbilly:

Thank you for your fast response. You did post the complete script (I think) and more because of the record keeping; still I have to go and find out how this works. Thanks this will keep me busy for a few days.


Thanks Again

Lixbie

Last edited by lixbie; 05-30-2008 at 08:08 PM.
 
Old 06-02-2008, 04:07 PM   #4
bigearsbilly
it works as i understand the problem.
i think!
 
Old 06-30-2008, 09:13 PM   #5
lixbie
Dear bigearsbilly:

I hope that you get to read this. After seeing your solution, reading some of O'Reilly's "Programming Perl" and stumbling upon some stones, I managed to create the scripts below.

Thanks Again
Lixbie

1st Stone: Some files were *.pdf and others *.PDF
Code:
#!/usr/bin/perl
# Change the .PDF extension to .pdf in file names
# Use as ./ren1ch.pl *.PDF > ren.sh
print STDOUT "#!/bin/sh\n";
foreach my $old (@ARGV)                    # each *.PDF file named on the command line
{
    (my $name = $old) =~ s/\.PDF$/.pdf/;   # change the .PDF extension to .pdf in $name
    print STDOUT "mv $old $name\n\n";      # move old file to new file
}
2nd Stone: The name of the file wasn't always on the 5th line, but its line always started with a 12-character-long text.
3rd: Large text needed to be abbreviated
4th: Spaces had to be changed to _
5th: Found some non-word characters in the text
6th: Some files were split and had similar names, so one was overwriting the other and I was getting only 90% of the original files. Added the counter and it was OK.
Code:
#!/usr/bin/perl
# Rename all pdf files to $name according to a line in the *.txt file
# Use it as ./ren2ren *.txt > ren.sh
while (<ARGV>)                                  # while ($_ = <ARGV>)
{
    if (/texttext:/)                            # line contains my text:
    {
        ($name) = substr($_, 12);               # drop the 12-character label
        if ($name =~ /\s(.*)\b/)
        {
            $name = $1;
            $name =~ s/[some large text]/\0/g;  # Abbreviate
            $name =~ s/\W+/_/g;                 # non-word characters (spaces too) to _
            $name =~ s/_+/_/g;                  # collapse repeated _
            ($ARGV2) = $ARGV;                   # Keep ARGV... still testing
            $ARGV2 =~ s/\.txt$/.pdf/;           # name of the matching pdf file
            $AI++;                              # counter, so similar names don't collide
            print STDOUT "cp $ARGV2 ${name}_F$AI.pdf\n\n";
            print STDOUT "mv $ARGV\t\ttxt/$ARGV\n";       # file the txt away in txt/
            print STDOUT "mv $ARGV2\t\tpdf/$ARGV2\n\n";   # file the pdf away in pdf/
            close ARGV;                         # back to top
        }
    }
}
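
For comparison, the 4th to 6th stones (spaces to _, non-word characters, and names that collide) could be handled in shell roughly as follows. This is a sketch, not the scripts above: both function names are invented for the example, and unlike the $AI counter, which numbers every file, this variant only adds a _F number when the plain name is already taken.

```shell
#!/bin/sh
# Sketch of two of the steps above in shell. Both function names are
# hypothetical, chosen for this example only.

# Turn every run of non-word characters (spaces included) into a single "_".
sanitize_name() {
    printf '%s\n' "$1" |
        sed -e 's/[^A-Za-z0-9_][^A-Za-z0-9_]*/_/g' -e 's/__*/_/g'
}

# Print a target name that does not exist yet in the current directory:
# BASE.pdf first, then BASE_F1.pdf, BASE_F2.pdf, ...
unique_target() {
    base=$1
    candidate="$base.pdf"
    n=0
    while [ -e "$candidate" ]; do
        n=$((n + 1))
        candidate="${base}_F$n.pdf"
    done
    printf '%s\n' "$candidate"
}

# Possible usage:
#   name=$(sanitize_name "Annual report (2007) - v2")
#   cp Undeleted001.pdf "$(unique_target "$name")"
```

Checking for an existing target right before each copy is what stops one split file from silently overwriting another.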
 
Old 06-30-2008, 10:10 PM   #6
ghostdog74
shell
Code:
for PDF in *.pdf
do
 txt="${PDF%.pdf}.txt"      # same name with a .txt extension
 if [ -e "${txt}" ]; then
   # first five characters of line 2 of the matching text file
   fivechar=$(awk 'NR==2{print substr($0,1,5)}' "${txt}")
   mv "$PDF" "${fivechar}.pdf"
   # add code to remove remaining txt files
 fi
done
Quote:
Feel free to provide the complete code, but if you do so I won't have to read on how to use those commands and what those options do.
you don't have to read on how to use them? that's not the way to learn.

Last edited by ghostdog74; 06-30-2008 at 10:12 PM.
 
Old 06-30-2008, 10:17 PM   #7
chrism01
Looks like you're going with Perl, so bookmark this: http://perldoc.perl.org/
 
Old 07-01-2008, 08:24 AM   #8
lixbie
Thanks, ghostdog74 & chrism01, for your replies.

Ghost: Is that bash shell? I don't understand clearly the purpose of the script but I guess that it changes the uppercase PDF for lowercase pdf? Right?

Chris: Thanks for the link, it's a very good one. One thing I haven't been able to find is how to break long lines in Perl without trouble. I would like to tab subroutines to the right, keep all the text on the screen and keep all my comments in one neat column; but when the following line is tabbed three times

$name =~ s/[some very very very very large text]/\0/g; # Abbreviate

the comment goes off the screen to the right.

Is there a way to break this line in Perl like this?

$name =~ s/[some very very very
very large text]/\0/g; # Abbreviate
 
Old 07-01-2008, 11:22 AM   #9
ghostdog74
Quote:
Originally Posted by lixbie
Thanks, ghostdog74 & chrism01, for your replies.

Ghost: Is that bash shell? I don't understand clearly the purpose of the script but I guess that it changes the uppercase PDF for lowercase pdf? Right?
Yes, it's shell (bash), and no, it doesn't change uppercase PDF to lowercase. It goes through all your pdf files, checks for a corresponding .txt file of the same name, gets the first 5 chars of the second line of that text file and uses them to rename the pdf file. If you are interested, see my sig for a bash link.
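
To make those steps easier to follow one at a time, here is a rough sketch that splits them into two functions. The function names are made up for this example, and the two extra guards (skip a file when line 2 gives nothing, never overwrite an existing target) are assumptions added here, not part of the loop posted earlier. It also assumes the five characters never contain a "/".

```shell
#!/bin/sh
# Sketch of the steps described above, with two extra guards that are
# assumptions of this example: skip a file when the second line is empty,
# and never overwrite an existing target. Function names are hypothetical.

# Print the first five characters of line 2 of the given text file.
second_line_prefix() {
    awk 'NR==2 { print substr($0, 1, 5); exit }' "$1"
}

# Rename every NAME.pdf in the current directory that has a NAME.txt.
rename_pdfs() {
    for pdf in *.pdf; do
        [ -e "$pdf" ] || continue            # glob matched nothing
        txt="${pdf%.pdf}.txt"
        [ -e "$txt" ] || continue            # no corresponding text file
        prefix=$(second_line_prefix "$txt")
        [ -n "$prefix" ] || continue         # nothing usable on line 2
        [ -e "$prefix.pdf" ] && continue     # target exists: leave it alone
        mv -- "$pdf" "$prefix.pdf"
    done
}

# Possible usage, from inside the folder that holds the files:
#   rename_pdfs
```

Files that are skipped keep their Undeleted name, so they are easy to spot and handle by hand afterwards.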
 
Old 07-01-2008, 08:12 PM   #10
chrism01
Well, you can store a string in a var, then interpolate into the regex. See the tutorial here: http://perldoc.perl.org/perlretut.html, specifically 'More on characters, strings, and character classes'.
To assign a long string in sections, use the '.' concat operator http://perldoc.perl.org/perlop.html#Constant-Folding

eg
Code:
$str= "asdfgh".
      "zxcvb";

# same as
$str="asdfghzxcvb";
PS always start your progs with

#!/usr/bin/perl -w
use strict;

Those 2 strictures ( -w = warnings & use strict) enforce proper coding eg declarations and warn you of dodgy/broken syntax.
To syntax test a perl prog use

>perl -wc prog.pl
which will do a test compile without running it.

Last edited by chrism01; 07-01-2008 at 08:15 PM.
 
Old 07-02-2008, 09:56 AM   #11
lixbie
Thank you All for your help.

With your assistance and 40 hours of work I managed to produce the previously posted scripts. With them I was able to rename 24,000+ pdf files.

I didn't know that bash could also be used. Nevertheless it is still in my "to pursue" list.

Thanks Again
Lixbie
 
  

