Batch Text Extract Multiple Files

lixbie · 05-29-2008, 02:34 PM

Hi All & thanks for reading this thread

I have 1000+ .pdf files that were wrongly renamed after I undeleted them. I looked around and found a .pdf to .txt batch converter, so now I also have 1000+ .txt files, each with the original file name in the first five characters of its second line. I would like to batch rename the .pdf files after those first five characters of the second line.

Example:

1. BATCH INPUT: a folder containing both 1000+ pdf and 1000+ .txt files like:
Undeleted001.pdf & Undeleted001.txt
Undeleted002.pdf & Undeleted002.txt
2. BATCH NEEDS TO EXTRACT either from .pdf or .txt files the first five characters of the second line for each file like:
UNDELETED001.PDF (.TXT) 1st five characters, 2nd Line:"12345"
UNDELETED002.PDF (.TXT) 1st five characters, 2nd Line:"ABCDE"
3. BATCH WILL RENAME ALL PDF FILES:
Undeleted001.pdf ---------->12345.pdf
Undeleted002.pdf ---------->ABCDE.pdf

I've seen some text extract script example using the cat and the sed commands but they apply only to one file and you have to write in the file name.

I have some Basic, VBA and Matlab programming knowledge, I am currently reading up on C/C++ and bash, and am willing to learn either Perl or Python or Lisp or all of them. I started traveling a one-way road that has forked tremendously, but I am determined to get there. Some people do crosswords, I like programming.

Can you please provide a road map of the commands to do this in the programming language of your choice? Like:

Using Language "Language"
Step1: Batch Input: commands "this/that" options "this/that"
Step2: Text Extraction: commands "this/that" options "this/that"
Step3: File Rename: commands "this/that" options "this/that"

Feel free to provide the complete code, but if you do so I won't have to read on how to use those commands and what those options do.

Thanks in Advance for your help

Linux Newbie

bigearsbilly · 05-29-2008, 03:45 PM

try this...

Code:

#!/usr/bin/perl


while (<ARGV>) {
    if ($. == 2) {           # line nr 2

        ($nm) = m/(.{5})/;            # rip the name up and make a command
        ($pdf = $ARGV) =~ s/\.txt$/.pdf/;   # 1.txt -> 1.pdf
        print STDOUT "cp $pdf $nm.pdf\n";   # print the command, don't run it
        close ARGV;     # resets $. and effectively opens the next file
    }
}

it prints on stdout so you can check it first,

Code:

$ ./prog.pl *.txt
cp 1.pdf fruit.pdf
cp 2.pdf fibre.pdf
cp 3.pdf ringo.pdf

NOTE: i've left it as cp rather than mv

this is a technique i use a lot, do it like this:

Code:

$ ./prog.pl *.txt > 1.sh   # save to a text file and check it's ok
$ cat 1.sh
cp 1.pdf fruit.pdf
cp 2.pdf fibre.pdf
cp 3.pdf ringo.pdf

$ sh 1.sh           # run it as a script, bonus is it leaves a record too
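
For anyone who prefers to stay in the shell, the same print-first, run-later idea can be sketched as below. This is only a sketch under assumptions: `gen_cmds` is a made-up helper name, every .pdf is assumed to sit next to a .txt with the same base name, and the extracted names are assumed to contain no quotes or other shell-special characters.

```shell
# Hypothetical helper: print one "cp" command per .txt file in directory $1,
# naming the copy after the first five characters of line 2 of that file.
gen_cmds() {
    for txt in "$1"/*.txt; do
        [ -e "$txt" ] || continue          # glob matched nothing
        name=$(awk 'NR==2{print substr($0,1,5); exit}' "$txt")
        [ -n "$name" ] || continue         # no second line, or it is empty
        base=${txt%.txt}                   # strip the .txt extension
        # Only print the command; nothing is copied at this point.
        printf 'cp "%s.pdf" "%s/%s.pdf"\n' "$base" "$1" "$name"
    done
}
```

Used as `gen_cmds . > 1.sh`, then read 1.sh, then `sh 1.sh`, it leaves the same kind of record as the Perl version.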

lixbie · 05-30-2008, 08:05 PM

Dear bigearsbilly:

Thank you for your fast response. You did post the complete script (I think) and more because of the record keeping; still I have to go and find out how this works. Thanks this will keep me busy for a few days.

Thanks Again

Lixbie

bigearsbilly · 06-02-2008, 04:07 PM

it works as i understand the problem.
i think!

lixbie · 06-30-2008, 09:13 PM

Dear bigearsbilly:

I hope that you get to read this. After seeing your solution, reading some of O'Reilly's "Programming Perl" and stumbling upon some stones, I managed to create the attached scripts.

Thanks Again
Lixbie

1st Stone: Some files were *.pdf and others *.PDF
Code:

#!/usr/bin/perl
# Change the .PDF extension to .pdf in file names
# Use as: ./ren1ch.pl *.PDF > ren.sh
print STDOUT "#!/bin/sh\n";
foreach $file (@ARGV)                     # each *.PDF name on the command line
{
    ($name = $file) =~ s/\.PDF$/.pdf/;    # change the extension only, not the rest of the name
    print STDOUT "mv $file $name\n\n";    # move old file to new file
}

2nd Stone: The name of the file wasn't always on the 5th line, but it always started with a 12 character long text.
3rd: Large text needed to be abbreviated
4th: Spaces had to be changed to _
5th: Found some non-word characters in the text
6th: Some files were split and had similar names, so one was overwriting the other and I was getting 90% of the original files. Added the counter and OK.
Code:

#!/usr/bin/perl
# Rename all pdf files to $name according to a line in the *.txt file
# Use it as: ./ren2ren *.txt > ren.sh
while (<ARGV>)                                  # while ($_ = <ARGV>)
{
    if (/texttext:/)                            # line contains my text:
    {
        ($name) = substr($_, 12);               # skip the first 12 characters
        if ($name =~ /\s(.*)\b/)
        {
            $name = $1;
            $name =~ s/[some large text]/\0/g;  # Abbreviate
            $name =~ s/\W+/_/g;                 # non-word characters (spaces too) to _
            $name =~ s/_+/_/g;                  # collapse repeated _
            ($ARGV2 = $ARGV) =~ s/\.txt$/.pdf/; # Keep ARGV... still testing
            $AI++;                              # counter keeps similar names apart
            print STDOUT "cp $ARGV2 ${name}_F$AI.pdf\n\n";
            print STDOUT "mv $ARGV\t\ttxt/$ARGV\n";
            print STDOUT "mv $ARGV2\t\tpdf/$ARGV2\n\n";
            close ARGV;                         # back to top
        }
    }
}
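
The 4th to 6th stones (spaces, non-word characters, similar names) can also be sketched in plain shell. The helper names `clean_name` and `target_name` are made up for illustration, and the character class below is ASCII only, which is an assumption about the text in the files.

```shell
# Hypothetical helpers, not part of the Perl script.

# clean_name TEXT: turn each run of non-word characters into one "_",
# collapse repeated "_" and trim a leading or trailing "_".
clean_name() {
    printf '%s\n' "$1" |
        sed -e 's/[^A-Za-z0-9_][^A-Za-z0-9_]*/_/g' \
            -e 's/__*/_/g' \
            -e 's/^_//' -e 's/_$//'
}

# target_name TEXT COUNTER: cleaned name plus a counter suffix, so two
# files with similar text cannot end up with the same name.
target_name() {
    printf '%s_F%s.pdf\n' "$(clean_name "$1")" "$2"
}
```

The counter has to be incremented in the calling loop (for example `n=$((n + 1))` before `target_name "$text" "$n"`), because anything changed inside `$(...)` is lost when the subshell exits.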

ghostdog74 · 06-30-2008, 10:10 PM

shell

Code:

for PDF in *.pdf
do
 txt="${PDF%.pdf}.txt"
 if [ -e "${txt}" ]; then
   fivechar=$(awk 'NR==2{print substr($0,1,5)}' "${txt}")
   mv "$PDF" "${fivechar}.pdf"
   # add code to remove remaining txt files
 fi
done

Quote:

Feel free to provide the complete code, but if you do so I won't have to read on how to use those commands and what those options do.

you don't have to read on how to use them? that's not the way to learn.

chrism01 · 06-30-2008, 10:17 PM

Looks like you're going with Perl, so bookmark this: http://perldoc.perl.org/

lixbie · 07-01-2008, 08:24 AM

Thanks ghostdog74 & chrism01 for your replies.

Ghost: Is that bash shell? I don't clearly understand the purpose of the script, but I guess that it changes the uppercase PDF to lowercase pdf, right?

Chris: Thanks for the link, it is a very good one. One thing I haven't been able to find is how to break long lines in Perl without trouble. I would like to indent subroutines to the right, keep all the text on the screen and keep all my comments in one neat column, but when the following line is tabbed three times

$name =~ s/[some very very very very large text]/\0/g; # Abbreviate

the comment goes out of the screen to the right.

Is there a way to break this line in perl like this

$name =~ s/[some very very very
very large text]/\0/g; # Abbreviate

ghostdog74 · 07-01-2008, 11:22 AM

Quote:

Originally Posted by lixbie

Thanks ghostdog74 & chrism01 for your replies.

Ghost: Is that bash shell? I don't clearly understand the purpose of the script, but I guess that it changes the uppercase PDF to lowercase pdf, right?

yes, it's shell (bash), and no, it doesn't change uppercase PDF to lowercase. it goes through all your pdf files, checks for a corresponding .txt file of the same name, gets the first 5 chars of the second line of that text file and uses them to rename the pdf file. If you are interested, see my sig for a bash link
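
A slightly more defensive version of that loop might look like the sketch below. The function name `rename_pdfs` is made up, and the skip rules (empty name, target already exists) are assumptions about what is wanted, not something from the original script. It also assumes the five characters never contain a "/".

```shell
# Hypothetical wrapper: rename every .pdf in directory $1 after the first
# five characters of line 2 of the .txt file with the same base name.
rename_pdfs() {
    for pdf in "$1"/*.pdf; do
        txt="${pdf%.pdf}.txt"
        [ -e "$txt" ] || continue           # no matching .txt: leave the pdf alone
        name=$(awk 'NR==2{print substr($0,1,5); exit}' "$txt")
        [ -n "$name" ] || continue          # nothing usable on line 2
        target="$1/$name.pdf"
        if [ -e "$target" ]; then           # never overwrite an earlier rename
            echo "skip $pdf: $target exists" >&2
            continue
        fi
        mv -- "$pdf" "$target"
    done
}
```

Skipped files keep their old names, so a second pass with a different rule (a counter suffix, for instance) can pick them up.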

chrism01 · 07-01-2008, 08:12 PM

Well, you can store a string in a var, then interpolate into the regex. See the tutorial here: http://perldoc.perl.org/perlretut.html, specifically 'More on characters, strings, and character classes'.
To assign a long string in sections, use the '.' concat operator http://perldoc.perl.org/perlop.html#Constant-Folding

eg

Code:

$str= "asdfgh".
      "zxcvb";

# same as
$str="asdfghzxcvb";

PS always start your progs with

#!/usr/bin/perl -w
use strict;

Those 2 strictures ( -w = warnings & use strict) enforce proper coding eg declarations and warn you of dodgy/broken syntax.
To syntax test a perl prog use

>perl -wc prog.pl
which will do a test compile without running it.

lixbie · 07-02-2008, 09:56 AM

Thank you All for your help.

With your assistance and 40 hours of work I managed to produce the previously posted scripts. With them I could rename 24,000+ pdf files.

I didn't know that bash could also be used. Nevertheless it is still in my "to pursue" list.

Thanks Again
Lixbie