[SOLVED] A bash script to automatically copy all documents with directory structure
Hi all,
I am going away for a little while and I want to pull all the documents off my Debian box. Here is the problem:
- many document types: pdf ps doc rtf txt sxw odt
- must deal with nasty spaces in filenames: I do not want to change the filenames on the Debian box
- with all the documents copied into a single new directory, I want to know where each one came from
The good news is I have already done it, but quite poorly. I would like someone to comment on improvements to shorten these scripts and make them more elegant. There are 3 scripts:
1) Find all docs. Since there are a few extensions, I tried to use a loop, but the find command didn't like the way I passed in the regular expression and extensions, grrr.
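For reference, a minimal sketch of such a find command (a reconstruction, not the original script; the alldocs output file is what the scripts below read):
Code:
#!/bin/bash
# sketch: match each extension explicitly; -iname ignores case,
# and the trailing backslashes continue the command across lines
find . -type f \( -iname '*.pdf' -o -iname '*.ps'  -o -iname '*.doc' \
                -o -iname '*.rtf' -o -iname '*.txt' -o -iname '*.sxw' \
                -o -iname '*.odt' \) > alldocs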
2) Create the new name structure. I want the directory path to be part of the filename, with each '/' turned into a '-'; I use '-' rather than '_' because I use underscores in filenames a lot.
Code:
#!/bin/bash
# read the file list; turn spaces into underscores so that word
# splitting in the for loop keeps each filename in one piece
DAT=`cat alldocs | sed -e 's/ /_/g'`
for i in $DAT
do
    j=$i
    # strip the leading "./" that find puts on every path
    if [ `echo $i | grep "\./"` ]
    then
        j=`echo $i | sed -e 's/\.\///g'`
    fi
    k=$j
    # turn the remaining path separators into dashes
    if [ `echo $j | grep "/"` ]
    then
        k=`echo $j | sed -e 's/\//-/g'`
    fi
    echo $k
done
3) Copy the documents under their new names. I load the original filenames into one array and the new names into another, and the copy script walks both arrays in parallel.
Code:
#!/bin/bash
# use '_1_' as a stand-in for spaces so word splitting keeps
# each filename in one piece until the cp command runs
DAT=`cat alldocs | sed -e 's/ /_1_/g'`
DATN=`cat alldocs_names`
TDIR="/mnt/scratch/alldocs/"

# load the original names into one array ...
X=0
for i in $DAT
do
    AD[$X]=$i
    let X=X+1
done

# ... and the new names into another
X=0
for i in $DATN
do
    ADN[$X]=$i
    let X=X+1
done

# walk both arrays in parallel; valid indices are 0 .. X-1,
# so the loop condition must be -lt, not -le
Y=0
while [ $Y -lt $X ]
do
    i=${AD[$Y]}
    # translate the '_1_' stand-in back to a real space
    j=`echo $i | sed 's/_1_/ /g'`
    # quoting "$j" lets cp handle the names that contain spaces
    cp -v -f "$j" "$TDIR${ADN[$Y]}"
    let Y=Y+1
done
So that's it. I have over 2500 documents, which is pretty much why everything is sequenced like this. I'm not being lazy and asking people to fix this for me, but I have little bash experience as I am mostly a C guy.
I hope to use this script for other extensions too, like music, video, images, web pages (harder), etc.
I think it would be better to organize the files according to type. There is no reason to keep ps, rtf, doc and pdf files separate. It does make sense to put media files in a different directory.
You could also add an -exec clause to copy the file as well as produce the list.
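Perhaps something along these lines (a sketch only; just two of your extensions are shown, and the target directory is borrowed from your script):
Code:
# write the list with -fprint and copy each match in the same pass
find . -type f \( -iname '*.pdf' -o -iname '*.odt' \) \
    -fprint alldocs -exec cp -v {} /mnt/scratch/alldocs/ \;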
If the number of documents isn't too large, using arrays is OK. But the program will fail if the list gets too large.
Consider writing a function to normalize the filename. Then call it like
mv "$filename" "$(normalize "$filename")"
If you want to keep your alldocs file as a record, consider adding the date to the filename.
Yeah, this is a nice solution, but I didn't want the actual directories to come along with it; rather, I want to rename each file with the directory path as a prefix.
Use find to build up your list of files to include and exclude; then you can review it by eye first.
Something like tar -cvfX tar_file.tar exclude_file directory
It's in the man page.
I checked the man page, and I am not sure how to use it, since I want an include option rather than exclude... but as above, I didn't want the original directories.
Can you comment on how you would use it, though?
find directory -type f \( -name '*.tex' -o -name '*.pdf' \) > include_list
tar -cvf tar_file.tar -T include_list
Hey, I was looking for exactly such a thing. But I still want to keep the extension definitions outside the command. I was thinking about building up the argument for the find command, but I couldn't get it going; it's that bloody escape character:
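For reference, one pattern that does work: collect the find arguments, escaped parentheses included, in a bash array so each word survives intact (a sketch; the variable names are made up):
Code:
#!/bin/bash
# sketch: the extension list lives in one variable; the find
# arguments are collected in an array so each word stays whole
EXTS="pdf ps doc rtf txt sxw odt"
ARGS=( \( )
for e in $EXTS
do
    ARGS+=( -iname "*.$e" -o )
done
# swap the trailing -o for the closing parenthesis
ARGS[${#ARGS[@]}-1]=\)
find . -type f "${ARGS[@]}" > alldocs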
If I run your find command from the command line it works fine, but inside a script it doesn't work. Where can I find out about -fprint? Is that for find? For bash? I found no man pages for it.
Quote:
Originally Posted by jschiwal
I think it would be better to organize the files according to type. There is no reason to keep ps, rtf, doc and pdf files separate. It does make sense to put media files in a different directory.
You could also add an -exec clause to copy the file as well as produce the list.
Oh well, I do intend to pass the target dir in as an argument :P
Quote:
Originally Posted by jschiwal
If the number of documents isn't too large, using arrays is OK. But the program will fail if the list gets too large.
Consider writing a function to normalize the filename. Then call it like
mv "$filename" "$(normalize "$filename")"
If you want to keep your alldocs file as a record, consider adding the date to the filename.
Good Luck!
Yes, unfortunately my nasty solution isn't so great. I might do an on-the-fly conversion like you mention: traverse the file list, create the name, and copy. As for the timestamp, it doesn't bother me so much, but I might use cp --preserve=timestamps
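Something like this is what I have in mind (a sketch reusing the paths from above; read -r keeps spaces intact without any placeholder tricks):
Code:
#!/bin/bash
# sketch: read alldocs line by line, build the new name,
# and copy, all in a single pass with no temporary arrays
TDIR="/mnt/scratch/alldocs/"
while IFS= read -r f
do
    n=`echo "$f" | sed -e 's/^\.\///' -e 's/\//-/g' -e 's/ /_/g'`
    cp -v -f --preserve=timestamps "$f" "$TDIR$n"
done < alldocs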
Unless your directory names carry actual information about the files, such as being organized by author or topic, adding them to the filenames seems silly to me. They might carry real information with things like web archives, whose files have unhelpful names like page1.html or 12321.html.
If that were the case, I would use a single find command to provide the file list to a tar command and back up these filetypes with their directory structure intact.
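For example (a sketch; GNU tar's -T - reads the list of names from standard input):
Code:
# let find feed tar directly, keeping the directory layout
find . -type f \( -iname '*.pdf' -o -iname '*.odt' \) -print \
    | tar -cvf docs.tar -T -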
Your response came in as I was typing my last post, so some of this may seem repetitive.
Yes, my example was typed in the interactive shell. For a script, add a "\" at the end of each line that is being split (as in the find command example).
That makes the shell ignore the newline character. The newlines were there simply to keep the lines from getting too long. Plus, I find that for repetitive lines, lining them up vertically with extra spaces makes them resemble a table, which makes most typos stand out instantly.