LinuxQuestions.org
Old 11-29-2007, 01:51 AM   #1
thelordmule
LQ Newbie
 
Registered: Jul 2006
Location: Australia
Distribution: Mac OSX 10.6, Ubuntu 10.10
Posts: 23

Rep: Reputation: 0
A bash script to automatically copy all documents with directory structure


Hi all,

I am going away for a little while and I wanted to get all the documents off my Debian box. Here is the problem:

- many document types: pdf ps doc rtf txt sxw odt
- must deal with nasty spaces in filenames: I do not want to change these on the Debian box
- with all the documents copied into a single new directory, I want to know where each one came from

The good news is I have already done it, but quite poorly. I would like someone to suggest improvements to shorten this script and make it more elegant. There are 3 scripts:

1) find all docs. Since there are a few extensions, I tried to use a loop, but the find command didn't like the way I passed in the pattern and extensions, grrr.

Code:
#!/bin/bash

EXTENSIONS="pdf ps doc rtf txt sxw odt tex"

echo "" > alldocs

# gah this did not work 
for i in $EXTENSIONS
do
	echo '*."$i"'
	EX="\'*.pls\'"
#	find -name '*.pls'
#	find -name $EX
#	find -name "'*.pls'"
#	find -name '*."$i"'# > "s_the"$i
#	cat "s_the"$i >> s_alldocs
done

# yuk!
echo "pdf"
find -name '*.pdf' > z_thepdf
echo "ps"
find -name '*.ps' > z_theps
echo "doc"
find -name '*.doc' > z_thedoc
echo "rtf"
find -name '*.rtf' > z_thertf
echo "txt"
find -name '*.txt' > z_thetxt
echo "sxw"
find -name '*.sxw' > z_thesxw
echo "odt"
find -name '*.odt' > z_theodt
echo "tex"
find -name '*.tex' > z_thetex

cat z_the* > alldocs

#call another script
./new_name_list > alldocs_names
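For the record, the loop can be made to work: single quotes stored inside a string variable are never re-parsed by the shell, so the trick is to build find's arguments in a bash array instead. A minimal sketch (GNU find assumed, results written to a single alldocs list):

```shell
#!/bin/bash
# Sketch: build the -name tests in an array, so each pattern is passed
# to find as one word without the shell re-parsing embedded quotes.
EXTENSIONS="pdf ps doc rtf txt sxw odt tex"

ARGS=()
for ext in $EXTENSIONS; do
    # chain the tests with -o, skipping the leading -o on the first one
    if [ ${#ARGS[@]} -gt 0 ]; then
        ARGS+=( -o )
    fi
    ARGS+=( -name "*.$ext" )
done

# one find call instead of eight; quoting "*.$ext" keeps the glob literal
find . -type f \( "${ARGS[@]}" \) > alldocs
```

This replaces both the failed loop and the eight hand-written find calls below it.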
2) create the new name structure. I want the directory path to be part of the filename, with each '/' turned into a '-'; I use '-' because I already use underscores in filenames a lot.

Code:
#!/bin/bash

DAT=`cat alldocs | sed -e 's/ /_/g'`

for i in $DAT
do
	j=$i
	if [ `echo $i | grep "./"` ]
	then
		j=`echo $i | sed -e 's/\.\///g'`
	fi

	k=$j
	if [ `echo $j | grep "/"` ]
	then
		k=`echo $j | sed -e 's/\//-/g'`
	fi
	echo $k

done
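The three separate substitutions can be collapsed into one pass. A sketch of a normalize function applying the same rules (strip a leading ./, '/' to '-', space to '_') in a single sed call:

```shell
#!/bin/bash
# Sketch: one function that does all three rewrites in a single sed call:
# strip a leading ./, turn each / into -, and each space into _.
normalize () {
    printf '%s\n' "$1" | sed -e 's|^\./||' -e 's|/|-|g' -e 's| |_|g'
}

# read alldocs line by line so paths with spaces stay intact
while IFS= read -r path; do
    normalize "$path"
done < alldocs > alldocs_names
```

For example, `normalize './thesis drafts/chapter 1.tex'` prints `thesis_drafts-chapter_1.tex`.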
3) copy the documents under their new names. I load the original filenames into one array and the new names into another; the copy script walks both arrays in parallel.

Code:
#!/bin/bash

# use '_1_' to substitute actual space, need this for the cp command to work
DAT=`cat alldocs | sed -e 's/ /_1_/g'`
DATN=`cat alldocs_names`

TDIR="/mnt/scratch/alldocs/"

X=0
for i in $DAT
do
	AD[$X]=$i
	let X=X+1
done

X=0
for i in $DATN
do
	ADN[$X]=$i
	let X=X+1
done

Y=0
while [ $Y -le $X ]
do

#	echo "AD["$Y"]:  " ${AD[$Y]}
#	echo "ADN["$Y"]: " ${ADN[$Y]}

	i=${AD[$Y]}
	j=$i
	# look for '_1_' translate back to '\ '
	if [ `echo $i | grep "_1_"` ]
	then
		j=`echo $i | sed 's/_1_/\ /g'`
#		echo "found space:: " $j
		cp -v -f "$j" $TDIR""${ADN[$Y]}
	else
		cp -v -f $j $TDIR""${ADN[$Y]}
	fi

	let Y=Y+1
done
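The '_1_' placeholder (and the two arrays) can be avoided entirely by reading the original list and the renamed list in lockstep. A sketch, assuming alldocs and alldocs_names line up one-to-one:

```shell
#!/bin/bash
# Sketch: read both lists in parallel on separate file descriptors;
# quoting "$src" means spaces in filenames need no placeholder at all.
TDIR="/mnt/scratch/alldocs"

exec 3< alldocs 4< alldocs_names
while IFS= read -r src <&3 && IFS= read -r dst <&4; do
    cp -v -f "$src" "$TDIR/$dst"
done
exec 3<&- 4<&-
```

Because the lists are never word-split, this also sidesteps the "too many documents" problem with large arrays.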
So that's it. I have over 2500 documents, which is pretty much why everything is sequenced like this. I'm not being lazy and asking people to fix this for me; I just have little bash experience, as I am mostly a C guy.

I hope to use this script for other extensions, like for music, video, images, web pages (harder) etc.

cheers
 
Old 11-29-2007, 02:18 AM   #2
matthewg42
Senior Member
 
Registered: Oct 2003
Location: UK
Distribution: Kubuntu 12.10 (using awesome wm though)
Posts: 3,530

Rep: Reputation: 63
You can do it all in a single rsync command:
Code:
rsync -avz --include '*/' \
    --include '*.pdf' \
    --include '*.ps' \
    --include '*.doc' \
    --include '*.rtf' \
    --include '*.txt' \
    --include '*.sxw' \
    --include '*.odt' \
    --include '*.tex' \
    --exclude '*' \
    src/ dest/
 
Old 11-29-2007, 03:01 AM   #3
bigearsbilly
Senior Member
 
Registered: Mar 2004
Location: england
Distribution: FreeBSD, Debian, Mint, Puppy
Posts: 3,287

Rep: Reputation: 173Reputation: 173
or you can use tar with eXclude or include files.

use find to build up your list of files to include or exclude; then you can review it by eye first.

something like tar -cvfX tar_file.tar exclude_file directory
it's in the man page.
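A sketch of that find-then-review-then-tar workflow, using GNU tar's -T option to read the include list from a file (filenames containing newlines aside):

```shell
#!/bin/bash
# Sketch: build the include list with find, eyeball it, then hand it to
# GNU tar via -T so only the listed files end up in the archive.
find . -type f \( -name '*.pdf' -o -name '*.tex' \) > include_list

# review include_list by eye here, e.g. with: less include_list

tar -cvf docs.tar -T include_list
```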
 
Old 11-29-2007, 03:32 AM   #4
jschiwal
Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 654Reputation: 654Reputation: 654Reputation: 654Reputation: 654Reputation: 654
Code:
`cat alldocs | sed -e 's/ /_/g'`
Here is something I see a lot on this site. There is no reason for the cat command.
Code:
`sed -e 's/ /_/g' alldocs`
You can use the -fprint action in the find command. It is also more efficient to use one find command instead of repeating the search:
Code:
find  \( -name '*.pdf' -fprint pdf_files \) -o
      \( -name '*.ps'  -fprint ps_files  \) -o
      \( -name '*.rtf' -fprint rtf_files \) -o
...
      \( -name '*.tex' -fprint tex_files \)
I think it would be better to organize the files according to type. There is no reason to keep ps, rtf, doc and pdf files separate. It does make sense to put media files in a different directory.
You could also add an -exec clause to copy the file as well as produce the list.

If the number of documents isn't too large, using arrays is OK. But the program will fail if the list gets too large.

Consider writing a function to normalize the filename. Then call it like
mv "$filename" "$(normalize "$filename")"

If you want to keep your alldocs file as a record, consider adding the date to the filename.

Good Luck!

Last edited by jschiwal; 11-29-2007 at 03:34 AM.
 
Old 11-30-2007, 01:13 AM   #5
thelordmule
LQ Newbie
 
Registered: Jul 2006
Location: Australia
Distribution: Mac OSX 10.6, Ubuntu 10.10
Posts: 23

Original Poster
Rep: Reputation: 0
Quote:
Originally Posted by matthewg42 View Post
You can do it all in a single rsync command:
Code:
rsync -avz --include '*/' \
    --include '*.pdf' \
    --include '*.ps' \
    --include '*.doc' \
    --include '*.rtf' \
    --include '*.txt' \
    --include '*.sxw' \
    --include '*.odt' \
    --include '*.tex' \
    --exclude '*' \
    src/ dest/
yeah this is a nice solution, but I didn't want the actual directory tree that comes with it; rather, I just want to rename each file with its directory path as a prefix.

very nice though
 
Old 11-30-2007, 01:19 AM   #6
thelordmule
LQ Newbie
 
Registered: Jul 2006
Location: Australia
Distribution: Mac OSX 10.6, Ubuntu 10.10
Posts: 23

Original Poster
Rep: Reputation: 0
Quote:
Originally Posted by bigearsbilly View Post
or you can use tar with eXclude or Include files.

use find to build up your list of files to include and exclude, then you can review it first by eye.



something like tar -cvfX tar_file.tar exclude_file directory
it's in the man page.
I checked the man page, and I am not sure how to use it, since I want an include option rather than exclude... but as above, I didn't want the original directories.

can you comment on how you would use it, though?

tar -cvfX tar_file.tar *.tex *.pdf --exclude=* directory

cheers
 
Old 11-30-2007, 02:02 AM   #7
thelordmule
LQ Newbie
 
Registered: Jul 2006
Location: Australia
Distribution: Mac OSX 10.6, Ubuntu 10.10
Posts: 23

Original Poster
Rep: Reputation: 0
Quote:
Originally Posted by jschiwal View Post
Code:
`cat alldocs | sed -e 's/ /_/g'`
Here is something I see a lot on this site. There is no reason for the cat command.
Code:
`sed -e 's/ /_/g' alldocs`
good point, I originally had `cat a b c d | sed ...`

Quote:
Originally Posted by jschiwal View Post
You can use the -fprint command in the find command. It is also more efficient to use one find command instead of repeating the search:
Code:
find  \( -name '*.pdf' -fprint pdf_files \) -o
      \( -name '*.ps'  -fprint ps_files  \) -o
      \( -name '*.rtf' -fprint rtf_files \) -o
...
      \( -name '*.tex' -fprint tex_files \)
hey, I was looking for exactly such a thing. But I still want to keep the extensions defined outside, so I was trying to build the argument string for the find command, but I couldn't get it going. It's that bloody escape character:
Code:
FARG=" -name '*.tex' -fprint tex_files "
find $FARG
echo "find "$FARG

output:
find  -name '*.tex' -fprint tex_files
if I run that command from the command line it works fine, but inside the script it doesn't. Where can I find out about -fprint? Is it part of find? bash? I found no man pages for it.
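The quoting problem above is that quotes inside $FARG are treated as literal characters rather than re-parsed, and the unquoted expansion then word-splits. A bash array sidesteps it; a minimal sketch (-fprint is a GNU find action, documented in find(1)):

```shell
#!/bin/bash
# Sketch: store the find arguments in an array; each element is passed
# through as one word, so the '*.tex' glob reaches find unexpanded.
FARG=( -name '*.tex' -fprint tex_files )
find . "${FARG[@]}"
```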

Quote:
Originally Posted by jschiwal View Post
I think it would be better to organize the files according to type. There is no reason to keep ps, rtf, doc and pdf files separate. It does make sense to put media files in a different directory.
You could also add an -exec clause to copy the file as well as produce the list.
oh well, I do intend to pass an argument for the target dir :P

Quote:
Originally Posted by jschiwal View Post
If the number of documents isn't too large, using arrays is OK. But the program will fail if the list gets too large.

Consider writing a function to normalize the filename. Then call it like
mv "$filename" "$(normalize "$filename")"

If you want to keep your alldocs file as a record, consider adding the date to the filename.

Good Luck!
yes, unfortunately my nasty solution isn't so great; I might do an on-the-fly conversion like you mention: traverse the file list, create the name, and copy. As for the timestamp, it doesn't bother me much, but I might use cp --preserve=timestamps

Thanks for the feedback
 
Old 11-30-2007, 02:23 AM   #8
jschiwal
Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 654Reputation: 654Reputation: 654Reputation: 654Reputation: 654Reputation: 654
Code:
find  \( -name '*.pdf' -fprint pdf_files \) -o
      \( -name '*.ps'  -fprint ps_files  \) -o
      \( -name '*.rtf' -fprint rtf_files \) -o
...
      \( -name '*.tex' -fprint tex_files \)

for file in $(cat *_files); do
    destfile="${file#./}"
    destfile="${destfile//\//-}"
    mv "${file}" "${destdir}/${destfile}"
done
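A space-safe variant of that move loop reads the list line by line instead of word-splitting $(cat ...). A sketch, with destdir as before:

```shell
#!/bin/bash
# Sketch: read each listed path whole, so embedded spaces survive;
# ${destfile//\//-} flattens the path with bash parameter expansion.
destdir=/mnt/scratch/alldocs

cat *_files | while IFS= read -r file; do
    destfile="${file#./}"
    mv "$file" "$destdir/${destfile//\//-}"
done
```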
Unless your directory names contain actual information about the files, like being organized by author or topic, adding them to the filenames seems silly to me. The exception might be things like web archives, which have unhelpful names like page1.html or 12321.html.

If this were the case, I would use a single find command to provide the arguments to a tar command and backup these filetypes including the directory structures.

For example:
Code:
destdir=/mnt/hpmedia/Documents
jschiwal@hpamd64:~/Documents> find ./ -type f \( -iname "*.pdf" -o -iname "*.ps" -o -iname "*.doc" \) -print0 | xargs -0 tar czf ${destdir}/backup.tar.gz
Here is a two-liner solution that creates a tarball backup of certain filetypes.
 
Old 11-30-2007, 02:30 AM   #9
jschiwal
Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 654Reputation: 654Reputation: 654Reputation: 654Reputation: 654Reputation: 654
Your response came in as I was typing my last post, so some of this may seem repetitive.

Yes, my example was typed in the interactive shell. For a script, add a "\" at the end of each line being split (as in the find command example). That causes the newline character to be ignored. The newlines were there simply to keep the lines from getting too long. Plus, I find that for repetitive lines, lining them up vertically with extra spaces makes them resemble a table, which makes most typos stand out instantly.

Last edited by jschiwal; 11-30-2007 at 02:33 AM.
 
  

