[SOLVED] A bash script to automatically copy all documents with directory structure
Hi all,
I am going away for a little while and I want to pull all the documents off my Debian box. Here is the problem:
- many document types: pdf ps doc rtf txt sxw odt
- must deal with nasty spaces in filenames: I do not want to change the filenames on the Debian box
- with all the documents copied into a single new directory, I want to know where each one came from
The good news is I have already done it, but quite poorly. I would like someone to comment on improvements to shorten these scripts and make them more elegant. There are 3 scripts:
1) Find all docs. Since there are a few extensions, I tried to use a loop, but the find command didn't like the way I passed in the regular expression and extensions, grrr.
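For reference, a minimal sketch of such a find command (a reconstruction, not the original script; the alldocs output file is what the scripts below read):
Code:
#!/bin/bash
# sketch: match each extension explicitly; -iname ignores case,
# and the trailing backslashes continue the command across lines
find . -type f \( -iname '*.pdf' -o -iname '*.ps'  -o -iname '*.doc' \
                -o -iname '*.rtf' -o -iname '*.txt' -o -iname '*.sxw' \
                -o -iname '*.odt' \) > alldocs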
2) Create the new name structure. I want the directory path to be part of the filename, with each '/' turned into a '-'; I use '-' rather than '_' because I use underscores in filenames a lot.
Code:
#!/bin/bash
# read the file list; turn spaces into underscores so that word
# splitting in the for loop keeps each filename in one piece
DAT=`cat alldocs | sed -e 's/ /_/g'`
for i in $DAT
do
    j=$i
    # strip the leading "./" that find puts on every path
    if [ `echo $i | grep "\./"` ]
    then
        j=`echo $i | sed -e 's/\.\///g'`
    fi
    k=$j
    # turn the remaining path separators into dashes
    if [ `echo $j | grep "/"` ]
    then
        k=`echo $j | sed -e 's/\//-/g'`
    fi
    echo $k
done
3) Copy the documents under their new names. I load the original filenames into one array and the new names into another, and the copy script walks both arrays in parallel.
Code:
#!/bin/bash
# use '_1_' as a stand-in for spaces so word splitting keeps
# each filename in one piece until the cp command runs
DAT=`cat alldocs | sed -e 's/ /_1_/g'`
DATN=`cat alldocs_names`
TDIR="/mnt/scratch/alldocs/"

# load the original names into one array ...
X=0
for i in $DAT
do
    AD[$X]=$i
    let X=X+1
done

# ... and the new names into another
X=0
for i in $DATN
do
    ADN[$X]=$i
    let X=X+1
done

# walk both arrays in parallel; valid indices are 0 .. X-1,
# so the loop condition must be -lt, not -le
Y=0
while [ $Y -lt $X ]
do
    i=${AD[$Y]}
    # translate the '_1_' stand-in back to a real space
    j=`echo $i | sed 's/_1_/ /g'`
    # quoting "$j" lets cp handle the names that contain spaces
    cp -v -f "$j" "$TDIR${ADN[$Y]}"
    let Y=Y+1
done
So that's it. I have over 2500 documents, which is pretty much why everything is sequenced like this. I'm not being lazy and asking people to fix this for me, but I have little bash experience as I am mostly a C guy.
I hope to use this script for other extensions too, like music, video, images, web pages (harder), etc.
I think it would be better to organize the files according to type. There is no reason to keep ps, rtf, doc and pdf files separate. It does make sense to put media files in a different directory.
You could also add an -exec clause to copy the file as well as produce the list.
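Perhaps something along these lines (a sketch only; just two of your extensions are shown, and the target directory is borrowed from your script):
Code:
# write the list with -fprint and copy each match in the same pass
find . -type f \( -iname '*.pdf' -o -iname '*.odt' \) \
    -fprint alldocs -exec cp -v {} /mnt/scratch/alldocs/ \;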
If the number of documents isn't too large, using arrays is OK. But the program will fail if the list gets too large.
Consider writing a function to normalize the filename. Then call it like
mv "$filename" "$(normalize "$filename")"
If you want to keep your alldocs file as a record, consider adding the date to the filename.
Yeah, this is a nice solution, but I didn't want the actual directories to come along with it; rather, I want to rename each file with the directory path as a prefix.
Use find to build up your list of files to include and exclude; then you can review it by eye first.
Something like tar -cvfX tar_file.tar exclude_file directory
It's in the man page.
I checked the man page, and I am not sure how to use it, since I want an include option rather than exclude... but as above, I didn't want the original directories.
Can you comment on how you would use it, though?
find directory -type f \( -name '*.tex' -o -name '*.pdf' \) > include_list
tar -cvf tar_file.tar -T include_list
Hey, I was looking for exactly such a thing. But I still want to keep the extension definitions outside the command. I was thinking about building up the argument for the find command, but I couldn't get it going; it's that bloody escape character:
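For reference, one pattern that does work: collect the find arguments, escaped parentheses included, in a bash array so each word survives intact (a sketch; the variable names are made up):
Code:
#!/bin/bash
# sketch: the extension list lives in one variable; the find
# arguments are collected in an array so each word stays whole
EXTS="pdf ps doc rtf txt sxw odt"
ARGS=( \( )
for e in $EXTS
do
    ARGS+=( -iname "*.$e" -o )
done
# swap the trailing -o for the closing parenthesis
ARGS[${#ARGS[@]}-1]=\)
find . -type f "${ARGS[@]}" > alldocs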
If I run your find command from the command line it works fine, but inside a script it doesn't work. Where can I find out about -fprint? Is that for find? For bash? I found no man pages for it.
Quote:
Originally Posted by jschiwal
I think it would be better to organize the files according to type. There is no reason to keep ps, rtf, doc and pdf files separate. It does make sense to put media files in a different directory.
You could also add an -exec clause to copy the file as well as produce the list.
Oh well, I do intend to pass the target dir in as an argument :P
Quote:
Originally Posted by jschiwal
If the number of documents isn't too large, using arrays is OK. But the program will fail if the list gets too large.
Consider writing a function to normalize the filename. Then call it like
mv "$filename" "$(normalize "$filename")"
If you want to keep your alldocs file as a record, consider adding the date to the filename.
Good Luck!
Yes, unfortunately my nasty solution isn't so great. I might do an on-the-fly conversion like you mention: traverse the file list, create the name, and copy. As for the timestamp, it doesn't bother me so much, but I might use cp --preserve=timestamps
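Something like this is what I have in mind (a sketch reusing the paths from above; read -r keeps spaces intact without any placeholder tricks):
Code:
#!/bin/bash
# sketch: read alldocs line by line, build the new name,
# and copy, all in a single pass with no temporary arrays
TDIR="/mnt/scratch/alldocs/"
while IFS= read -r f
do
    n=`echo "$f" | sed -e 's/^\.\///' -e 's/\//-/g' -e 's/ /_/g'`
    cp -v -f --preserve=timestamps "$f" "$TDIR$n"
done < alldocs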
Unless your directory names carry actual information about the files, such as being organized by author or topic, adding them to the filenames seems silly to me. They might carry real information with things like web archives, whose files have unhelpful names like page1.html or 12321.html.
If that were the case, I would use a single find command to provide the file list to a tar command and back up these filetypes with their directory structure intact.
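For example (a sketch; GNU tar's -T - reads the list of names from standard input):
Code:
# let find feed tar directly, keeping the directory layout
find . -type f \( -iname '*.pdf' -o -iname '*.odt' \) -print \
    | tar -cvf docs.tar -T -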
Your response came in as I was typing my last post, so some of this may seem repetitive.
Yes, my example was typed in the interactive shell. For a script, add a "\" at the end of each line that is being split (as in the find command example).
That makes the shell ignore the newline character. The newlines were there simply to keep the lines from getting too long. Plus, I find that for repetitive lines, lining them up vertically with extra spaces makes them resemble a table, which makes most typos stand out instantly.