LinuxQuestions.org
Old 09-18-2010, 11:29 AM   #1
instag
LQ Newbie
 
Registered: Sep 2010
Distribution: Slackware
Posts: 15

Rep: Reputation: 0
BASH script optimization for testing large number of files


Hello, I want to move files from a $SOURCEDIR to a $DESTBASE/$DESTDIR. Under $DESTBASE there are many directories, and I need to test beforehand if a file from $SOURCEDIR already exists in any of them.
The following works:
Code:
cd "$SOURCEDIR"
for i in *; do
  # Walk the whole destination tree and look for the name as a fixed substring
  find "$DESTBASE" -type f | grep -q -F -- "$i"
  if [ $? != 0 ]; then
    mv "$i" "$DESTBASE"/"$DESTDIR"
  else
    echo "$i exists"
  fi
done
This is obviously extremely slow, and the real use case involves dozens of dirs and thousands of files. Creating a temporary "index" file for the find command (instead of running it every iteration) speeds it up a little, but it's still very clumsy. Input on how to do this more effectively would be much appreciated.
 
Old 09-18-2010, 11:39 AM   #2
catkin
LQ 5k Club
 
Registered: Dec 2008
Location: Tamil Nadu, India
Distribution: Servers: Debian Squeeze and Wheezy. Desktop: Slackware64 14.0. Netbook: Slackware 13.37
Posts: 8,557
Blog Entries: 28

Rep: Reputation: 1178
You could speed it up a little by using (not tested):
Code:
if [[ "$( find "$DESTBASE" -type f -name "$i" )" = '' ]]; then
    mv "$i" "$DESTBASE"/"$DESTDIR"
else
    echo "$i exists"
fi
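A possible refinement of the find test, offered as a sketch rather than a tested drop-in: with GNU find, -print -quit stops the search at the first match instead of walking the rest of the tree. The function name exists_under is made up for illustration, and -quit is a GNU extension, so this assumes GNU findutils.

```shell
#!/bin/bash
# exists_under BASE NAME: succeed if a regular file called NAME exists
# anywhere below BASE. Hypothetical helper; assumes GNU find for -quit.
exists_under() {
    local base=$1 name=$2
    # -print -quit prints the first match and stops walking the tree;
    # the test is true when that output is non-empty.
    [[ -n $(find "$base" -type f -name "$name" -print -quit 2>/dev/null) ]]
}
```

Inside the loop this would read: if ! exists_under "$DESTBASE" "$i"; then mv ...; fi. Note that -name treats its argument as a glob pattern, so file names containing *, ? or [ may not match themselves.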
 
1 members found this post helpful.
Old 09-18-2010, 11:40 AM   #3
druuna
LQ Veteran
 
Registered: Sep 2003
Posts: 10,532
Blog Entries: 7

Rep: Reputation: 2374
Hi,

Did you have a look at mv's -u option? That could eliminate the whole test block.
Quote from the man page:
Quote:
-u, --update
move only when the SOURCE file is newer than the destination
file or when the destination file is missing

I overlooked something, see post #2 for a correct answer.

Hope this helps.

Last edited by druuna; 09-18-2010 at 11:49 AM. Reason: Overlooked something important.
 
Old 09-18-2010, 11:43 AM   #4
catkin
LQ 5k Club
 
Registered: Dec 2008
Location: Tamil Nadu, India
Distribution: Servers: Debian Squeeze and Wheezy. Desktop: Slackware64 14.0. Netbook: Slackware 13.37
Posts: 8,557
Blog Entries: 28

Rep: Reputation: 1178
Quote:
Originally Posted by druuna
Hi,

Did you have a look at mv's -u option? That could eliminate the whole test block.
Quote from the man page:

Hope this helps.
That was my first take but the file may be in a subdir of $DESTBASE and the mv is into $DESTBASE itself.
 
Old 09-18-2010, 11:48 AM   #5
druuna
LQ Veteran
 
Registered: Sep 2003
Posts: 10,532
Blog Entries: 7

Rep: Reputation: 2374
Hi,
Quote:
Originally Posted by catkin
That was my first take but the file may be in a subdir of $DESTBASE and the mv is into $DESTBASE itself.
Had to think for a second about that, but you are correct!

@instag: disregard my answer, have a look at catkin's post #2.
 
Old 09-18-2010, 12:34 PM   #6
kurumi
Member
 
Registered: Apr 2010
Posts: 223

Rep: Reputation: 45
Code:
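#!/usr/bin/env ruby
# (-w dropped from the shebang: on Linux, env receives "ruby -w" as one argument)

require 'fileutils'
sdir = File.join("/path", "source")
ddir = File.join("/path", "dest")
# Base names of the regular files directly inside the destination directory
files_in_dest = Dir["#{ddir}/*"].select { |x| File.file?(x) }.map { |x| File.basename(x) }
Dir.chdir(sdir)
Dir["*"].each do |f|
  # Copy only regular files whose name is not already present in ddir
  if !files_in_dest.include?(f) and File.file?(f)
    FileUtils.cp(f, ddir)
  end
end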
 
Old 09-18-2010, 01:26 PM   #7
instag
LQ Newbie
 
Registered: Sep 2010
Distribution: Slackware
Posts: 15

Original Poster
Rep: Reputation: 0
@catkin: Thanks, that looks cleaner and has one less test.

@kurumi: That looks interesting, though I'm using the above as part of a larger bash script.
 
Old 09-19-2010, 12:16 AM   #8
konsolebox
Senior Member
 
Registered: Oct 2005
Distribution: Gentoo, Slackware, LFS
Posts: 2,245
Blog Entries: 15

Rep: Reputation: 233
Do you mean that you have many destination directories? If the file exists in none of those directories, will you move it only to the first one?
 
Old 09-19-2010, 09:28 PM   #9
grail
Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 7,627

Rep: Reputation: 1947
OK ... I am a little confused (sometimes easily done). If I assume that the demo code is at least loosely based on the real code, then are we not simply testing to see if the file to be moved does or does not exist in the path $DESTBASE/$DESTDIR?
The find seems to imply that the file could be anywhere, including our destination, is this correct?

Anyhoo, I would make a slight change to catkin's script, since bash's if already tests the exit status of a command:
Code:
# find exits 0 even when nothing matches, so test its output instead
if ! find "$DESTBASE" -type f -name "$i" 2>/dev/null | grep -q .; then
    mv "$i" "$DESTBASE/$DESTDIR"
else
    echo "$i exists"
fi
Or, of course, if my assumption is correct that the file only needs to be absent from where we are moving it to:
Code:
if [[ ! -f "$DESTBASE/$DESTDIR/$i" ]]
then
    mv "$i" "$DESTBASE/$DESTDIR"
else
    echo "$i exists"
fi
 
1 members found this post helpful.
Old 09-20-2010, 09:03 AM   #10
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian
Posts: 2,517

Rep: Reputation: 856
If the locate database has been updated you can do:
Code:
# locate exits 0 when it finds a match, so move only when it fails
if ! locate --limit 1 "$DESTBASE/*/$i" >/dev/null ; then
    mv "$i" "$DESTBASE"/"$DESTDIR"
else
    echo "$i exists"
fi
Since locate uses absolute paths you might need:
Code:
DESTBASE="$(readlink --canonicalize "$DESTBASE")"

Last edited by ntubski; 09-20-2010 at 09:04 AM. Reason: forgot >/dev/null
 
Old 09-20-2010, 10:34 AM   #11
H_TeXMeX_H
Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Rep: Reputation: 1269
Using locate is probably the best option. However, if you don't have it or don't want to use it, just use find to list all the files under $DESTBASE and output that to a file, then use grep to look up each file as you go.

Code:
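# Build the list once, before the loop
find "$DESTBASE" -type f > list
# Inside the loop: look up each file name ($i) as a fixed string
grep -q -F -- "$i" list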
Technically you could do it even faster if you find all files in source and destination, clip them as needed, then compare them and find the files that should not be moved ... I guess I could write this up, but I'll leave it to you.
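One way the "list both sides, then compare" idea could be written up, as a rough sketch and not necessarily what was meant above: list the base names on each side once, sort them, and let comm print the names that occur only in the source. The function name new_files is hypothetical, -printf and -maxdepth assume GNU find, and file names are assumed not to contain newlines.

```shell
#!/bin/bash
# new_files SRC BASE: print the names of regular files in SRC (top level
# only) for which no file of the same name exists anywhere below BASE.
# Hypothetical helper; assumes GNU find and no newlines in file names.
new_files() {
    local src=$1 base=$2
    # comm -23 keeps the lines unique to the first input. Both inputs must
    # be sorted the same way, hence LC_ALL=C for sort and comm alike.
    LC_ALL=C comm -23 \
        <(find "$src" -maxdepth 1 -type f -printf '%f\n' | LC_ALL=C sort -u) \
        <(find "$base" -type f -printf '%f\n' | LC_ALL=C sort -u)
}
```

The output could then be fed to a while IFS= read -r f loop that moves "$SOURCEDIR/$f" to "$DESTBASE/$DESTDIR". This compares whole file names only, so it would need adapting if the match has to be on a substring.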
 
Old 09-20-2010, 01:13 PM   #12
instag
LQ Newbie
 
Registered: Sep 2010
Distribution: Slackware
Posts: 15

Original Poster
Rep: Reputation: 0
Hey, thanks for the answers.
Quote:
The find seems to imply that the file could be anywhere, including our destination, is this correct?
That's exactly correct. I see that it would have been better to describe the actual use case instead of abstracting the problem. What I need this for is copying photos from a camera to the user's collection on the hard drive, while avoiding copying those which were copied previously (i.e. the user didn't delete the old photos in the camera).
Keeping the original variables as posted above, basically the script does the following:
  1. Copies all files from the camera to $SOURCEDIR
  2. Tags all the file names with the date and the camera's numbering, so that they are unique (I use jhead (http://www.sentex.net/~mwandel/jhead/) for this).
  3. Checks which photos exist already under the user's $DESTBASE. That dir could be "/home/users/joe/MyPhotos", under which there are dirs like "My Dog", "The Kids", "Travels", "Incoming", etc.
  4. That last "Incoming" dir is actually $DESTDIR, where the new photos get copied to, and from where the user distributes them to his other dirs.
  5. Photos identified as already there get deleted from $SOURCEDIR (but not the camera memory, that's the user's task)
As long as the identifying tag stays in the photo name, the above works (the real script checks against the substring).

Quote:
output that to a file, then use grep to find each file
That's exactly what I'm doing.

Quote:
Technically you could do it even faster
Errm, that's what I'm looking for. Isn't there some cool, elegant way with arrays? But yes, a second temporary file for the positive hits is the way to go I guess.
BTW, I thought about an index with md5 hashes, which would even keep it independent of file names, but quickly discarded the idea as overkill.
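On the question of arrays: a minimal sketch with a bash associative array, assuming bash 4 or later and an exact match on the file name (not the substring match the real script uses). The function name move_new and all variable names are made up for illustration.

```shell
#!/bin/bash
# Requires bash 4 or later for associative arrays (local -A / declare -A).
# move_new SRC BASE DEST: move each regular file in SRC to DEST unless a
# file of the same name already exists somewhere below BASE.
# Function and variable names are hypothetical.
move_new() {
    local src=$1 base=$2 dest=$3 path name
    local -A seen=()
    # One pass over the destination tree: record every base name as a key.
    # -print0 with read -d '' copes with spaces and newlines in names.
    while IFS= read -r -d '' path; do
        seen["${path##*/}"]=1
    done < <(find "$base" -type f -print0)
    for path in "$src"/*; do
        [[ -f $path ]] || continue
        name=${path##*/}
        # Each lookup is a hash access, so find runs once, not per file.
        if [[ -n ${seen[$name]-} ]]; then
            echo "$name exists"
        else
            mv -- "$path" "$dest"
        fi
    done
}
```

It would be called as move_new "$SOURCEDIR" "$DESTBASE" "$DESTBASE/$DESTDIR". No temporary file is involved; the index lives in the array for the duration of the call.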
 
Old 09-20-2010, 03:33 PM   #13
konsolebox
Senior Member
 
Registered: Oct 2005
Distribution: Gentoo, Slackware, LFS
Posts: 2,245
Blog Entries: 15

Rep: Reputation: 233
So why not just do it as grail suggested?

Code:
cd "$SOURCEDIR"

for A in *; do
    if [[ ! -f $DESTBASE/$DESTDIR/$A ]]; then
        mv "$A" "$DESTBASE/$DESTDIR"
    else
        echo "$DESTBASE/$DESTDIR/$A already exists."
    fi
done
Are the files placed in multi-level directories, so that we have to use find?
 
Old 09-20-2010, 04:27 PM   #14
H_TeXMeX_H
Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Rep: Reputation: 1269
Quote:
Originally Posted by instag
That's exactly what I'm doing.
Are you sure, because by the look of it, find is being run for each file ... and find does take a while to run, especially for lots of files, I think that's the major slowdown. grep is much faster.
 
Old 09-20-2010, 05:58 PM   #15
instag
LQ Newbie
 
Registered: Sep 2010
Distribution: Slackware
Posts: 15

Original Poster
Rep: Reputation: 0
Quote:
Originally Posted by H_TeXMeX_H
Are you sure, because by the look of it, find is being run for each file
Yes, as I mentioned in the OP I may use a temporary file, and that's what I did in the meantime. It's just that I thought that this is kind of clumsy, and was looking for something more "integrated". But of course it does work OK that way.
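For something more "integrated" than a temporary file, one option might be to keep the find output in a shell variable and do the substring test with bash pattern matching. This is only a sketch: the function name move_unknown is hypothetical, -printf assumes GNU find, and unlike grep against the raw find output it matches against base names only, not full paths. I have not timed it against the temporary-file version.

```shell
#!/bin/bash
# move_unknown SRC BASE DEST: substring test in the spirit of grep -F from
# the first post, but against a shell variable instead of an index file.
# Hypothetical function name; assumes GNU find for -printf.
move_unknown() {
    local src=$1 base=$2 dest=$3 index path name
    # Run find once and keep one base name per line in memory.
    index=$(find "$base" -type f -printf '%f\n')
    for path in "$src"/*; do
        [[ -f $path ]] || continue
        name=${path##*/}
        # The quoted "$name" is matched literally, as a substring of the index.
        if [[ $index == *"$name"* ]]; then
            echo "$name exists"
        else
            mv -- "$path" "$dest"
        fi
    done
}
```

A call would look like move_unknown "$SOURCEDIR" "$DESTBASE" "$DESTBASE/$DESTDIR". To match on the identifying tag rather than the whole name, the tag would have to be extracted from $name first; how to do that depends on the naming scheme, which is not shown in this thread.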
 
  


All times are GMT -5.
