LinuxQuestions.org

instag 09-18-2010 10:29 AM

BASH script optimization for testing large number of files
 
Hello, I want to move files from a $SOURCEDIR to a $DESTBASE/$DESTDIR. Under $DESTBASE there are many directories, and I need to test beforehand if a file from $SOURCEDIR already exists in any of them.
The following works:
Code:

cd "$SOURCEDIR" || exit
for i in *; do
  find "$DESTBASE" -type f | grep -q -F -- "$i"
  if [ $? != 0 ]; then
    mv "$i" "$DESTBASE"/"$DESTDIR"
  else
    echo "$i exists"
  fi
done

This is obviously extremely slow, and the real use case involves dozens of dirs and thousands of files. Creating a temporary "index" file for the find command (instead of running it every iteration) speeds it up a little, but it's still very clumsy. Input on how to do this more effectively would be much appreciated.

catkin 09-18-2010 10:39 AM

You could speed it up a little by using (not tested)
Code:

if [[ "$( find "$DESTBASE" -type f -name "$i" )" = '' ]]; then
    mv "$i" "$DESTBASE"/"$DESTDIR"
else
    echo "$i exists"
fi


druuna 09-18-2010 10:40 AM

Hi,

Did you have a look at mv's -u option? That could eliminate the whole test block.
Quote from the man page:
Quote:

-u, --update
move only when the SOURCE file is newer than the destination
file or when the destination file is missing

I overlooked something, see post #2 for a correct answer.

Hope this helps.

catkin 09-18-2010 10:43 AM

Quote:

Originally Posted by druuna (Post 4101806)
Hi,

Did you have a look at mv's -u option? That could eliminate the whole test block.
Quote from the man page:

Hope this helps.

That was my first take but the file may be in a subdir of $DESTBASE and the mv is into $DESTBASE itself.

druuna 09-18-2010 10:48 AM

Hi,
Quote:

Originally Posted by catkin (Post 4101808)
That was my first take but the file may be in a subdir of $DESTBASE and the mv is into $DESTBASE itself.

Had to think for a second about that, but you are correct!

@instag: disregard my answer, have a look at catkin's post #2.

kurumi 09-18-2010 11:34 AM

Code:

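#!/usr/bin/env ruby

require 'fileutils'
sdir = File.join("/path", "source")
ddir = File.join("/path", "dest")
# Compare by base name: Dir[] returns full paths such as /path/dest/foo
files_in_dest = Dir["#{ddir}/*"].select { |x| File.file?(x) }.map { |x| File.basename(x) }
Dir.chdir(sdir)
Dir["*"].each do |f|
  # Copy only regular files not already present in ddir
  if !files_in_dest.include?(f) and File.file?(f)
    FileUtils.cp(f, ddir)
  end
end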


instag 09-18-2010 12:26 PM

@catkin: Thanks, that looks cleaner and has one less test.

@kurumi: That looks interesting, though I'm using the above as part of a larger bash script.

konsolebox 09-18-2010 11:16 PM

Do you mean that you have many destination directories? If the file does not exist in any of those directories, will you move it only to the first one?

grail 09-19-2010 08:28 PM

ok ... I am a little confused (sometimes easily done). If I assume that the demo code is at least loosely based on the real code, then are we not simply testing to see if the file to be moved does or does not exist in the path $DESTBASE/$DESTDIR?
The find seems to imply that the file could be anywhere, including our destination, is this correct?

Anyhoo, I would make a slight change to catkin's script, as "if" in bash is already a test in itself:
Code:

# find exits 0 whether or not it matches anything, so test its output with grep
if ! find "$DESTBASE" -type f -name "$i" 2>/dev/null | grep -q .; then
    mv "$i" "$DESTBASE/$DESTDIR"
else
    echo "$i exists"
fi

Or of course, if my assumption is correct that the file only needs to be absent from the directory we are moving it to:
Code:

if [[ ! -f "$DESTBASE/$DESTDIR/$i" ]]
then
    mv "$i" "$DESTBASE/$DESTDIR"
else
    echo "$i exists"
fi


ntubski 09-20-2010 08:03 AM

If the locate database has been updated you can do
Code:

if ! locate --limit 1 "$DESTBASE/*/$i" >/dev/null ; then
    mv "$i" "$DESTBASE"/"$DESTDIR"
else
    echo "$i exists"
fi

Since locate uses absolute paths you might need:
Code:

DESTBASE="$(readlink --canonicalize "$DESTBASE")"

H_TeXMeX_H 09-20-2010 09:34 AM

Using locate is probably the best option. However, if you don't have it or don't want to use it, just use find to list all the files under $DESTBASE and output that to a file, then use grep to look up each file as you go.

Code:

find "$DESTBASE" -type f > list
grep -q -F -- "$i" list

Technically you could do it even faster if you find all files in source and destination, clip them as needed, then compare them and find the files that should not be moved ... I guess I could write this up, but I'll leave it to you.
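A minimal sketch of the list-comparison idea above, not taken from the thread. It assumes GNU find (for -printf) and file names without newlines, and it compares exact base names rather than the substring tags the real script uses. The function name list_new_files and the demo paths are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the names of regular files in $1 (source dir)
# that do not occur as a regular file anywhere under $2 (destination base).
list_new_files() {
    local src="$1" destbase="$2" tmp
    tmp=$(mktemp -d) || return
    # One find per tree, reduced to sorted, unique base names.
    find "$src" -maxdepth 1 -type f -printf '%f\n' | LC_ALL=C sort -u > "$tmp/src"
    find "$destbase" -type f -printf '%f\n' | LC_ALL=C sort -u > "$tmp/dest"
    # comm -23 prints lines that appear only in the first sorted file.
    LC_ALL=C comm -23 "$tmp/src" "$tmp/dest"
    rm -r "$tmp"
}

# Demo in a throwaway directory (all paths are illustrative).
demo=$(mktemp -d)
mkdir -p "$demo/src" "$demo/base/Incoming" "$demo/base/My Dog"
touch "$demo/src/a.jpg" "$demo/src/b.jpg" "$demo/base/My Dog/a.jpg"

# Move only the files that are new to the collection.
list_new_files "$demo/src" "$demo/base" | while IFS= read -r f; do
    mv -- "$demo/src/$f" "$demo/base/Incoming/"
done
```

Both trees are walked once and the comparison is a single pass over two sorted lists, so nothing is re-run per file.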

instag 09-20-2010 12:13 PM

Hey, thanks for the answers.
Quote:

The find seems to imply that the file could be anywhere, including our destination, is this correct?
That's exactly correct. I see that it would have been better to describe the actual use case instead of abstracting the problem. What I need this for is copying photos from a camera to the user's collection on the hard drive, while avoiding copying those which were copied previously (i.e. the user didn't delete the old photos in the camera).
Keeping the original variables as posted above, basically the script does the following:
  1. Copies all files from the camera to $SOURCEDIR
  2. Tags all the file names with the date and the camera's numbering, so that they are unique (I use jhead (http://www.sentex.net/~mwandel/jhead/) for this).
  3. Checks which photos exist already under the user's $DESTBASE. That dir could be "/home/users/joe/MyPhotos", under which there are dirs like "My Dog", "The Kids", "Travels", "Incoming", etc.
  4. That last "Incoming" dir is actually $DESTDIR, where the new photos get copied to, and from where the user distributes them to his other dirs.
  5. Photos identified as already there get deleted from $SOURCEDIR (but not the camera memory, that's the user's task)
As long as the identifying tag stays in the photo name, the above works (the real script checks against the substring).
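Steps 3 and 5 above could be sketched roughly as follows. This is not the real script: the function name prune_known is invented, and the tag is assumed, purely for illustration, to be the file name without its extension. The match is a substring match against the full path, so it would also hit directory names, and the source dir must not live under the destination base.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prune_known SOURCEDIR DESTBASE
# Deletes files from SOURCEDIR whose tag already occurs in a path under DESTBASE.
prune_known() {
    local src="$1" destbase="$2" index file tag
    index=$(mktemp) || return
    find "$destbase" -type f > "$index"    # run find once, not per file
    for file in "$src"/*; do
        [ -f "$file" ] || continue
        tag=${file##*/}    # strip the directory part
        tag=${tag%.*}      # ASSUMPTION: tag = file name without extension
        if grep -q -F -- "$tag" "$index"; then
            echo "$tag already in collection"
            rm -- "$file"
        fi
    done
    rm -f "$index"
}

# Demo in a throwaway directory (all names are illustrative).
demo=$(mktemp -d)
mkdir -p "$demo/camera" "$demo/photos/My Dog" "$demo/photos/Incoming"
touch "$demo/camera/20100918_0001.jpg" "$demo/camera/20100918_0002.jpg"
touch "$demo/photos/My Dog/rex_20100918_0001.jpg"
pruned=$(prune_known "$demo/camera" "$demo/photos")
```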

Quote:

output that to a file, then use grep to find each file
That's exactly what I'm doing.

Quote:

Technically you could do it even faster
Errm, that's what I'm looking for. Isn't there some cool, elegant way with arrays? But yes, a second temporary file for the positive hits is the way to go I guess.
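On the question about arrays, one possible sketch uses a bash 4+ associative array as a set of base names, so find runs once and each lookup needs no external process. It assumes GNU find (for -printf '%f\0') and matches exact names, not substring tags. The function name move_new_files is made up for illustration.

```shell
#!/usr/bin/env bash
# Requires bash 4+ for associative arrays.

# Hypothetical helper: move_new_files SOURCEDIR DESTBASE DESTDIR
move_new_files() {
    local src="$1" destbase="$2" destdir="$3" name path
    local -A seen=()

    # One find pass: record every base name under destbase as a key.
    while IFS= read -r -d '' name; do
        seen[$name]=1
    done < <(find "$destbase" -type f -printf '%f\0')

    for path in "$src"/*; do
        [[ -f $path ]] || continue
        name=${path##*/}
        if [[ -n ${seen[$name]:-} ]]; then
            echo "$name exists"
        else
            mv -- "$path" "$destbase/$destdir/"
        fi
    done
}

# Demo in a throwaway directory (all names are illustrative).
demo=$(mktemp -d)
mkdir -p "$demo/src" "$demo/base/Incoming" "$demo/base/The Kids"
touch "$demo/src/img 1.jpg" "$demo/src/img 2.jpg" "$demo/base/The Kids/img 1.jpg"
result=$(move_new_files "$demo/src" "$demo/base" Incoming)
```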
BTW, I thought about an index of md5 hashes, which would make the check independent of file names, but quickly discarded the idea as overkill.
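For completeness, the hash index idea might look something like this. It is only a sketch: it assumes md5sum from GNU coreutils and file names without backslashes or newlines (md5sum prefixes such lines with a backslash, which would break the cut). The function name list_new_by_hash is invented.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print files in $1 whose content (by MD5) is not
# found in any regular file under $2.
list_new_by_hash() {
    local src="$1" destbase="$2" index sum file
    index=$(mktemp) || return
    # Build the index once: one 32-character hex digest per line.
    find "$destbase" -type f -exec md5sum {} + | cut -c1-32 | sort -u > "$index"
    for file in "$src"/*; do
        [ -f "$file" ] || continue
        sum=$(md5sum < "$file" | cut -c1-32)
        # -x -F: the digest must match a whole index line literally.
        grep -q -x -F -- "$sum" "$index" || printf '%s\n' "$file"
    done
    rm -f "$index"
}

# Demo in a throwaway directory: same content under a different name.
demo=$(mktemp -d)
mkdir -p "$demo/src" "$demo/base/Travels"
printf 'photo-one' > "$demo/src/new_name.jpg"
printf 'photo-two' > "$demo/src/other.jpg"
printf 'photo-one' > "$demo/base/Travels/renamed.jpg"
new_files=$(list_new_by_hash "$demo/src" "$demo/base")
```

Hashing every file in a large collection on each run is expensive, which supports the "overkill" verdict unless the index is cached between runs.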

konsolebox 09-20-2010 02:33 PM

So why not just do it as grail suggested?

Code:

cd "$SOURCEDIR"

for A in *; do
    if [[ ! -f $DESTBASE/$DESTDIR/$A ]]; then
        mv "$A" "$DESTBASE/$DESTDIR"
    else
        echo "$DESTBASE/$DESTDIR/$A already exists."
    fi
done

Are the files placed in multi-level directories, so that we have to use find?

H_TeXMeX_H 09-20-2010 03:27 PM

Quote:

Originally Posted by instag (Post 4103532)
That's exactly what I'm doing.

Are you sure, because by the look of it, find is being run for each file ... and find does take a while to run, especially for lots of files, I think that's the major slowdown. grep is much faster.

instag 09-20-2010 04:58 PM

Quote:

Originally Posted by H_TeXMeX_H (Post 4103728)
Are you sure, because by the look of it, find is being run for each file

Yes, as I mentioned in the OP I may use a temporary file, and that's what I did in the meantime. It's just that I thought that this is kind of clumsy, and was looking for something more "integrated". But of course it does work OK that way.


All times are GMT -5.