Programming
This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.
I'm performing a test on some files in a directory. This test is run every minute and there are lots & lots of files in the directory, many more appearing as time goes on.
My script, VERY simplified, looks like this:
Code:
for f in $FILES
do
#Do stuff
done
The problem is that as this folder grows, ALL the files in the directory are processed each time, which is very wasteful of time and resources.
What I would like to do is get it to start from where it last left off.
How can I go about this goal?
Thanks for any help.
Last edited by Entropy1024; 09-04-2018 at 12:18 PM.
Do the file names follow some progression, such as an embedded date and time? If so you could base it on that naming progression. Otherwise you could base it on "find" output filtered by creation time or access time. Save that time from one run into a file and have the next run read that saved time as the basis for its search.
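A minimal sketch of that saved-time idea, assuming GNU find's -newermt and a state file name of my own choosing (lastrun.txt is hypothetical):
Code:
#!/bin/bash
# process only files modified since the time saved by the previous run
statefile=lastrun.txt
# on the very first run, fall back to the epoch so every file matches
[ -f "$statefile" ] || echo "1970-01-01 00:00:00" > "$statefile"
# note the current time *before* scanning, so no file falls into a gap
now=$(date "+%Y-%m-%d %H:%M:%S")
find . -maxdepth 1 -type f -name '*.jpg' -newermt "$(cat "$statefile")" -print0 |
    while IFS= read -r -d '' f
    do
        : # do stuff with "$f"
    done
echo "$now" > "$statefile"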
Yes the files all have date and time in them like this: DomeCCTV_216576543_20180904183710574_MOTION_DETECTION.jpg
I know I can find the last image using:
LASTIMAGE=$(ls | tail -1)
But I don't know how to tell the loop to start from that file.
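One minimal way to use that, assuming the embedded timestamps make the names sort in time order (lastfile.txt is a hypothetical state file), is a plain string comparison:
Code:
#!/bin/bash
# the names embed a timestamp, so string order is time order
last=$(cat lastfile.txt 2>/dev/null)    # empty on the very first run
for f in *.jpg
do
    [ -f "$f" ] || continue             # glob matched nothing
    # skip anything at or before the file we stopped at last time
    [[ "$f" > "$last" ]] || continue
    # do stuff with "$f"
    last=$f
done
printf '%s\n' "$last" > lastfile.txt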
Moving this to the Programming forum to give it better exposure.
Perhaps base it on the date or the last-modified flag.
But it depends on how the FILES list is constructed.
Remembering when the script last ran could be as simple as a specific file, visible or hidden, which you touch on each run to pinpoint the date and time you last processed the directory.
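For example, a minimal sketch of that marker-file idea (the .lastrun name and GNU touch -d are my own choices; -newer is standard find):
Code:
#!/bin/bash
# use a hidden marker file's mtime as "when did I last run?"
marker=.lastrun
[ -f "$marker" ] || touch -d "1970-01-01" "$marker"   # first run: everything counts as new
find . -maxdepth 1 -type f -name '*.jpg' -newer "$marker" -print0 |
    while IFS= read -r -d '' f
    do
        : # do stuff with "$f"
    done
touch "$marker"   # remember this run for the next one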
I have done something like that in bash, by storing a list of already-processed files and using diff to find the new files.
Example:
Code:
# list all files in directory which match the pattern
ls *.jpg > allfiles.txt
# create oldfiles.txt if it does not already exist
touch oldfiles.txt
# find files in allfiles.txt which are not in oldfiles.txt, create newfiles.txt
diff allfiles.txt oldfiles.txt | grep "<" | cut -d " " -f 2 > newfiles.txt
# process list
for item in $(cat newfiles.txt)
do
# do stuff
# append filename to oldfiles.txt
echo $item >> oldfiles.txt
done
# sort oldfiles.txt (necessary if files are created out of ls order)
sort oldfiles.txt > temp.txt
mv temp.txt oldfiles.txt
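As an aside, the diff/grep/cut step can also be done with comm(1), which expects both lists to be sorted; a sketch, not a drop-in replacement for the above:
Code:
# comm -23 prints lines that appear only in the first file; both files must be sorted
ls *.jpg | sort > allfiles.txt
sort -o oldfiles.txt oldfiles.txt
comm -23 allfiles.txt oldfiles.txt > newfiles.txt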
Since you are running the script every minute, you may wonder what happens if the run time is more than one minute. This could happen if there are many new files since the last script execution. You would end up with two or more simultaneously running scripts, which would duplicate the processing and leave duplicate entries in the list of old files. There is even a small chance it could garble the file lists. Not good!
The way I avoid this is by running a test at the beginning of the script:
Example:
Code:
# check and exit if task is already running
# extract script name (the bit after the last slash character)
thiscom=$(basename "$0")
# count processes with that name (the $( ) subshell usually shows up under the same name, hence the threshold of 2)
if [ "$(ps -C "$thiscom" -o comm= | wc -l)" -gt 2 ]
then
# exit due to duplicate process
exit
fi
# here start the file listing and processing...
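An alternative worth mentioning is flock(1) from util-linux, which hands the whole locking problem to the kernel; a minimal sketch (the lock file path is my own choice):
Code:
#!/bin/bash
# take an exclusive, non-blocking lock; bail out if another copy already holds it
exec 9>/tmp/myscript.lock || exit 1
flock -n 9 || exit 0    # another instance is running
# here start the file listing and processing...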
A shorter and safer variation on post #5 (it allows special characters in file names):
Code:
donefile=oldfiles.txt
for item in *.jpg
do
# skip non-files and files that are already done
if [ ! -f "$item" ] || grep -Fqx "$item" "$donefile"
then
echo "skipping $item"
continue
fi
# do stuff
# append filename to done files
echo "$item" >> $donefile
done
Thought I'd have a go at this with Python.
It monitors the directories you specify by checking each directory's modification time, then checking whether each file in that directory has already been processed.
If something has changed, it runs the command on the new file, so something like ./script file_foobar gets executed for each one.
You can run it in daemon mode with -d or --daemon, where it simply checks every second if something happened.
Code:
usage: monitor_directory.py [-h] [-d] [--temp-file TEMP_FILE] [--trim-cache]
[--include INCLUDE] [--exclude EXCLUDE]
command [directories [directories ...]]
Monitors directory for changes and runs command on new files
positional arguments:
command Run command or script on each file: ./script
file_foobar
directories
optional arguments:
-h, --help show this help message and exit
-d, --daemon
--temp-file TEMP_FILE
Location of cache file, default /tmp
--trim-cache Remove irrelevant directories from the cache
--include INCLUDE glob file matching, can be invoked multiple times
--exclude EXCLUDE glob file matching, can be invoked multiple times
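For instance, a hypothetical invocation (the script name, pattern, and path are made up) might look like:
Code:
# watch /srv/cctv as a daemon and run ./process.sh on each new .jpg
./monitor_directory.py -d --include '*.jpg' ./process.sh /srv/cctv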
monitor_directory.py
Code:
#!/usr/bin/env python3
import argparse
import pathlib
import json
import os
import subprocess
import tempfile
import time
def main():
parser = argparse.ArgumentParser(description="Monitors directory for changes and runs command on new files")
parser.add_argument("-d", "--daemon",
action="store_true",
)
parser.add_argument("--temp-file",
default=os.path.join(tempfile.gettempdir(), "monitor.tmp"),
help="Location of cache file, default {}".format(tempfile.gettempdir())
)
parser.add_argument("--trim-cache",
action="store_true",
help="Remove irrelevant directories from the cache")
parser.add_argument("--include",
action="append",
help="glob file matching, can be invoked multiple times",
)
parser.add_argument("--exclude",
action="append",
help="glob file matching, can be invoked multiple times",
)
parser.add_argument("command",
nargs=1,
help="Run command or script on each file: ./script file_foobar",
)
parser.add_argument("directories",
default=[os.getcwd()],
nargs='*',
)
args = parser.parse_args()
    # If the argument is an existing file, treat it as a script and use its
    # absolute path. Otherwise run it as a plain command.
if os.path.isfile(os.path.abspath(args.command[0])):
command = os.path.abspath(args.command[0])
else:
command = args.command[0]
temp_file = args.temp_file
args.directories = [os.path.abspath(directory) for directory in args.directories]
# Manage loading of cache_directories.
# Used to know when things have been processed or not.
# If it doesn't exist, create one
if os.path.isfile(temp_file):
with open(temp_file) as f:
cached_directories = json.load(f)
# --trim option
# If the directory isn't specified and is in the cache, remove it
if args.trim_cache is True:
non_existing_directories = list()
for directory in cached_directories:
if directory not in args.directories:
non_existing_directories.append(directory)
for directory in non_existing_directories:
cached_directories.pop(directory)
else:
cached_directories = dict()
# Decide to run as daemon or script.
# Having updated values will trigger a write to the temporary file
if args.daemon is True:
while True:
try:
result_cached_directories = process_files_command(command,
cached_directories,
args.directories,
include=args.include,
exclude=args.exclude)
if result_cached_directories is True:
write_json_file(cached_directories, temp_file)
time.sleep(1)
except KeyboardInterrupt:
write_json_file(cached_directories, temp_file)
break
else:
result_cached_directories = process_files_command(command,
cached_directories,
args.directories,
include=args.include,
exclude=args.exclude)
if result_cached_directories is True:
write_json_file(cached_directories, temp_file)
def process_files_command(command, cache_dictionary, directories, include=None, exclude=None):
def run_command(directory, file, command):
subprocess.run(command.split(" ") + [os.path.join(directory, file)])
cache_dictionary[directory][1].add(file)
for directory in cache_dictionary:
# Convert list of processed files to a set
cache_dictionary[directory][1] = set(cache_dictionary[directory][1])
    # Track whether anything changed instead of returning early,
    # so every directory gets processed in a single pass.
    changed = False
    for directory in directories:
        if directory in cache_dictionary:
            cached_directory_time = cache_dictionary[directory][0]
            current_directory_time = os.stat(directory).st_mtime
            if current_directory_time > cached_directory_time:
                if include or exclude:
                    directory_files = file_include_exclude(directory=directory, include=include, exclude=exclude)
                else:
                    directory_files = (file
                                       for file in os.listdir(directory)
                                       if os.path.isfile(os.path.join(directory, file)))
                # Skip files that are already in the cache;
                # run_command records each processed file, so no second add is needed.
                for file in directory_files:
                    if file not in cache_dictionary[directory][1]:
                        run_command(directory, file, command)
                cache_dictionary[directory][0] = current_directory_time
                changed = True
        else:
            # First time this directory is seen: cache it and process everything in it.
            cache_dictionary[directory] = [os.stat(directory).st_mtime, set()]
            if include or exclude:
                directory_files = file_include_exclude(directory=directory, include=include, exclude=exclude)
            else:
                directory_files = (file
                                   for file in os.listdir(directory)
                                   if os.path.isfile(os.path.join(directory, file)))
            for file in directory_files:
                run_command(directory, file, command)
            changed = True
    # True tells the caller the cache changed and should be written out.
    return changed
def write_json_file(dictionary, temp_file):
    cached_directories = dictionary
    with open(temp_file, "w") as temp_file_write_object:
        for directory in cached_directories:
            # Convert the set of processed files back to a list (sets are not JSON serializable)
            cached_directories[directory][1] = list(cached_directories[directory][1])
        json.dump(cached_directories, temp_file_write_object, indent=4, sort_keys=True)
def file_include_exclude(*, directory, include, exclude):
    files = os.listdir(directory)
if include:
included_filenames = {file for glob_match in include
for file in files
if pathlib.PurePath(file).match(glob_match)}
else:
included_filenames = set()
if exclude:
excluded_filenames = {file for glob_match in exclude
for file in files
if pathlib.PurePath(file).match(glob_match)}
else:
excluded_filenames = set()
    for file in files:
        # Files matching an --include pattern are always yielded.
        if file in included_filenames:
            yield file
        # When only --exclude patterns are given, yield everything not excluded.
        elif excluded_filenames and file not in excluded_filenames:
            yield file
if __name__ == '__main__':
main()
I'll paste the source link once I'm able to post links without being flagged.