Programming
This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.
I'm performing a test on some files in a directory. This test is run every minute and there are lots & lots of files in the directory, many more appearing as time goes on.
My script, VERY simplified, looks like this:
Code:
for f in $FILES
do
#Do stuff
done
The problem is that as this folder grows, ALL the files in the directory are processed each time, which is very wasteful of time and resources.
What I would like to do is get it to start from where it last left off.
How can I go about this goal?
Thanks for any help.
Last edited by Entropy1024; 09-04-2018 at 12:18 PM.
Do the file names follow some progression, such as an embedded date and time? If so you could base it on that naming progression. Otherwise you could base it on "find" output filtered by creation time or access time. Save that time from one run into a file and have the next run read that saved time as the basis for its search.
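A minimal sketch of that saved-time idea, assuming GNU find's -newermt and a state file name of my own choosing (lastrun.txt is hypothetical):
Code:
#!/bin/bash
# process only files modified since the time saved by the previous run
statefile=lastrun.txt
# on the very first run, fall back to the epoch so every file matches
[ -f "$statefile" ] || echo "1970-01-01 00:00:00" > "$statefile"
# note the current time *before* scanning, so no file falls into a gap
now=$(date "+%Y-%m-%d %H:%M:%S")
find . -maxdepth 1 -type f -name '*.jpg' -newermt "$(cat "$statefile")" -print0 |
    while IFS= read -r -d '' f
    do
        : # do stuff with "$f"
    done
echo "$now" > "$statefile"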
Yes the files all have date and time in them like this: DomeCCTV_216576543_20180904183710574_MOTION_DETECTION.jpg
I know I can find the last image using:
LASTIMAGE=$(ls | tail -1)
But I don't know how to tell the loop to start from that file.
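One minimal way to use that, assuming the embedded timestamps make the names sort in time order (lastfile.txt is a hypothetical state file), is a plain string comparison:
Code:
#!/bin/bash
# the names embed a timestamp, so string order is time order
last=$(cat lastfile.txt 2>/dev/null)    # empty on the very first run
for f in *.jpg
do
    [ -f "$f" ] || continue             # glob matched nothing
    # skip anything at or before the file we stopped at last time
    [[ "$f" > "$last" ]] || continue
    # do stuff with "$f"
    last=$f
done
printf '%s\n' "$last" > lastfile.txt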
Moving this to the Programming forum to give it better exposure.
Perhaps base it on the date or the last-modified flag.
But it depends on how the FILES list is constructed.
Remembering when the script last ran could be as simple as a specific file, visible or hidden, which you touch on each run to pinpoint the date and time you last processed the directory.
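For example, a minimal sketch of that marker-file idea (the .lastrun name and GNU touch -d are my own choices; -newer is standard find):
Code:
#!/bin/bash
# use a hidden marker file's mtime as "when did I last run?"
marker=.lastrun
[ -f "$marker" ] || touch -d "1970-01-01" "$marker"   # first run: everything counts as new
find . -maxdepth 1 -type f -name '*.jpg' -newer "$marker" -print0 |
    while IFS= read -r -d '' f
    do
        : # do stuff with "$f"
    done
touch "$marker"   # remember this run for the next one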
I have done something like that in bash, by storing a list of already-processed files and using diff to find the new files.
Example:
Code:
# list all files in directory which match the pattern
ls *.jpg > allfiles.txt
# create oldfiles.txt if it does not already exist
touch oldfiles.txt
# find files in allfiles.txt which are not in oldfiles.txt, create newfiles.txt
diff allfiles.txt oldfiles.txt | grep "<" | cut -d " " -f 2 > newfiles.txt
# process list
for item in $(cat newfiles.txt)
do
# do stuff
# append filename to oldfiles.txt
echo $item >> oldfiles.txt
done
# sort oldfiles.txt (necessary if files are created out of ls order)
sort oldfiles.txt > temp.txt
mv temp.txt oldfiles.txt
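As an aside, the diff/grep/cut step can also be done with comm(1), which expects both lists to be sorted; a sketch, not a drop-in replacement for the above:
Code:
# comm -23 prints lines that appear only in the first file; both files must be sorted
ls *.jpg | sort > allfiles.txt
sort -o oldfiles.txt oldfiles.txt
comm -23 allfiles.txt oldfiles.txt > newfiles.txt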
Since you are running the script every minute, you may wonder what happens if the run time is more than one minute. This could happen if there are many new files since the last script execution. You would end up with two or more simultaneously running scripts, which would duplicate the processing and leave duplicate entries in the list of old files. There is even a small chance it could garble the file lists. Not good!
The way I avoid this is by running a test at the beginning of the script:
Example:
Code:
# check and exit if task is already running
# extract script name (the bit after the last slash character)
thiscom=$(basename "$0")
# count processes with that name (the $( ) subshell usually shows up under the same name, hence the threshold of 2)
if [ "$(ps -C "$thiscom" -o comm= | wc -l)" -gt 2 ]
then
# exit due to duplicate process
exit
fi
# here start the file listing and processing...
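An alternative worth mentioning is flock(1) from util-linux, which hands the whole locking problem to the kernel; a minimal sketch (the lock file path is my own choice):
Code:
#!/bin/bash
# take an exclusive, non-blocking lock; bail out if another copy already holds it
exec 9>/tmp/myscript.lock || exit 1
flock -n 9 || exit 0    # another instance is running
# here start the file listing and processing...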
A shorter and safer variation on post #5 (it allows special characters in file names):
Code:
donefile=oldfiles.txt
for item in *.jpg
do
# skip non-files and files that are already done
if [ ! -f "$item" ] || grep -Fqx "$item" "$donefile"
then
echo "skipping $item"
continue
fi
# do stuff
# append filename to done files
echo "$item" >> $donefile
done
Thought I'd have a go at this with Python.
It monitors the directories you specify by checking each directory's modification time, then checking whether each file in that directory has already been processed.
If something has changed, it runs the command on the new file, so something like ./script file_foobar gets executed for each one.
You can run it in daemon mode with -d or --daemon, where it simply checks every second if something happened.
Code:
usage: monitor_directory.py [-h] [-d] [--temp-file TEMP_FILE] [--trim-cache]
[--include INCLUDE] [--exclude EXCLUDE]
command [directories [directories ...]]
Monitors directory for changes and runs command on new files
positional arguments:
command Run command or script on each file: ./script
file_foobar
directories
optional arguments:
-h, --help show this help message and exit
-d, --daemon
--temp-file TEMP_FILE
Location of cache file, default /tmp
--trim-cache Remove irrelevant directories from the cache
--include INCLUDE glob file matching, can be invoked multiple times
--exclude EXCLUDE glob file matching, can be invoked multiple times
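For instance, a hypothetical invocation (the script name, pattern, and path are made up) might look like:
Code:
# watch /srv/cctv as a daemon and run ./process.sh on each new .jpg
./monitor_directory.py -d --include '*.jpg' ./process.sh /srv/cctv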
monitor_directory.py
Code:
#!/usr/bin/env python3
import argparse
import pathlib
import json
import os
import subprocess
import tempfile
import time
def main():
parser = argparse.ArgumentParser(description="Monitors directory for changes and runs command on new files")
parser.add_argument("-d", "--daemon",
action="store_true",
)
parser.add_argument("--temp-file",
default=os.path.join(tempfile.gettempdir(), "monitor.tmp"),
help="Location of cache file, default {}".format(tempfile.gettempdir())
)
parser.add_argument("--trim-cache",
action="store_true",
help="Remove irrelevant directories from the cache")
parser.add_argument("--include",
action="append",
help="glob file matching, can be invoked multiple times",
)
parser.add_argument("--exclude",
action="append",
help="glob file matching, can be invoked multiple times",
)
parser.add_argument("command",
nargs=1,
help="Run command or script on each file: ./script file_foobar",
)
parser.add_argument("directories",
default=[os.getcwd()],
nargs='*',
)
args = parser.parse_args()
    # If the argument is an existing file, treat it as a script and use its
    # absolute path. Otherwise run it as a plain command.
if os.path.isfile(os.path.abspath(args.command[0])):
command = os.path.abspath(args.command[0])
else:
command = args.command[0]
temp_file = args.temp_file
args.directories = [os.path.abspath(directory) for directory in args.directories]
# Manage loading of cache_directories.
# Used to know when things have been processed or not.
# If it doesn't exist, create one
if os.path.isfile(temp_file):
with open(temp_file) as f:
cached_directories = json.load(f)
# --trim option
# If the directory isn't specified and is in the cache, remove it
if args.trim_cache is True:
non_existing_directories = list()
for directory in cached_directories:
if directory not in args.directories:
non_existing_directories.append(directory)
for directory in non_existing_directories:
cached_directories.pop(directory)
else:
cached_directories = dict()
# Decide to run as daemon or script.
# Having updated values will trigger a write to the temporary file
if args.daemon is True:
while True:
try:
result_cached_directories = process_files_command(command,
cached_directories,
args.directories,
include=args.include,
exclude=args.exclude)
if result_cached_directories is True:
write_json_file(cached_directories, temp_file)
time.sleep(1)
except KeyboardInterrupt:
write_json_file(cached_directories, temp_file)
break
else:
result_cached_directories = process_files_command(command,
cached_directories,
args.directories,
include=args.include,
exclude=args.exclude)
if result_cached_directories is True:
write_json_file(cached_directories, temp_file)
def process_files_command(command, cache_dictionary, directories, include=None, exclude=None):
def run_command(directory, file, command):
subprocess.run(command.split(" ") + [os.path.join(directory, file)])
cache_dictionary[directory][1].add(file)
for directory in cache_dictionary:
# Convert list of processed files to a set
cache_dictionary[directory][1] = set(cache_dictionary[directory][1])
    # Track whether anything changed instead of returning early,
    # so every directory gets processed in a single pass.
    changed = False
    for directory in directories:
        if directory in cache_dictionary:
            cached_directory_time = cache_dictionary[directory][0]
            current_directory_time = os.stat(directory).st_mtime
            if current_directory_time > cached_directory_time:
                if include or exclude:
                    directory_files = file_include_exclude(directory=directory, include=include, exclude=exclude)
                else:
                    directory_files = (file
                                       for file in os.listdir(directory)
                                       if os.path.isfile(os.path.join(directory, file)))
                # Skip files that are already in the cache;
                # run_command records each processed file, so no second add is needed.
                for file in directory_files:
                    if file not in cache_dictionary[directory][1]:
                        run_command(directory, file, command)
                cache_dictionary[directory][0] = current_directory_time
                changed = True
        else:
            # First time this directory is seen: cache it and process everything in it.
            cache_dictionary[directory] = [os.stat(directory).st_mtime, set()]
            if include or exclude:
                directory_files = file_include_exclude(directory=directory, include=include, exclude=exclude)
            else:
                directory_files = (file
                                   for file in os.listdir(directory)
                                   if os.path.isfile(os.path.join(directory, file)))
            for file in directory_files:
                run_command(directory, file, command)
            changed = True
    # True tells the caller the cache changed and should be written out.
    return changed
def write_json_file(dictionary, temp_file):
    cached_directories = dictionary
    with open(temp_file, "w") as temp_file_write_object:
        for directory in cached_directories:
            # Convert the set of processed files back to a list (sets are not JSON serializable)
            cached_directories[directory][1] = list(cached_directories[directory][1])
        json.dump(cached_directories, temp_file_write_object, indent=4, sort_keys=True)
def file_include_exclude(*, directory, include, exclude):
    files = os.listdir(directory)
if include:
included_filenames = {file for glob_match in include
for file in files
if pathlib.PurePath(file).match(glob_match)}
else:
included_filenames = set()
if exclude:
excluded_filenames = {file for glob_match in exclude
for file in files
if pathlib.PurePath(file).match(glob_match)}
else:
excluded_filenames = set()
    for file in files:
        # Files matching an --include pattern are always yielded.
        if file in included_filenames:
            yield file
        # When only --exclude patterns are given, yield everything not excluded.
        elif excluded_filenames and file not in excluded_filenames:
            yield file
if __name__ == '__main__':
main()
I'll paste the source link once I'm able to post links without being flagged.