seeking HOWTO -- script mark files 'done' in a long list of files to process

SaintDanBert · 07-21-2016, 11:31 AM

There are all sorts of ways to generate a list of files that you want to process within a script. Does anyone have an "elegant" way to mark-off each completed file?

My challenge occurs when the desired processing involves creation of a tar-archive or similar container. In those cases, the open-modify-close operations on the container result in a huge amount of overhead. In addition, there are often unwanted side effects with the resulting content of the container.

Using a loop:

Code:

    # create list-of-files
    # Get a filespec from the list
    # process it
    # mark-it done

works well for operations such as filtering photo image files or video, altering standard parameters in documents, bulk changes to source code, and so on.

NOTE -- In ancient times, MS-DOS had a command 'xcopy' that could mark files when the copy completed.

Merci d'avance,
~~~ 0;-Dan

pan64 · 07-22-2016, 05:09 AM

looks like you need some kind of make tool (like make itself)

Habitual · 07-22-2016, 07:39 AM

Quote:

Originally Posted by SaintDanBert

MS-DOS had a command 'xcopy' that could mark files when the copy completed.

Smacks of rsync.

What the heck does "mark-it done" mean?

chrism01 · 08-03-2016, 02:53 AM

If you are working your way down a list there's no need to to 'mark it done'....
If you mean you may (for some odd reason) end up re-generating the list part way through or similar, I'd just create a 'done' dir and move each file into there immediately after you have finished with it. This is (part of) a classic technique for processing continuously incoming files.

tfjonesjr · 08-03-2016, 10:09 AM

I've had the same challenge and ended up renaming each file when it's been processed. I usually prefix the filename with "done-". The benefit of that is users can monitor the folder and see that files have or have not been processed. You can also have a file processed again by manually renaming and removing the "done-" prefix.

SaintDanBert · 08-08-2016, 12:59 PM

Quote:

Originally Posted by Habitual

Smacks of rsync.

What the heck does "mark-it done" mean?

mark-it-done ===>

get a filespec from the to-do list
process that filespec
somehow record that you worked that filespec
- delete the filespec from the to-do list
- alter the to-do list entry for that filespec
- write a separate done-list with that filespec
- ...
repeat for all un-processed filespecs

If processing must be restarted, you can avoid processing items with the mark-it-done status and resume work with those that you have not processed.

~~~ 0;-Dan

SaintDanBert · 08-08-2016, 01:03 PM

Quote:

Originally Posted by chrism01

If you are working your way down a list there's no need to to 'mark it done'....
If you mean you may (for some odd reason) end up re-generating the list part way through or similar, I'd just create a 'done' dir and move each file into there immediately after you have finished with it. This is (part of) a classic technique for processing continuously incoming files.

I like this idea for one class of files that I'll be processing -- media cards (SD, CF, thumb, etc) -- but it would be trouble for a live file system.

That said, it might work to use a done folder and fill it with symlinks as the to-do list. Then I could remove the links as I process things leaving behind what reamains to-do.

Nice thinking, chrism01;5585085
~~~ 0;-Dan

SaintDanBert · 08-08-2016, 01:05 PM

Quote:

Originally Posted by pan64

looks like you need some kind of make tool (like make itself)

Ah, the venerable make... and its decendants. I'll need to ponder this option a looooooooonnnnggggg time.

Thanks,
~~~ 0;-Dan

schneidz · 08-08-2016, 01:19 PM

what the hell is a filespec ?

chrism01 · 08-09-2016, 12:42 AM

nw: like I said its a classic soln in eg trading banks (trades come in as files initially).
Also, create a new dir every eg mth for a) ease of finding stuff, b) avoid hitting limit on num files/dir in long run.
If this is really a long term soln, you also need to archive off eventually, or you will run out of inodes possibly even before running out of disk space.

SaintDanBert · 08-09-2016, 02:49 PM

Quote:

Originally Posted by schneidz

what the hell is a filespec ?

I'm sorry, I've used that term for decades, but then I'm a serious dinosaur.
In general, a 'filespec' is a file specification -- /path1/.../pathN/filename.type
If there is a network involved -- username@hostname:/path1/.../pathN/filename.type

SaintDanBert · 08-09-2016, 02:55 PM

Quote:

Originally Posted by chrism01

nw: like I said its a classic soln in eg trading banks (trades come in as files initially).
Also, create a new dir every eg mth for a) ease of finding stuff, b) avoid hitting limit on num files/dir in long run.
If this is really a long term soln, you also need to archive off eventually, or you will run out of inodes possibly even before running out of disk space.

All good points that I'd likely not considered until things started failing.

To restate my original requirement, I need to make tar-balls from sets of files. These runs can take lots of wall-clock time. That means that there are lots of opportunities for the run to get interrupted by power or network troubles. It is okay to have tar-ball-1, tar-ball-2, ..., tar-ball-N of varied sizes. My primary concern is that I be able to (1) resume processing after an interruption, and (2) avoid processing input files repeatedly.

Thanks,
~~~ 0;-Dan

chrism01 · 08-10-2016, 03:35 AM

Y: so use a 'done' dir for ones that are complete immediately they are completed. This solves the restartability issue.
You may even (paranoia mode) touch a done file just after completing a tar but before mv'ing tar-ball to done dir.
This deals with the faint possibility of failure right at the last possible millisecond

Quote:

The paranoid programmer assumes the system is out to get them and acts accordingly