Find command returns files I don't want

harboa · 02-07-2018, 11:00 AM

I hope this is in the correct area!

I've written a script to receive parameters (SOURCEDIR, TARGETDIR, FILEMASK, FILEMIN) to search for files and move them if they match the criteria. There is more to the complete script but this is the FIND command I am using :-

Code:

for file in `find ${SOURCEDIR} -maxdepth 1 -name "${FILEMASK}" -type f -mmin +${FILEMIN}`; do

However, it finds other files that I am NOT looking for (which are handled automatically by another process). For example, looking for a filename "*ER1_100*", it finds "EDI1160_Z_SHP_OBDLV_SAVE_REPLICA_22648345".

Am I getting the name process wrong or is there some funny way that the FIND process works that I don't understand (quite likely..!!).

Any help would be really appreciated.

rknichols · 02-07-2018, 11:30 AM

Quote:

Originally Posted by harboa

Code:

for file in `find ${SOURCEDIR} -maxdepth 1 -name "${FILEMASK}" -type f -mmin +${FILEMIN}`; do

There's nothing obviously wrong with that find command. Try running the script with the "-x" option ("bash -x path/to/script ...") to see how that command is actually being invoked.

BW-userx · 02-07-2018, 12:01 PM

You could also let find list the candidates and do the name match yourself with a substring test, something like this:

Code:

#!/bin/bash

FILEMASK="ER1_100"   # plain substring, no wildcards
SOURCEDIR=
TARGETDIR=
FILEMIN=

# Create the target once, before the loop.
mkdir -p "$TARGETDIR"

# IFS= and -r keep leading blanks and backslashes in the names intact.
while IFS= read -r f ; do

	path=${f%/*}        # directory part
	xbase=${f##*/}      # file name without the path
	ext=${xbase##*.}    # extension, if there is one
	pref=${xbase%.*}    # file name minus the path and the extension

	# Do whatever you want here when the name contains your substring (the file mask).
	[[ $pref =~ "$FILEMASK" ]] && mv -v "$f" "$TARGETDIR"

done < <(find "${SOURCEDIR}" -maxdepth 1 -type f -mmin +"${FILEMIN}")

It still does basically the same thing, but this way you could also set up different substrings for more than one kind of file name and process them all within the same script, just by adding more substring variables and using if/else statements.
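As a rough sketch of that idea, a case statement can route each name by substring. The helper target_for and the directory name "er1" are made up for illustration; only the masks and the GIS directory come from this thread.

```shell
#!/bin/sh
# target_for is a hypothetical helper: it prints the target subdirectory for a
# file name, or an empty line when no mask matches.
target_for() {
    case $1 in
        *ER1_100*)   echo "er1" ;;   # "er1" is an invented directory name
        *SEPA_PAIN*) echo "GIS" ;;
        *)           echo "" ;;      # no mask matched: leave the file alone
    esac
}

target_for "SEPA_PAIN03_5539.xml"
target_for "EDI1160_Z_SHP_OBDLV_SAVE_REPLICA_22648345"
```

The calling loop would then skip any file for which the helper prints nothing, and adding another file type is one more case branch.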

AwesomeMachine · 02-07-2018, 06:18 PM

I don't think the wildcard '*' characters are going to work correctly.

rknichols · 02-07-2018, 06:51 PM

Quote:

Originally Posted by AwesomeMachine

I don't think the wildcard '*' characters are going to work correctly.

As long as they are in a quoted string so that the shell won't try to expand them, they will work just fine in find.

Note that a construct like "for F in `find ...`; do ...; done" is going to misbehave quite badly if any of the found names contains embedded whitespace characters. The "for" loop is going to treat each whitespace-separated word as a separate name. If the names contain any shell meta-characters, the shell will try to process those, too, before assigning the result to the variable.
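A minimal sketch of the difference, assuming bash and a find that supports -print0 (GNU and BSD find do). The demo directory and file names are invented for the illustration.

```shell
#!/bin/bash
# Demo only: a scratch directory with one name that contains a space.
demo=$(mktemp -d)
touch "$demo/ER1_100 report.txt" "$demo/ER1_100_plain.txt"

# Unsafe: the unquoted command substitution is split on whitespace,
# so "ER1_100 report.txt" reaches the loop as two separate words.
unsafe_count=0
for f in $(find "$demo" -maxdepth 1 -type f -name '*ER1_100*'); do
    unsafe_count=$((unsafe_count + 1))
done

# Safer: NUL-delimited names read one at a time, with no splitting or globbing.
safe_count=0
while IFS= read -r -d '' f; do
    safe_count=$((safe_count + 1))
done < <(find "$demo" -maxdepth 1 -type f -name '*ER1_100*' -print0)

echo "unsafe loop iterations: $unsafe_count, safe loop iterations: $safe_count"
rm -rf "$demo"
```

The two files produce more iterations in the first loop than in the second, because the name with the space is counted twice.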

harboa · 02-21-2018, 03:44 AM

Quote:

Originally Posted by rknichols

There's nothing obviously wrong with that find command. Try running the script with the "-x" option ("bash -x path/to/script ...") to see how that command is actually being invoked.

I've got the output now from the SAP guys (who run the script in some weird way; it's not scheduled in cron).

Code:

[ -z ER2 ]
basename /usr/sap/ER2/interfaces/ER2/100/scripts/movefile.sh .sh
basename=movefile
dirname /usr/sap/ER2/interfaces/ER2/100/scripts/movefile.sh
BASEDIR=/usr/sap/ER2/interfaces/ER2/100/scripts
date '+%d/%m/%y %X'
DATE='16/02/18 10:08:50 AM'
awk -FER2 '{print length($0)-length($NF)+2}'
echo /usr/sap/ER2/interfaces/ER2/100/scripts
SIDPOS=29
expr substr /usr/sap/ER2/interfaces/ER2/100/scripts 29 3
CLIENT=100
BASEPATH=/usr/sap/ER2/interfaces/ER2/100
date +%Y%m
LOGFILE=/usr/sap/ER2/interfaces/ER2/100/log/movefile_201802.log
date '+%d/%m/%y %X'
echo '16/02/18 10:08:50 AM - Start process..'
1>> /usr/sap/ER2/interfaces/ER2/100/log/movefile_201802.log
[ -d /usr/sap/ER2/interfaces/ER2/100/out/ -a -d /usr/sap/ER2/interfaces/ER2/100/out/GIS/ ]
[ 2 -ge 0 ]
find /usr/sap/ER2/interfaces/ER2/100/out/ -maxdepth 1 -name '*SEPA_PAIN*' -type f -mmin +2
echo '                       Moving file /usr/sap/ER2/interfaces/ER2/100/out/SEPA_PAIN03_5539.xml from /usr/sap/ER2/interfaces
1>> /usr/sap/ER2/interfaces/ER2/100/log/movefile_201802.log
mv /usr/sap/ER2/interfaces/ER2/100/out/SEPA_PAIN03_5539.xml /usr/sap/ER2/interfaces/ER2/100/out/GIS/
date '+%d/%m/%y %X'
echo '16/02/18 10:08:50 AM - End process..'
1>> /usr/sap/ER2/interfaces/ER2/100/log/movefile_201802.log
[ Y '==' Y ]

Everything seems normal to me and, in this case, it worked. I think the trouble here is going to be catching the failure as it doesn't always happen (obviously..)

It's almost like the find does things in stages (like find the files older than 2 mins and then run the mask check against them). Is this possible?

keefaz · 02-21-2018, 04:41 AM

If the find failure concerns the filenames, there are not many possible sources of trouble.
It comes down to what ${FILEMASK} contains, so I would troubleshoot how its value is assigned by putting some echo ${FILEMASK} statements through the script.
Of course it helps to use the same conditions as when the script fails (not when it's working).

harboa · 02-21-2018, 07:07 AM

Quote:

Originally Posted by keefaz

If the find failure concerns the filenames, there are not many possible sources of trouble.
It comes down to what ${FILEMASK} contains, so I would troubleshoot how its value is assigned by putting some echo ${FILEMASK} statements through the script.
Of course it helps to use the same conditions as when the script fails (not when it's working).

The FILEMASK is not changed at any point in the script and contains the correct value at time of the find execution (in this case "*SEPA_PAIN*").

This is an intermittent issue, so we cannot reproduce it whenever we want. The directory where we are searching for the files is dynamic (files can be added and/or removed while this script is running), so I was wondering whether find is creating a list of the files internally, then trying to process them, and finding that one of the files in its list is no longer there. Is this possible?

pan64 · 02-21-2018, 08:27 AM

I would try find without the for loop. Use find -D (see the man page) if you wish to debug find, i.e. to understand how it works.

Yes, it is possible that find found a file which was then deleted before it was processed.

harboa · 02-21-2018, 08:59 AM

Quote:

Originally Posted by pan64

I would try find without the for loop. Use find -D (see the man page) if you wish to debug find, i.e. to understand how it works.

Yes, it is possible that find found a file which was then deleted before it was processed.

I suppose a better phrasing of my question would be: would find list all the files internally and then try to apply the parameters against that list? The thing I find hard to understand is why, when I have given the filename mask to use (in this case '*SEPA_PAIN*'), it complains about a file that it should not even be looking at (i.e. the filename does not contain the filemask specified).

I have tried to use the -D option but I'm afraid I'm left none the wiser after viewing the output. It didn't make a lot of sense to me..!

rknichols · 02-21-2018, 09:33 AM

I've looked at the execution of find with ltrace, and I see that when it processes a directory it first reads and saves all the names unconditionally. Then, taking the names one at a time, it performs all tests on each name before moving on to the next.

By any chance do any of the names in the directory contain any embedded shell meta-characters? Any such name returned by find is going to get expanded by the shell since pathname expansion occurs after command substitution. Combine that with some embedded whitespace characters in the names and your "for ..." loop could end up doing almost anything. For example, an unlikely name "EDI* ER1_100" could satisfy the name test in your find command and would result in processing all files with names that start with "EDI". It's generally a lot safer to let find directly invoke the processing (with "-exec") rather than letting a shell process the names first.
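A small sketch of that failure mode and of the -exec alternative. The scratch directory, the "done" subdirectory and the file name EDI1160_other are invented for the demonstration; the awkward name is the one from the example above.

```shell
#!/bin/sh
# Demo only: the directory and file names below are invented for illustration.
demo=$(mktemp -d)
mkdir "$demo/done"
touch "$demo/EDI1160_other" "$demo/EDI* ER1_100"

# Unquoted command substitution: the one matching name is split at the space,
# and the first half (ending in "EDI*") is then glob-expanded by the shell.
loop_words=$(for f in $(find "$demo" -maxdepth 1 -type f -name '*ER1_100*'); do
    printf '%s\n' "$f"
done)
printf 'the loop would process:\n%s\n' "$loop_words"

# -exec: find passes each name to mv as one argument; the shell never re-parses it.
find "$demo" -maxdepth 1 -type f -name '*ER1_100*' -exec mv -- {} "$demo/done/" \;

moved=no;     [ -e "$demo/done/EDI* ER1_100" ] && moved=yes
untouched=no; [ -e "$demo/EDI1160_other" ] && untouched=yes
printf 'matching file moved: %s, unrelated EDI file untouched: %s\n' "$moved" "$untouched"

rm -rf "$demo"
```

The loop's word list includes the unrelated EDI file even though find never reported it, while the -exec form only ever touches the name that actually matched.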

harboa · 02-21-2018, 10:12 AM

Quote:

Originally Posted by rknichols

I've looked at the execution of find with ltrace, and I see that when it processes a directory it first reads and saves all the names unconditionally. Then, taking the names one at a time, it performs all tests on each name before moving on to the next.

By any chance do any of the names in the directory contain any embedded shell meta-characters? Any such name returned by find is going to get expanded by the shell since pathname expansion occurs after command substitution. Combine that with some embedded whitespace characters in the names and your "for ..." loop could end up doing almost anything. For example, an unlikely name "EDI* ER1_100" could satisfy the name test in your find command and would result in processing all files with names that start with "EDI". It's generally a lot safer to let find directly invoke the processing (with "-exec") rather than letting a shell process the names first.

I thought it might be doing something like this but had no idea how to prove it..! So thanks for that - it explains a lot.

None of the names should be "funny" - they are all coming out from SAP so will be fairly standard creation - nothing manual!

From "-exec" perspective, hadn't really thought about that. So, excuse any errors in this attempt, would it be something like this (based on original find)..?

Code:

find ${SOURCEDIR} -maxdepth 1 -name "${FILEMASK}" -type f -mmin +${FILEMIN} -exec mv {} ${NewPath} \;

Is that about right?

keefaz · 02-21-2018, 10:49 AM

Maybe the error comes not from find but from the commands that follow in the loop.

What is the exact evidence that proves that find returns wrong files in the script?

harboa · 02-21-2018, 10:56 AM

Quote:

Originally Posted by keefaz

Maybe the error comes not from find but from the commands that follow in the loop.

What is the exact evidence that proves that find returns wrong files in the script?

As I said in my original post..

For example, looking for a filename "*ER1_100*", it finds "EDI1160_Z_SHP_OBDLV_SAVE_REPLICA_22648345"

The error we got in the script was "file not found", as the "EDI1160" file mentioned had already been moved to another directory by another script and, in any case, wasn't a file that I wanted to perform any actions on.

pan64 · 02-21-2018, 11:21 AM

As rknichols already explained, find first reads the directory and then processes that list. It is a common problem if the directory is modified during that process; find may fail simply because it cannot follow the change. For example, find still thinks the first entry in the directory matches, but in the meantime that file has been removed and a new one created, so find prints the first file, which is now incorrect. I don't know the internals of find, but that (or something similar) is what I can imagine.
But I have other comments too, which may help:
for i in $(find ... )
is not a good construct: find is a loop itself, so putting its result into another loop is superfluous. One thing will definitely happen: find runs in a subshell, and the for loop does not start until find has completed.
So find starts to work on a directory, the for loop processes the result only after find has completed, and during that time the content of the directory may have changed. This is definitely unsafe. Better to use:
find ..... -exec script.sh {} \;
where script.sh is the main part of your loop (every file is passed to it one by one). In this case the script starts earlier (it does not need to wait), but it is still unsafe.
If you want to be faster still, you could implement the search in Python, Perl or whatever, which can act immediately on a match (but you cannot completely eliminate the race condition this way).
You may also want to drop find completely, because ls *MASK* is simply much faster (you can check the other attributes later), although this construct is still unsafe.
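Since the race cannot be removed, one option is to make the move step tolerate it. The following is only a sketch: move_if_present is a hypothetical helper, and the SEPA file names are invented for the demo.

```shell
#!/bin/sh
# move_if_present is a hypothetical helper, not part of the original script.
# It re-checks the file just before moving it and treats a vanished file as a
# skip rather than an error. This narrows the race window but cannot close it:
# the file may still disappear between the test and the mv.
move_if_present() {
    src=$1
    dest=$2
    if [ -f "$src" ] && mv -- "$src" "$dest" 2>/dev/null; then
        echo "moved: $src"
    else
        echo "skipped (gone or unmovable): $src"
    fi
}

# Demo with invented names: one file exists, the other was already taken
# away by another process.
demo=$(mktemp -d)
mkdir "$demo/GIS"
touch "$demo/SEPA_PAIN03_1.xml"
present=$(move_if_present "$demo/SEPA_PAIN03_1.xml" "$demo/GIS/")
vanished=$(move_if_present "$demo/SEPA_PAIN03_2.xml" "$demo/GIS/")
echo "$present"
echo "$vanished"
rm -rf "$demo"
```

Saved as a small script and called from find with -exec, this would log a skip instead of failing with "file not found" when another process gets to the file first.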