[SOLVED] Bash script modification: Allowing spaces in directory/file name searches

aadams · 03-12-2015, 08:06 PM

OS: RHEL5
Scripting Experience: Beginner
Bash

Currently, myscript.sh will check the contents of fileA and compare it to directoryA and all of its subdirectory names and file names. If myscript.sh finds matches, it writes the matches to $1_found.csv. If myscript.sh looks at the line-by-line content inside fileA and is unable to match any entries to anything inside directoryA, it then writes each of those items to $1_notfound.csv. This is how it should work.

As it stands, the problem is that directoryA and fileA both contain some files whose names contain spaces (Example: "My file 1", as opposed to "Myfile2"). So, when myscript.sh runs, it finds "Myfile2" in both fileA and directoryA, then writes the name of "Myfile2" to $1_found.csv. That's perfect. However, "My file 1" is read as 3 separate files. Even though it is actually present in both fileA and directoryA, its name gets written to $1_notfound.csv because it has spaces in its name. I need "My file 1" and any other file with spaces in the name to get written to $1_found.csv, if it exists in both locations.

So, I am looking for a nice fix for this problem. Matching directory and filenames between fileA and directoryA should still be written to $1_found.csv, even if the name has spaces in it. I know it's not good practice to use filenames with spaces in Linux, but there are other stakeholders involved and files which came from other OS platforms (e.g. Windows), and I don't have the luxury of simply replacing all the spaces with an underscore, for example.

I GREATLY appreciate any help on this!

Here's the script...

printf "\n"
echo "--- File Existance Script ----------------------------------"
echo "Script Started: `date`"
echo "Clearing log files..."
rm $1_found.csv
rm $1_notfound.csv
echo "Starting File Check..."
ROW_COUNTER=0
FOUND_COUNTER=0
NOTFOUND_COUNTER=0

while IFS="," read path id
do
for f in $path;
do
if [ -f $f ]
then
FOUND_COUNTER=$((FOUND_COUNTER+1))
echo "$id,$f" >> $1_found.csv;
else
NOTFOUND_COUNTER=$((NOTFOUND_COUNTER+1))
echo "$id,$f" >> $1_notfound.csv
fi
done
ROW_COUNTER=$((ROW_COUNTER+1))
printf "\rProcessing Record(s): $ROW_COUNTER"
done < $1
printf "\n"
echo "Script Ended: `date`"
printf "\n"
echo "Files Found: $FOUND_COUNTER"
echo "Files NOT Found: $NOTFOUND_COUNTER"
echo "------------------------------------------------------------"
printf "\n"

T3RM1NVT0R · 03-12-2015, 08:35 PM

How about using double quotes in the source file (fileA in your case)? You can use the following sed command to wrap each line of fileA in double quotes "":

Code:

cat fileA | sed 's/.*/\"&\"/g' > newfilename

Once the quotes are in place, you can rename newfilename to fileA and try it against your script.

Miati · 03-12-2015, 09:52 PM

For loops generally don't work well when reading from file lists.
Here's a quick little script to demonstrate my point

If I have a file named content with the 3 lines:

Code:

foo
bar
foo bar

I want to echo each line so I type in this script to test both ways

Code:

while read lines
        do
                for i in $lines
                        do
                                echo "$i"
                done
done < "$1"

echo ""

while read lines
        do
                echo "$lines"
done < "$1"

Then I run as ./filelister content

and I get this:

Code:

foo
bar
foo
bar

foo
bar
foo bar

Notice the second (while loop) kept the line together while the first (for loop) did not despite being very similar otherwise.
This makes sense if you think about how for loops work.

Code:

for i in A B C D; do echo $i; done

This will echo out A B C D, each going through its own loop.
But what if it was read from a file that went like this?

Code:

A
B C D

Doesn't matter: if it's passed to a for loop like A B C D, it's going to treat each one separately.
Reading it from a while loop prevents this since it deals with each line separately before moving on.
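
To make the difference concrete, here is a small self-contained sketch of both approaches. The function names split_words and keep_lines are made up for illustration, and IFS= with read -r is an addition not used in the script above; as far as I know it stops read from trimming surrounding whitespace and from interpreting backslashes.

```bash
#!/bin/bash
# Illustration only: split_words and keep_lines are hypothetical helpers.

# Unquoted expansion in a for loop: each line is split into words.
split_words() {
    local line word
    while IFS= read -r line; do
        for word in $line; do          # deliberately unquoted
            printf '%s\n' "$word"
        done
    done
}

# Quoted expansion: each line is kept in one piece.
keep_lines() {
    local line
    while IFS= read -r line; do
        printf '%s\n' "$line"
    done
}

input=$'foo\nbar\nfoo bar'
split_out=$(split_words <<< "$input")
keep_out=$(keep_lines <<< "$input")
printf 'for loop:\n%s\n\nwhile loop:\n%s\n' "$split_out" "$keep_out"
```

The only difference between the two helpers is the unquoted $line in the inner for loop, which is where "foo bar" falls apart into two items.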

grail · 03-12-2015, 11:53 PM

Ok, first off, please place code/data in [code][/code] tags to make it more readable.

Other than that, I see a few issues:

1. All places where you refer to :- $1_anything ... This will not behave at all like you expect. I would suggest putting set -xv on the second line of your script and checking the output provided.

2. As advised above, when in doubt, quote. As a beginner, I would suggest you put double quotes around ALL variables until you are ready to experiment with the cases where they might not be needed.

3. $() is preferred over `` as it is both clearer and can be nested easily, should you require.

4. -f advises 'True if file exists and is a regular file.' ... is there any requirement if it exists but is a different type of file? (ie. sym link or directory)

5. Try naming your variables so that things like the passed-in parameters are removed. In this code it is not such an issue, but once you start creating functions and the like you will start to have many $1 values, and it would be nice to know what each refers to and that we are not looking at global variables when we shouldn't be.

6. I have to presume you pass the name of the file to be read to the script ... there is no usage statement to advise what I need to do, nor any test to make sure it exists in the first place??

7. As you provided no example input data, from the script I ascertain that it is a csv file with a path and an id.
a. Assuming 'path' is something like 'directoryA', the for loop will run exactly once and test to see if there exists a file in my current directory called 'directoryA'
b. If a path to a file, 'directoryA/fileA', it will again run only once and now need the sub-directory and the file to exist
c. If it contains a glob, 'directoryA/*', now it will run multiple times (assuming sub-directory exists, otherwise it runs once), but unless the 'id' is meaningful enough, all data from all sub-directories will end up in the same found/notfound files (not sure if this is desired or not)

8. Info :- ++ increments work in bash (below are equivalent):

Code:

FOUND_COUNTER=$((FOUND_COUNTER+1))

((FOUND_COUNTER++))

9. Personal preference :- not sure I see the point of a mix of printf and echo statements? Even better, you could use heredocs
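
A rough sketch of how points 1, 2, 3, 5 and 6 might look when applied to the start of the script. The names prepare_logs, list_file, found_log and notfound_log are made up for illustration and are not from the original script. As far as I know, bash reads $1_found.csv as $1 followed by _found.csv, because a positional parameter without braces is a single digit, so the braces below are mainly for readability; the double quotes are what protect names with spaces.

```bash
#!/bin/bash
# Sketch only: prepare_logs, list_file, found_log and notfound_log are
# hypothetical names chosen for this example.

prepare_logs() {
    # Point 6: check the argument before using it.
    if [ "$#" -ne 1 ] || [ ! -r "$1" ]; then
        echo "Usage: prepare_logs <list.csv>" >&2
        return 1
    fi

    # Point 5: give the positional parameter a meaningful name.
    list_file="$1"

    # Points 1 and 2: braces mark where the variable name ends,
    # double quotes keep a name with spaces in one piece.
    found_log="${list_file}_found.csv"
    notfound_log="${list_file}_notfound.csv"

    # Point 3: $() instead of backticks.
    echo "Script Started: $(date)"
    echo "Clearing log files..."

    # -f keeps rm quiet when the logs do not exist yet.
    rm -f -- "$found_log" "$notfound_log"
}
```

Calling prepare_logs "$1" near the top of a script would then leave the log names in found_log and notfound_log for the rest of the script to use.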

Hope some of that helps

aadams · 03-17-2015, 01:22 PM

Thank you everyone for the feedback. grail and T3RM1NVT0R, your suggestions were definitely beneficial to me, but I still haven't been able to figure out a solution for this particular problem. My fileA is a filelist that is a .csv file with commas as delimiters. directoryA is a large (360GB) directory structure that contains regular files and a few irregular files throughout the directory tree. I thought it was odd/interesting that when I did cat fileA or vim fileA, the output looked clean, however, after running my search script against fileA, my notfound.csv file now lists its entries in a strange way, not at all like the entries looked in fileA, prior to running the search script. So, I presume my search script isn't reading fileA correctly. What is it about the search script that's not reading or translating my fileA correctly?

fileA is not broken up at the spaces when using cat or vim to view it, however, notfound.csv breaks the entries up just after the spaces.

Note the spaces after "BM". fileA lists all the entries like this, as a single line.

#cat fileA.csv

/content/bogusdirectory/BM BOSIS-00-B00-00297-00/BM BOSIS-00-G10-00297-00_991_A.*,13999
/content/bogusdirectory/BM BOSIS-00-B00-00299-00/BM BOSIS-00-G10-00299-00_991_A.*,14000

notfound.csv lists all the entries, but they are broken up at the spaces and put on separate lines inside the file. Again, all the entries here in notfound.csv came from the filelist found inside fileA. The search script read fileA and compared it to directoryA; if it didn't find an entry in directoryA, it wrote the entry to notfound.csv, but when it found a space in an entry, it put the subsequent characters on a new line.

cat notfound.csv

,/content/bogusdirectory/BM
,BOSIS-00-B00-00297-00/BM
,BOSIS-00-B00-00297-00_991_A.*
,/content/bogusdirectory/BM
,BOSIS-00-B00-00299-00/BM
,BOSIS-00-B00-00299-00_991_A.*

grail · 03-17-2015, 08:34 PM

Ok ... so now that we have some data

The for loop below will split the string using IFS, so it is broken at whitespace and your strings are not what you expect:

Code:

for f in $path;

So my suggestion would be to remove the '*' from the fileA.csv file and then we can use the globbing in the for loop:

Code:

$ cat fileA.csv
/content/bogusdirectory/BM BOSIS-00-B00-00297-00/BM BOSIS-00-G10-00297-00_991_A.,13999
/content/bogusdirectory/BM BOSIS-00-B00-00299-00/BM BOSIS-00-G10-00299-00_991_A.,14000

# then inside script use
while IFS="," read path id
do
  for f in "$path"*
  do

Let me know if that works out for you

aadams · 03-18-2015, 07:36 AM

grail, I have removed the '*' from all lines in fileA and used the globbing in the for loop as recommended. Now, I get the output listed below. Line 16 is where you'll find... if [ -f $f ]. Much gratitude for trying to help me figure this out.

Output after making the above changes, then running search.sh against fileA:

$ ./search.sh fileA.csv

--- File Existance Script ----------------------------------
Script Started: Wed Mar 18 08:18:47 EDT 2015
Clearing log files...
Starting File Check...
./search.sh: line 16: [: too many arguments
Processing Record(s): 1./search.sh: line 16: [: too many arguments
Processing Record(s): 2./search.sh: line 16: [: too many arguments
Processing Record(s): 3./search.sh: line 16: [: too many arguments
Processing Record(s): 4./search.sh: line 16: [: too many arguments
Processing Record(s): 5./search.sh: line 16: [: too many arguments
Processing Record(s): 6./search.sh: line 16: [: too many arguments
Processing Record(s): 7./search.sh: line 16: [: too many arguments
Processing Record(s): 8./search.sh: line 16: [: too many arguments
Processing Record(s): 9./search.sh: line 16: [: too many arguments
Processing Record(s): 10./search.sh: line 16: [: too many arguments
Processing Record(s): 11./search.sh: line 16: [: too many arguments
Processing Record(s): 12./search.sh: line 16: [: too many arguments
Processing Record(s): 17
Script Ended: Wed Mar 18 08:18:47 EDT 2015

Files Found: 4
Files NOT Found: 13

Code, showing line 16 at "if":

[code] do
for f in "$path"*
do
if [ -f $f ]
then
FOUND_COUNTER=$((FOUND_COUNTER+1))
echo "$id,$f" >> $1_found.csv;
else
NOTFOUND_COUNTER=$((NOTFOUND_COUNTER+1))
echo "$id,$f" >> $1_notfound.csv
fi
done
[code]

pan64 · 03-18-2015, 08:27 AM

please use [code]here comes your code[/code]
Probably you need to use:

Code:

if [ -f "$f" ]
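
The "too many arguments" message comes from word splitting: without quotes, a name with spaces reaches [ as several arguments instead of one. A minimal sketch of this, using a made-up file name in a temporary directory:

```bash
#!/bin/bash
# Illustration only: the directory and file name are made up.
tmpdir=$(mktemp -d)
f="$tmpdir/BM BOSIS demo file.txt"
touch "$f"

# Unquoted: the path is split at each space, so [ gets more arguments
# than it can parse and fails, even though the file exists.
# The error message is hidden here with 2>/dev/null.
if [ -f $f ] 2>/dev/null; then unquoted=found; else unquoted=notfound; fi

# Quoted: [ receives "-f" and the whole path as a single argument.
if [ -f "$f" ]; then quoted=found; else quoted=notfound; fi

echo "unquoted: $unquoted, quoted: $quoted"
rm -rf "$tmpdir"
```

Because the failed test counts as false, the unquoted version sends an existing file down the else branch, which matches the inflated "NOT Found" count in the output above.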

grail · 03-18-2015, 08:39 PM

I am with pan64 on this one ... as with point 2 of my first post. You can almost never have enough quoting in bash scripts.

aadams · 03-18-2015, 10:57 PM

You guys rock! Thanks so much for the help! It's working as desired now. I added a line in the search script to have sed strip '*' from each line of fileA, then put double quotes and '*' in the loop of the script: for f in "$path"*. The double quotes in the test were also necessary: if [ -f "$f" ]
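
For anyone finding this later, here is a rough sketch of how the corrected loop might look as a whole. It is not the exact script: check_list and its variable names are made up for illustration, and the trailing '*' is stripped with a parameter expansion instead of the sed line mentioned above. The path,id layout of the CSV is the one shown earlier in the thread.

```bash
#!/bin/bash
# Sketch only: check_list and its variable names are hypothetical.

check_list() {
    local list_file="$1" path id f
    local found_log="${list_file}_found.csv"
    local notfound_log="${list_file}_notfound.csv"
    found_count=0
    notfound_count=0
    rm -f -- "$found_log" "$notfound_log"

    while IFS=, read -r path id; do
        # Drop one trailing '*' if present, then add it back outside
        # the quotes so that only the '*' is treated as a glob.
        path=${path%\*}
        for f in "$path"*; do
            if [ -f "$f" ]; then
                found_count=$((found_count + 1))
                echo "$id,$f" >> "$found_log"
            else
                # Also reached when the glob matches nothing: by default
                # bash leaves the pattern as-is and -f fails on it.
                notfound_count=$((notfound_count + 1))
                echo "$id,$f" >> "$notfound_log"
            fi
        done
    done < "$list_file"
}
```

It would be called as check_list fileA.csv, after which found_count and notfound_count hold the totals for the summary lines.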

grail · 03-18-2015, 11:56 PM

Glad we got there in the end

Please remember to mark ticket as SOLVED.