LinuxQuestions.org - bash: compare entries in file against a directory and another file


bash: compare entries in file against a directory and another file - then print based on pattern

Hello everyone,

I am fairly new to bash scripting and I need some advice. Here's my situation.
I have a File 1 with roughly 130 entries (a single column of 6-digit numbers). I want to compare this file (and an amended version of its entries) against a directory and its subfolders. Then I want to take the resulting list and compare it against a CSV file, first on column 1 and then on a pattern match in column 3. Please see the steps below; I've included an explanation for each, as that should make the desired output and my goal easier to understand.

Code:

I want to:

1. Compare this with a directory to check if there are subfolders matching the entries from file 1. I also want it to print out a message if the subfolder exists and if the subfolder doesn't exist.

Code:

Desired Output

100123-100 doesn't exist

100456-100 exists

100789-100 exists

200123-100 doesn't exist

200456-100 exists

2. Take the entries from File 1 and add a "_2" to the end of each entry and again compare this against the directory to see if there are subfolders matching the list from File 1. Also print out messages telling me if it exists or not.

Code:

Desired Output

100123-100_2 exists

100456-100_2 exists

100789-100_2 doesn't exist

200123-100_2 exists

200456-100_2 exists

3. Compare the 2 lists from above and print out entries that are present in both files.
I'm flexible on this step if there is a better way to get entries that have both the original numbered file and the _2 numbered file. So for this example the output would be

Code:

Desired Output

100456-100 

100456-100_2

200456-100 

200456-100_2

4. Take the list from step 3 and match this against the 1st column of FileCSV. Print the entries.

Code:

FileCSV

100456_100 

100456_100_2

200123_100_2

200456_100

Then filter those entries and print only those entries that have an X in column 3 of FileCSV. Let's say for this example, the entries below had X's in the 3rd column of FileCSV

Code:

Desired Output

100456-100

200123-100_2
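Steps 1 to 3 above can be sketched with nothing more than a loop and the -d test. This is only a rough illustration under some assumptions: File 1 holds one entry per line in the form shown in the desired output (for example 100456-100), the subfolders sit directly under a single directory, and the names work, file1 and data are made up for the demo. If the folders are nested deeper, something like find would be needed instead.

```shell
# Hypothetical sample data in a scratch directory (all names are assumptions).
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
printf '%s\n' 100123-100 100456-100 100789-100 > "$work/file1"
mkdir -p "$work/data/100456-100" "$work/data/100456-100_2" \
         "$work/data/100789-100" "$work/data/100123-100_2"

# check SUFFIX: for every entry in file1, report whether data/ENTRY+SUFFIX exists.
check() {
    while IFS= read -r id; do
        [ -n "$id" ] || continue            # skip blank lines
        if [ -d "$work/data/$id$1" ]; then
            echo "$id$1 exists"
        else
            echo "$id$1 doesn't exist"
        fi
    done < "$work/file1"
}

check ""      # step 1: the entries as they are
check "_2"    # step 2: the same entries with _2 appended

# Step 3: keep only entries where both the plain and the _2 folder exist.
while IFS= read -r id; do
    [ -n "$id" ] || continue
    if [ -d "$work/data/$id" ] && [ -d "$work/data/${id}_2" ]; then
        printf '%s\n' "$id" "${id}_2"
    fi
done < "$work/file1" > "$work/both"
cat "$work/both"
```

Doing the two tests in one pass avoids the intermediate files that comm and diff would need.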

I have tried a combination of comm for steps 1 and 2 and diff for some of the early steps as well, but it is proving hard to keep track of all the intermediate files. I'm hoping there is a better-structured, cleaner, more efficient way to do this. I am very new to bash scripting and know little awk, so I have not tried it on my own yet.

If anyone can walk me through this and provide explanations, it would be very helpful in allowing me to learn.

Thanks in advance.

Maybe just use File 1 as a list of matching patterns and extract the lines of the CSV file that match and have an X in the 3rd column.
Then check whether there is a subdirectory whose name matches the 1st column of the extracted lines.

A pattern from File 1 should match either xxxxxx-100 or xxxxxx-100_2 in the CSV file.

Quote:

Originally Posted by keefaz (Post 5798587)

Could you explain how I could do that, please?
Also, is there a way to print lines on a partial string match, such that if I give the string 100234-100 it matches and prints both 100234-100 and 100234-100_2?
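One way to get that kind of partial match is grep with the string anchored to the start of the line. The sketch below uses a made-up sample.csv (the file name and its contents are assumptions). Note that a bare prefix also catches longer IDs such as 100234-1000, so the second command allows only an optional _2 before the first comma.

```shell
# Hypothetical sample file; names and contents are assumptions for the demo.
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
printf '%s\n' '100234-100,a,X' '100234-100_2,b,' \
              '100234-1000,c,X' '200999-100,d,X' > "$work/sample.csv"

# Loose: every line that merely starts with the string (3 lines here,
# including the unwanted 100234-1000).
prefix_matches=$(grep '^100234-100' "$work/sample.csv")

# Strict: the string, an optional _2, then the comma that ends field 1.
strict_matches=$(grep -E '^100234-100(_2)?,' "$work/sample.csv")

printf 'loose:\n%s\nstrict:\n%s\n' "$prefix_matches" "$strict_matches"
```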

Assuming field 1 of the CSV file contains a 6-digit number followed by -100 and optionally by _2, and field 3 contains X's:

Code:

# build pattern list file
sed 's/.*/^&/' file1 > patterns

# extract matched lines from CSV file,
# filter output against the pattern list,
# test if dir exists, if yes print line
awk -F, '$1 ~ /^[0-9]{6}-100(_2)?$/ && $3 ~ /X/' CSV_file | grep -f patterns | awk -F, 'system("test -d "$1) == 0'

I'm not too good at awk; I wouldn't be surprised if someone else posts better code.
If I had to choose, I'd write a Perl script and store the patterns in a hash, which should be faster.
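The hash idea does not strictly need Perl, since awk arrays are hashes too. The following is a rough sketch of that variant, not a tested replacement for the pipeline above. It assumes File 1 entries look like 100456-100, and the file names work, file1 and sample.csv are invented for the demo.

```shell
# Hypothetical sample data; names and contents are assumptions.
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
printf '%s\n' 100456-100 200123-100 > "$work/file1"
printf '%s\n' '100456-100,foo,X' '100456-100_2,foo,' \
              '200123-100_2,bar,X' '300000-100,baz,X' > "$work/sample.csv"

result=$(awk -F, '
    NR == FNR { want[$0]; next }    # first file: store every entry as a hash key
    {
        base = $1
        sub(/_2$/, "", base)        # 100456-100_2 -> 100456-100
    }
    (base in want) && $3 ~ /X/      # print CSV lines whose base ID is listed
' "$work/file1" "$work/sample.csv")

printf '%s\n' "$result"
```

Each CSV line costs one hash lookup, so the run time should not grow with the number of patterns the way a long grep pattern list can.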

Could you walk me through what the syntax does? I sort of have an idea, but I would like to understand it better.
Also, if I don't want to test whether the directory exists and just want to check that those numbers are in the CSV file, I would take out the final | awk -F, 'system("test -d "$1) == 0' part, correct?

I would think there'd be an easier way to do this. ...alas I'm still learning all of this.

Depending on your setup, yes there are faster ways.
You're correct on removing the last awk for skipping the directory test.
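For illustration, this is roughly what the shortened pipeline looks like without the directory test. The sample files are invented, the cut at the end (printing only the ID column) is an addition of this sketch, and the six digits are spelled out as six bracket expressions only because some older awk versions do not understand {6}.

```shell
# Hypothetical sample data; names and contents are assumptions.
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
printf '%s\n' 100456-100 200123-100 > "$work/file1"
printf '%s\n' '100456-100,foo,X' '100456-100_2,foo,' \
              '200123-100_2,bar,X' '300000-100,baz,X' > "$work/sample.csv"

# Build the anchored pattern list from file1.
sed 's/.*/^&/' "$work/file1" > "$work/patterns"

# Filter on field 1 and field 3, keep lines matching a pattern, print the IDs.
ids=$(awk -F, '$1 ~ /^[0-9][0-9][0-9][0-9][0-9][0-9]-100(_2)?$/ && $3 ~ /X/' "$work/sample.csv" |
      grep -f "$work/patterns" |
      cut -d, -f1)

printf '%s\n' "$ids"
```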

The major part of a solution is a well understood problem.
I admit it's not very efficient as it is. If I had a precise idea of the directory structure, the subdirectory names, the number of directories, real example lines from the CSV file, the exact goal, etc., I would do things differently.

Code:

# add a ^ character in front of each line of file1
# so we can use a more restrictive regexp pattern
# -> search only the lines that start with the pattern in CSV_file
#
# then save in patterns file
sed 's/.*/^&/' file1 > patterns
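To see what the sed step produces, here is a small demo on invented input. One thing to watch: a blank line in file1 would turn into a lone ^, which matches every line, so this sketch deletes blank lines first. That extra -e '/^$/d' is an addition of the sketch and is not part of the command above.

```shell
# Hypothetical file1 with a stray blank line in the middle.
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
printf '100456-100\n\n200123-100\n' > "$work/file1"

# Drop blank lines, then put ^ in front of what is left (& is the whole match).
sed -e '/^$/d' -e 's/.*/^&/' "$work/file1" > "$work/patterns"

cat "$work/patterns"
```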

Code:

# use comma as the field separator,
# test whether the first field is a 6-digit number
# followed by -100, optionally followed by _2
awk -F, '$1 ~ /^[0-9]{6}-100(_2)?$/

Code:

# test if field 3 contains X
&& $3 ~ /X/' CSV_file

If both tests succeed, awk prints the line to the screen (this is the implicit default action when a pattern has no action block).
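A quick way to check which lines pass both tests is to run the awk part alone on a few invented lines. The sample content is an assumption, and the digits are written as six [0-9] groups in case the local awk lacks {6} support.

```shell
# Hypothetical CSV lines covering the passing and failing cases.
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
printf '%s\n' '100456-100,foo,X' \
              '100456-100_2,foo,' \
              '20012-100,bar,X' \
              '200123-100_2,bar,X' \
              '200456-100_3,baz,X' > "$work/sample.csv"

# Pattern only, no action block: awk prints each line for which it is true.
passed=$(awk -F, '$1 ~ /^[0-9][0-9][0-9][0-9][0-9][0-9]-100(_2)?$/ && $3 ~ /X/' "$work/sample.csv")

printf '%s\n' "$passed"
```

Line 2 is dropped because field 3 is empty, line 3 because it has only five digits, and line 5 because _3 is not an allowed suffix.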

Code:

# test if the lines output by the previous command
# match against a regexp in the patterns file
 | grep -f patterns
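As a small illustration of the grep -f step, a single anchored pattern picks up both the plain and the _2 form of an ID, because each pattern only has to match the start of the line. The input lines and file names are invented.

```shell
# Hypothetical pattern file with one entry, as the sed step would write it.
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
printf '%s\n' '^100456-100' > "$work/patterns"

# -f reads one regular expression per line; a line is kept if any of them matches.
kept=$(printf '%s\n' '100456-100,foo,X' '100456-100_2,bar,X' '200123-100_2,bar,X' |
       grep -f "$work/patterns")

printf '%s\n' "$kept"
```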

Code:

# test if a directory named with first field value exists,
# if yes print the line (implicit)
 | awk -F, 'system("test -d "$1) == 0'
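Since system() starts a new shell for every line, the same test could also be done in a plain shell loop. This is just a sketch of that alternative: the data directory, the file filtered.csv (standing in for the output of the earlier stages) and the variable names are all assumptions.

```shell
# Hypothetical setup: one matching directory, two already-filtered CSV lines.
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
mkdir -p "$work/data/100456-100"
printf '%s\n' '100456-100,foo,X' '200123-100_2,bar,X' > "$work/filtered.csv"

# Split each line at the first comma; keep the line if data/<field 1> is a directory.
existing=$(
    while IFS=, read -r id rest; do
        if [ -d "$work/data/$id" ]; then
            printf '%s,%s\n' "$id" "$rest"
        fi
    done < "$work/filtered.csv"
)

printf '%s\n' "$existing"
```

Quoting "$work/data/$id" also keeps the test safe if a name ever contains spaces, which the unquoted $1 inside system() would not handle.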