LinuxQuestions.org

azurite 12-27-2017 04:22 PM

bash: compare entries in file against a directory and another file - then print based on pattern
 
Hello everyone,

I am fairly new to bash scripting and I need some advice. Here's my situation.
I have a File 1 with roughly 130 entries (a single column of IDs, each a 6-digit number followed by -100). I want to compare this file, and an amended version of its entries, against a directory and its subfolders. Then I want to take the resulting list and compare it against a CSV file, first on column 1 and then on a pattern match in column 3. Please see the steps below; I've included the desired output for each one, as that might make my goal easier to understand.

Code:

File 1
100123-100
100456-100
100789-100
200123-100
200456-100
etc

I want to:

1. Compare this with a directory to check whether there are subfolders matching the entries from File 1. I also want it to print a message for each entry saying whether the subfolder exists or not.
Code:

Desired Output
100123-100 doesn't exist
100456-100 exists
100789-100 exists
200123-100 doesn't exist
200456-100 exists

2. Take the entries from File 1 and add a "_2" to the end of each entry and again compare this against the directory to see if there are subfolders matching the list from File 1. Also print out messages telling me if it exists or not.
Code:

Desired Output
100123-100_2 exists
100456-100_2 exists
100789-100_2 doesn't exist
200123-100_2 exists
200456-100_2 exists

3. Compare the 2 lists from above and print out entries that are present in both files.
I'm flexible on this step if there is a better way to get entries that have both the original numbered file and the _2 numbered file. So for this example the output would be
Code:

Desired Output
100456-100
100456-100_2
200456-100
200456-100_2

4. Take the list from step 3 and match this against the 1st column of FileCSV. Print the entries.
Code:

FileCSV
100456_100
100456_100_2
200123_100_2
200456_100

Then filter those entries and print only those entries that have an X in column 3 of FileCSV. Let's say for this example, the entries below had X's in the 3rd column of FileCSV
Code:

Desired Output
100456-100
200123-100_2

I have tried using comm for steps 1 and 2, and diff for some of the early steps as well, but it's proving hard to keep track of all the intermediate files that come out of it. I'm hoping there's a cleaner, better structured and more efficient way to do this. I am very new to bash scripting and don't know much awk, so I have not tried it on my own yet.

If anyone can walk me through this and provide explanations, it would be very helpful in allowing me to learn.

Thanks in advance.

keefaz 12-27-2017 07:05 PM

Maybe just use File 1 as a list of matching patterns and extract the lines of the CSV file that match and have an X in the 3rd column.
Then check whether there is a subdirectory whose name matches the 1st column of the extracted lines.

Each pattern from File 1 should match either xxxxxx-100 or xxxxxx-100_2 in the CSV file.

azurite 12-29-2017 01:22 AM

Could you provide explanation of how I could do that please?
Also, is there a way to print lines based on a partial string match,
so that if I give the string 100234-100 it will match and print both 100234-100 and 100234-100_2?

keefaz 12-29-2017 06:53 AM

Assuming field 1 of the CSV file contains a 6-digit number followed by -100 and optionally by _2, and field 3 contains X's:
Code:

#build pattern list file
sed 's/.*/^&/' file1 > patterns

# extract matched line from CSV file,
# filter output from pattern list,
# test if dir exists, if yes print line
awk -F, '$1 ~ /^[0-9]{6}-100(_2)?$/ && $3 ~ /X/' CSV_file | grep -f patterns | awk -F, 'system("test -d "$1) == 0'

I'm not too good at awk, so I wouldn't be surprised if someone else posts better code.
If I had to choose, I'd write a Perl script and store the patterns in a hash, which should be faster.
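
Since a hash lookup came up, here is a minimal sketch of the same filter done in a single awk call, with the entries of file1 held in an awk array instead of a grep pattern file. It assumes the CSV uses hyphens exactly like file1 and that column 3 holds a literal X. The sample data and the names workdir, wanted and matches are made up for illustration, and the directory test is left out.

```shell
#!/usr/bin/env bash
# Sketch only: the sample data below is invented, not the real files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' 100456-100 200123-100 > file1
printf '%s\n' \
    '100456-100,foo,X' \
    '100456-100_2,foo,' \
    '200123-100_2,bar,X' \
    '300999-100,baz,X' > CSV_file

# First file (NR == FNR): remember every entry of file1 as an array key.
# Second file: strip an optional _2 suffix from column 1, then print
# column 1 if the base name is a known key and column 3 contains X.
matches=$(awk -F, '
    NR == FNR { wanted[$1]; next }
    {
        base = $1
        sub(/_2$/, "", base)
        if (base in wanted && $3 ~ /X/) print $1
    }
' file1 CSV_file)

printf '%s\n' "$matches"
# prints:
# 100456-100
# 200123-100_2
```

The array lookup is an exact match on the base name, so unlike the ^ prefix patterns it will not pick up longer IDs that merely start with an entry from file1.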

azurite 12-29-2017 03:11 PM

Could you walk me through what the syntax does? I sort of have an idea but I would like to understand it better.
Also, if I don't want to test whether the directory exists and just want to check whether those numbers are in the csv file, I would take out the final awk -F, 'system("test -d "$1) == 0' part, correct?

I would think there'd be an easier way to do this. ...alas I'm still learning all of this.

keefaz 12-29-2017 05:00 PM

Depending on your setup, yes there are faster ways.
You're correct: remove the last awk to skip the directory test.

The major part of a solution is a well understood problem.
I admit it's not very efficient as it is. If I had a precise idea of the directory structure, the subdirectory names, the number of directories, real example lines from the CSV file, the exact goal, etc., I would do things differently.

Code:

# add a ^ character in front of each line of file1
# so we can use a more restrictive regexp pattern
# -> search only the lines that start with the pattern in CSV_file
#
# then save in patterns file
sed 's/.*/^&/' file1 > patterns
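
To see what that sed call produces, here is a small sketch run on a made-up two-line stand-in for file1. The variable name patterns is only for illustration.

```shell
# Invented sample input standing in for file1.
# In the replacement, & is the whole matched line and ^ is a literal caret.
patterns=$(printf '%s\n' 100456-100 200456-100 | sed 's/.*/^&/')

printf '%s\n' "$patterns"
# prints:
# ^100456-100
# ^200456-100
```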

Code:

# use comma as field separator,
# test if the first field contains a 6-digit number
# followed by -100, followed by _2 (or not)
awk -F, '$1 ~ /^[0-9]{6}-100(_2)?$/

Code:

# test if field 3 contains X
&& $3 ~ /X/' CSV_file

If both tests succeed, awk prints the line to the screen (the implicit default action).
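
A quick way to check the two tests is to feed awk a few made-up rows, as in the sketch below. The rows are invented. The six digits are spelled out as [0-9] six times because some awk builds (older mawk, or old gawk without --re-interval) do not understand the {6} form; with a recent gawk the original pattern should behave the same.

```shell
# Invented sample rows; the real CSV layout may differ.
filtered=$(printf '%s\n' \
    '100456-100,foo,X' \
    '100456-100_2,foo,' \
    '200123-100_2,bar,X' \
    '20012-100,bar,X' |
  awk -F, '$1 ~ /^[0-9][0-9][0-9][0-9][0-9][0-9]-100(_2)?$/ && $3 ~ /X/')

printf '%s\n' "$filtered"
# prints:
# 100456-100,foo,X
# 200123-100_2,bar,X
```

The second row is dropped because column 3 has no X, and the last one because its number has only five digits.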

Code:

# test if the lines output by the previous command
# match a regexp in the patterns file
 | grep -f patterns
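
This grep step is also what gives the partial match asked about earlier: a pattern anchored with ^ matches every line that starts with it. A minimal sketch with invented data (the workdir path and the three sample lines are assumptions):

```shell
# Invented data: one pattern, three candidate lines.
workdir=$(mktemp -d)
printf '%s\n' '^100234-100' > "$workdir/patterns"

partial=$(printf '%s\n' \
    '100234-100,a,X' \
    '100234-100_2,b,X' \
    '900234-100,c,X' |
  grep -f "$workdir/patterns")

printf '%s\n' "$partial"
# prints:
# 100234-100,a,X
# 100234-100_2,b,X
```

Note that the same pattern would also match something like 100234-1005, which is why the awk test on field 1 runs first.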

Code:

# test if a directory named with first field value exists,
# if yes print the line (implicit)
 | awk -F, 'system("test -d "$1) == 0'
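
The directory test can also be done in plain bash, which covers the exists / doesn't exist messages and the "both present" list from the first three steps of the original question. This is only a sketch under assumptions: the directory tree and file1 are invented, topdir, list and both are made-up names, and only direct children of topdir are checked (nested subfolders would need find or similar).

```shell
#!/usr/bin/env bash
# Sketch only: the directory tree and file1 below are invented test data.
topdir=$(mktemp -d)
mkdir "$topdir/100456-100" "$topdir/100456-100_2" "$topdir/100123-100_2"
list=$topdir/file1
printf '%s\n' 100123-100 100456-100 > "$list"

both=""
while IFS= read -r name; do
    [ -n "$name" ] || continue      # skip blank lines
    # Steps 1 and 2: report on the plain name and on the _2 variant.
    for dir in "$name" "${name}_2"; do
        if [ -d "$topdir/$dir" ]; then
            echo "$dir exists"
        else
            echo "$dir doesn't exist"
        fi
    done
    # Step 3: keep the pair only when both directories are present.
    if [ -d "$topdir/$name" ] && [ -d "$topdir/${name}_2" ]; then
        both+="$name"$'\n'"${name}_2"$'\n'
    fi
done < "$list"

printf '%s' "$both"
# prints:
# 100456-100
# 100456-100_2
```

The messages come out interleaved per entry rather than as two separate lists, and the quoted "$topdir/$dir" keeps the test safe for names containing spaces, which the unquoted system("test -d "$1) call is not.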


