LinuxQuestions.org
Old 12-27-2017, 05:22 PM   #1
azurite
LQ Newbie
 
Registered: May 2016
Posts: 29

Rep: Reputation: Disabled
bash: compare entries in file against a directory and another file - then print based on pattern


Hello everyone,

I am fairly new to bash scripting and I need some advice. Here's my situation.
I have a File 1 with about 130 entries (a single column of IDs, each a 6-digit number followed by a suffix such as -100, as in the sample below). I want to compare this file (and an amended version of its entries) against a directory and its subfolders. Then I want to take the resulting list and compare it against a CSV file, first by column 1 and then by a pattern match in column 3. Please see the steps below; I've included an example for each, as it might make the desired output and my goal easier to understand.

Code:
File 1
100123-100
100456-100
100789-100
200123-100
200456-100
etc
I want to:

1. Compare this with a directory to check if there are subfolders matching the entries from file 1. I also want it to print out a message if the subfolder exists and if the subfolder doesn't exist.
Code:
Desired Output
100123-100 doesn't exist
100456-100 exists
100789-100 exists
200123-100 doesn't exist
200456-100 exists
2. Take the entries from File 1 and add a "_2" to the end of each entry and again compare this against the directory to see if there are subfolders matching the list from File 1. Also print out messages telling me if it exists or not.
Code:
Desired Output
100123-100_2 exists
100456-100_2 exists
100789-100_2 doesn't exist
200123-100_2 exists
200456-100_2 exists
3. Compare the 2 lists from above and print out entries that are present in both files.
I'm flexible on this step if there is a better way to get entries that have both the original numbered file and the _2 numbered file. So for this example the output would be
Code:
Desired Output
100456-100 
100456-100_2
200456-100 
200456-100_2
4. Take the list from step 3 and match this against the 1st column of FileCSV. Print the entries.
Code:
FileCSV
100456_100 
100456_100_2
200123_100_2
200456_100
Then filter those entries and print only those entries that have an X in column 3 of FileCSV. Let's say for this example, the entries below had X's in the 3rd column of FileCSV
Code:
Desired Output
100456-100
200123-100_2
I have tried using comm for steps 1 and 2, and diff for some of the early steps as well, but it's proving hard to keep track of all the intermediate files. I'm hoping there's a better structured, cleaner, or more efficient way to do this. I am very new to bash scripting and don't know much awk, so I have not tried it on my own yet.

If anyone can walk me through this and provide explanations, it would be very helpful in allowing me to learn.

Thanks in advance.

Last edited by azurite; 12-27-2017 at 05:24 PM.
 
Old 12-27-2017, 08:05 PM   #2
keefaz
LQ Guru
 
Registered: Mar 2004
Distribution: Slackware
Posts: 5,681

Rep: Reputation: 507
Maybe just use File 1 as a list of match patterns and extract the lines of the CSV file that match and have an X in the 3rd column.
Then check whether there is a subdirectory whose name matches the 1st column of the extracted lines.

A pattern from File 1 should match either xxxxxx-100 or xxxxxx-100_2 in the CSV file.
 
Old 12-29-2017, 02:22 AM   #3
azurite
LQ Newbie
 
Registered: May 2016
Posts: 29

Original Poster
Rep: Reputation: Disabled
Could you explain how I could do that, please?
Also, is there a way to print lines on a partial string match, such that if I give the string 100234-100 it will match and print both 100234-100 and 100234-100_2?

Last edited by azurite; 12-29-2017 at 02:24 AM.
 
Old 12-29-2017, 07:53 AM   #4
keefaz
LQ Guru
 
Registered: Mar 2004
Distribution: Slackware
Posts: 5,681

Rep: Reputation: 507
Assuming field 1 of the CSV file contains a 6-digit number followed by -100, optionally followed by _2, and field 3 contains the X's:
Code:
#build pattern list file
sed 's/.*/^&/' file1 > patterns

# extract matched line from CSV file, 
# filter output from pattern list,
# test if dir exists, if yes print line
awk -F, '$1 ~ /^[0-9]{6}-100(_2)?$/ && $3 ~ /X/' CSV_file | grep -f patterns | awk -F, 'system("test -d "$1) == 0'
I'm not too good at awk, so I wouldn't be surprised if someone else posts better code.
If I had to choose, I'd write a Perl script and store the patterns in a hash, which should be faster.
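The hash idea can also be tried without leaving awk. Below is a minimal sketch, not a tested replacement for the pipeline above: it assumes the same comma-separated layout (ID in field 1, X flag in field 3), and the function name match_csv and the sample data are made up for illustration. It reads file1 into an awk array, strips a trailing _2 from field 1 of each CSV line, and prints the line when the remaining ID is in the array and field 3 contains X.

```shell
#!/usr/bin/env bash
# Sketch: one awk pass, with the file1 entries stored as keys of an awk array.
# Assumes a comma-separated CSV: ID in field 1, X flag in field 3.
# match_csv is a hypothetical helper name, not something from the thread.
match_csv() {
    # $1 = list of IDs (one per line), $2 = CSV file
    awk -F, '
        NR == FNR { ids[$1]; next }   # first file: remember every ID
        {
            base = $1
            sub(/_2$/, "", base)      # 100456-100_2 -> 100456-100
            if ((base in ids) && $3 ~ /X/) print
        }
    ' "$1" "$2"
}

# Tiny demo with made-up data
tmp=$(mktemp -d)
printf '%s\n' 100456-100 200123-100 > "$tmp/file1"
printf '%s\n' '100456-100,a,X' '100456-100_2,b,' \
              '200123-100_2,c,X' '300000-100,d,X' > "$tmp/CSV_file"
match_csv "$tmp/file1" "$tmp/CSV_file"
rm -rf "$tmp"
```

With the sample data this prints the 100456-100 and 200123-100_2 lines. Because the lookup is an exact match on the ID rather than a regexp prefix, it needs neither the patterns file nor grep -f. It does assume file1 is not empty.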
 
Old 12-29-2017, 04:11 PM   #5
azurite
LQ Newbie
 
Registered: May 2016
Posts: 29

Original Poster
Rep: Reputation: Disabled
Could you walk me through what the syntax does? I sort of have an idea, but I would like to understand it better.
Also, if I don't want to test whether the directory exists and just want to check that those numbers are in the CSV file, I would take out the final | awk -F, 'system("test -d "$1) == 0' part, correct?

I would think there'd be an easier way to do this... alas, I'm still learning all of this.
 
Old 12-29-2017, 06:00 PM   #6
keefaz
LQ Guru
 
Registered: Mar 2004
Distribution: Slackware
Posts: 5,681

Rep: Reputation: 507
Depending on your setup, yes, there are faster ways.
You're correct about removing the last awk to skip the directory test.

The major part of a solution is a well-understood problem.
I admit it's not very efficient as it is. If I had a precise idea of the directory structure, the subdirectory names, the number of directories, real example lines from the CSV file, the exact goal, etc., I would do things differently.

Code:
# add a ^ character in front of each line of file1
# so we can use a more restrictive regexp pattern
# -> search only the lines that start with the pattern in CSV_file
#
# then save in patterns file
sed 's/.*/^&/' file1 > patterns
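To see what the ^ prefix does, and how one entry can match both the plain and the _2 form, here is a small sketch. The sample ID and CSV lines are made up for illustration; only the sed and grep -f commands come from the pipeline.

```shell
#!/usr/bin/env bash
# Sketch: what the patterns file holds and what grep -f does with it.
# The sample ID and CSV lines are invented for this demo.
tmp=$(mktemp -d)
printf '%s\n' 100234-100 > "$tmp/file1"
printf '%s\n' '100234-100,foo,X' \
              '100234-100_2,bar,X' \
              '9100234-100,baz,X' > "$tmp/CSV_file"

# patterns now holds one line: ^100234-100
sed 's/.*/^&/' "$tmp/file1" > "$tmp/patterns"

# Lines that start with 100234-100 match, with or without the _2 suffix.
# 9100234-100 does not match, because ^ anchors the pattern to line start.
matched=$(grep -f "$tmp/patterns" "$tmp/CSV_file")
printf '%s\n' "$matched"
rm -rf "$tmp"
```

This prints the 100234-100 and 100234-100_2 lines only. Note that it is a prefix match, so a hypothetical ID such as 100234-1000 would match as well; the awk test on field 1 in the full pipeline is what rules that out.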
Code:
# use comma as field separator, 
# test if the first field contains a 6-digits number
# followed by -100, followed by _2 (or not)
awk -F, '$1 ~ /^[0-9]{6}-100(_2)?$/
Code:
# test if field 3 contains X
&& $3 ~ /X/' CSV_file
If both tests succeed, awk prints the line to the screen (the implicit default action).

Code:
# test whether the lines output by the previous command
# match a regexp in the patterns file
 | grep -f patterns
Code:
# test if a directory named with first field value exists, 
# if yes print the line (implicit)
 | awk -F, 'system("test -d "$1) == 0'
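For the "exists" / "doesn't exist" messages in steps 1 to 3 of the original question, a plain shell loop may be enough, with no intermediate files. This is only a sketch under assumptions: the function name check_dirs is made up, the directories are assumed to sit directly under one base directory (deeper nesting would need find), and the two reports are interleaved per entry rather than printed as two separate lists.

```shell
#!/usr/bin/env bash
# Sketch for steps 1-3. check_dirs is a hypothetical helper name.
# Usage: check_dirs LIST BASEDIR
# Reports each entry and its _2 variant, then lists the entries
# for which both directories exist directly under BASEDIR.
check_dirs() (
    list=$1
    base=$2
    both=
    while IFS= read -r id; do
        [ -n "$id" ] || continue              # skip blank lines
        for name in "$id" "${id}_2"; do
            if [ -d "$base/$name" ]; then
                echo "$name exists"
            else
                echo "$name doesn't exist"
            fi
        done
        if [ -d "$base/$id" ] && [ -d "$base/${id}_2" ]; then
            both="${both}${id}
${id}_2
"
        fi
    done < "$list"
    echo "Present with and without _2:"
    printf '%s' "$both"
)

# Tiny demo with made-up directories
tmp=$(mktemp -d)
mkdir "$tmp/100456-100" "$tmp/100456-100_2" "$tmp/100789-100"
printf '%s\n' 100456-100 100789-100 > "$tmp/file1"
check_dirs "$tmp/file1" "$tmp"
rm -rf "$tmp"
```

The last part of the output (the entries present in both forms) could then be fed to the CSV filtering step instead of the whole of file1.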

Last edited by keefaz; 12-29-2017 at 06:08 PM.
 
  


All times are GMT -5.
