[SOLVED] Bash Script Help Request

qombi · 02-20-2016, 08:53 PM

I am new at writing bash scripts. Would anyone please assist me in writing one? I would like to compare two files and determine if lines are not present in one of them but are in the other. The lines are not sequential and I would like to ignore the first word in each line.

Example:

FileA:

Code:

created test1
created test2
created test3
created test4

FileB:

Code:

same test1
same test2
created test5

If I execute command:

Code:

grep -Fvf FileB FileA

Since the first word in each line is different, the command outputs every line of FileA:

Code:

created test1
created test2
created test3
created test4

The results I would like would be:

Code:

created test3
created test4

Any way to ignore the first word in each line in the file being compared to, in this example FileA?

Thanks!
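One possible way to get there while staying with grep, sketched on the sample data above: strip the first word from FileB with cut and use what remains as the pattern list. The scratch directory and the `FileB.keys` file name are made up for illustration. Note that `-F` matches substrings, so a pattern `test1` would also hide a line ending in `test10`; that is harmless for this sample but may not be for real data.

```shell
# Recreate the two sample files from the post in a scratch directory.
tmp=$(mktemp -d)
printf 'created test1\ncreated test2\ncreated test3\ncreated test4\n' > "$tmp/FileA"
printf 'same test1\nsame test2\ncreated test5\n' > "$tmp/FileB"

# Drop the first space-separated word of every FileB line; the rest becomes a pattern.
cut -d' ' -f2- "$tmp/FileB" > "$tmp/FileB.keys"

# -F: fixed strings, -v: print non-matching lines, -f: read patterns from a file.
result=$(grep -Fvf "$tmp/FileB.keys" "$tmp/FileA")
printf '%s\n' "$result"

rm -rf "$tmp"
```

On the sample files this prints the two lines asked for, `created test3` and `created test4`.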

hydrurga · 02-20-2016, 09:35 PM

A quick search on the interwebs for "linux diff specific column" came up with this answered question:

http://unix.stackexchange.com/questi...lumn-in-a-file

With a little tweaking, the top answer solves your query, but I'll let you work out what that tweaking is. ;-)

Habitual · 02-20-2016, 09:50 PM

https://duckduckgo.com/?q=files+diff...xquestions.org

Ask if you have issues applying any solution from that search.

qombi · 02-21-2016, 10:10 AM

Quote:

Originally Posted by hydrurga

A quick search on the interwebs for "linux diff specific column" came up with this answered question:

http://unix.stackexchange.com/questi...lumn-in-a-file

With a little tweaking, the top answer solves your query, but I'll let you work out what that tweaking is. ;-)

Thank you, this script looks like it may work for what I need. Please forgive me for saying so, but I do not understand how it works, even after reading the man page for awk and the example given in the link. I am definitely not a programmer. I assume I would want to exclude the first column from my original example and compare all other columns. Also, the lines are not necessarily sequential. I would like to find any lines that match any line in the other document. How would you do that? I did not come straight to these forums without Google searches, I promise! I didn't even originally know what an intelligent search would be for the task at hand!

Also the columns in the text file I want to compare have a space before them, will that affect what is considered a column? Thanks for the help.

Code:

$ awk 'NR==FNR{c[$2]++;next};c[$2] == 0' filea fileb

This looks like it worked for my original example, but not on the two real files I want to compare. Does anyone know of a "for dummies" website that explains what each part of the command does? I would like to understand it, but it is over my head.

hydrurga · 02-21-2016, 12:36 PM

A search for "awk comparing two files" came up with this interesting YouTube video looking at how to use FNR and NR:

https://www.youtube.com/watch?v=hnT4WTz9dR8

Given that your real-life data isn't the same as the example you gave, perhaps you should show some sample real-life data, anonymising anything sensitive?

MadeInGermany · 02-21-2016, 12:39 PM

The columns are (white-)space separated; there MUST be a space.
The first given link tries to explain the code.
It also says to run it twice, once with the files in each order.

Code:

awk 'NR==FNR {c[$2]++; next}; !($2 in c)' filea fileb
awk 'NR==FNR {c[$2]++; next}; !($2 in c)' fileb filea

I have corrected the lookup in the array c a bit.
The lookup is an implicit if clause, and no { action } means an implicit { print }.
--
Please give an example of input that does not work!
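For anyone who wants the one-liner unpacked, here is a sketch of the same program spread over several lines with comments. The array name `seen` and the scratch files are illustrative choices, not from the thread; the data is the simple FileA/FileB example from the first post.

```shell
tmp=$(mktemp -d)
printf 'created test1\ncreated test2\ncreated test3\ncreated test4\n' > "$tmp/FileA"
printf 'same test1\nsame test2\ncreated test5\n' > "$tmp/FileB"

result=$(awk '
    # NR counts lines across all input files; FNR restarts at 1 for each file.
    # NR == FNR is therefore true only while the first file (FileB) is read.
    NR == FNR {
        seen[$2] = 1    # remember the second field of every FileB line
        next            # skip the rule below for FileB lines
    }
    # Reached only for the second file (FileA). A pattern with no action
    # block prints the line whenever the pattern is true.
    !($2 in seen)
' "$tmp/FileB" "$tmp/FileA")

printf '%s\n' "$result"
rm -rf "$tmp"
```

With these inputs the output is `created test3` and `created test4`. One caveat worth knowing: if the first file is empty, `NR == FNR` stays true for the second file as well, so nothing is printed.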

hydrurga · 02-21-2016, 12:48 PM

Quote:

Originally Posted by MadeInGermany

The columns are (white-)space separated; there MUST be a space.
The first given link tries to explain the code.
It also says to run it twice, once with the files in each order.

Code:

awk 'NR==FNR {c[$2]++; next}; !($2 in c)' filea fileb
awk 'NR==FNR {c[$2]++; next}; !($2 in c)' fileb filea

I have corrected the lookup in the array c a bit.
The lookup is an implicit if clause, and no { action } means an implicit { print }.
--
Please give an example of input that does not work!

In the OP's original example, the output implied that the only comparison required was columns in FileA that weren't in FileB. Therefore, the code only needed to be run once, with the parameters in the order FileB FileA.

However, given that there is added complexity to the OP's real-life data which wasn't reflected in the example he/she gave, the real solution will probably prove to be slightly more complex itself.
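To make the point about argument order concrete, here is a small sketch on the sample data from the first post. The variable names `only_in_a` and `only_in_b` are mine, chosen for illustration.

```shell
tmp=$(mktemp -d)
printf 'created test1\ncreated test2\ncreated test3\ncreated test4\n' > "$tmp/FileA"
printf 'same test1\nsame test2\ncreated test5\n' > "$tmp/FileB"

# FileB first: prints FileA lines whose second field never occurs in FileB.
only_in_a=$(awk 'NR==FNR {c[$2]++; next} !($2 in c)' "$tmp/FileB" "$tmp/FileA")

# FileA first: the question is reversed, so FileB lines are printed instead.
only_in_b=$(awk 'NR==FNR {c[$2]++; next} !($2 in c)' "$tmp/FileA" "$tmp/FileB")

printf 'only in FileA:\n%s\n' "$only_in_a"
printf 'only in FileB:\n%s\n' "$only_in_b"
rm -rf "$tmp"
```

The first run reports `created test3` and `created test4`; the second reports `created test5`. The file named last is always the one whose lines get printed.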

qombi · 02-22-2016, 06:15 PM

I really appreciate your patience with me. I have pasted below the actual sample of data from files being compared. I should have done so to begin with, I apologize! I have posted the results below after executing the awk command.

FILEA

Code:

  create d 775   1000/63    11030238 share/
  create   775   1000/63    11030238 share/taxes2008 - jan 03.pdf
  create   775   1000/63    11030238 share/taxes2009 - jan 10.pdf
  create   775   1000/63    11030134 share/taxes2010 - feb 01.pdf
  create   775   1000/63          23 share/recipes.txt
  create   775   1000/63    11030238 share/taxes2011 - jan 12.pdf
  create   775   1000/63    11030134 share/taxes2012 - jan 04.pdf
  create   775   1000/63    11030238 share/taxes2013 - jan 12.pdf
  create   775   1000/63    11028566 share/taxes2014 - feb 23.pdf
  create   775   1000/63    10754698 share/taxes2015 - jan 02.pdf

FILEB

Code:

  create d2775 1000/1016        4096 share/limited/docs
  create   775 1000/1016      953978 share/limited/docs/HomeInspection12710.mht
  create   775 1000/1016       90869 share/limited/docs/PassportApplicationComplete.pdf
  create   775 1000/1016       60416 share/limited/docs/Rec Letter.doc
  create   775 1000/1016      128525 share/limited/docs/Structural_Eng_Report.pdf
  create d 775   1000/63    11030238 share/
  same     775   1000/63    11030238 share/taxes2008 - jan 03.pdf
  same     775   1000/63    11030238 share/taxes2009 - jan 10.pdf
  same     775   1000/63    11030134 share/taxes2010 - feb 01.pdf
  same     775   1000/63    11030238 share/taxes2011 - jan 12.pdf
  same     775   1000/63    11030134 share/taxes2012 - jan 04.pdf
  same     775   1000/63    11030238 share/taxes2013 - jan 12.pdf
  same     775   1000/63    11028566 share/taxes2014 - feb 23.pdf
  same     775   1000/63    10754698 share/taxes2015 - jan 02.pdf

By comparing FILEB to FILEA, I would like to learn that the line below no longer exists. I wish to ignore the first column so that lines which merely changed from create to same are not displayed. Also notice that the lines do not appear in the same positions in both files.

Code:

create   775   1000/63          23 share/recipes.txt

Code:

awk 'NR==FNR{c[$2]++;next};c[$2] == 0' fileb filea

awk 'NR==FNR{c[$2]++;next};c[$2] == 0' filea fileb
  create d2775 1000/1016        4096 share/limited/docs

This worked in my simple previous example but not with the actual data. I am watching the YouTube video from the link provided for awk; it's interesting indeed. Thanks again to everyone who has posted so far.

hydrurga · 02-22-2016, 06:44 PM

Before everything else, qombi, the "d" and "d2" could prove difficult. They appear to be part of a column that doesn't always have data, thus meaning for example that in

Code:

  create d 775   1000/63    11030238 share/
  create   775   1000/63    11030238 share/taxes2008 - jan 03.pdf

the 775 in the first line is in the third column, whereas the 775 in the second line is in the second column, unless the field separators are actually tabs which we can't see.

More than that, the first line in FileB appears to have no space between the d2 and the 775.

The first problem with data comparison like this is often to clean the data up and ensure that it is consistent.

We can get around the variable columns issue if the 775 always starts on the same character position on each line - is this the case?

Or, if the only thing that changes between a corresponding line for the same file in FileA and FileB is "create" and "same", then we can carry out a sed before doing the comparison.

Is the filename always unique and the rest of the line static (apart from the create/same), so that we can just compare the filenames and forget the rest? In other words, how much do we really need to check from corresponding lines to be sure that they are the same or that one doesn't exist?
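The shifting-column problem can be seen directly by asking awk what it thinks the fields are. This is only a diagnostic sketch on the two lines quoted above; the scratch file name is arbitrary.

```shell
tmp=$(mktemp -d)
cat > "$tmp/sample" <<'EOF'
  create d 775   1000/63    11030238 share/
  create   775   1000/63    11030238 share/taxes2008 - jan 03.pdf
EOF

# Print the number of whitespace-separated fields and the second field of each line.
result=$(awk '{ print NF " fields, second field: " $2 }' "$tmp/sample")
printf '%s\n' "$result"
rm -rf "$tmp"
```

This prints `6 fields, second field: d` for the directory line and `8 fields, second field: 775` for the file line. So the `$2` key was "d", "775" or "d2775" in the real logs, which would explain why only the `d2775` line was reported. It also shows that a file name containing spaces is split into several fields.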

qombi · 02-22-2016, 09:40 PM

Quote:

Originally Posted by hydrurga

Before everything else, qombi, the "d" and "d2" could prove difficult. They appear to be part of a column that doesn't always have data, thus meaning for example that in

Code:

  create d 775   1000/63    11030238 share/
  create   775   1000/63    11030238 share/taxes2008 - jan 03.pdf

the 775 in the first line is in the third column, whereas the 775 in the second line is in the second column, unless the field separators are actually tabs which we can't see.

More than that, the first line in FileB appears to have no space between the d2 and the 775.

The first problem with data comparison like this is often to clean the data up and ensure that it is consistent.

Quote:

We can get around the variable columns issue if the 775 always starts on the same character position on each line - is this the case?

The 775 always starts in the same character position; however, the numbers could be different. The 775 is the file's permissions. If the original value no longer existed, I would want that line reported as missing. The only column whose differences I do not want to know about is the first one.

To provide some background information, these are logs generated by backup software. It recreates the directory structure during each backup. Any files that are identical will be listed as "same" or "pool". Any new files will be listed as "create", and directories will always be listed as "create d". So when a new file is added, it is listed as "create" in log file filea; in the next backup, if that same file still exists, it will be listed as "pool" or "same". I do not wish to know about that change, only about changes in the following columns.

Quote:

Or, if the only thing that changes between a corresponding line for the same file in FileA and FileB is "create" and "same", then we can carry out a sed before doing the comparison.

If any of the data in the other columns changed, I would like the original line to be reported as no longer existing, so that I am aware of the change. The only values that can appear in the first column are create, pool and same, which I would like to ignore if the rest of the line is equal.

Quote:

Is the filename always unique and the rest of the line static (apart from the create/same), so that we can just compare the filenames and forget the rest? In other words, how much do we really need to check from corresponding lines to be sure that they are the same or that one doesn't exist?

Other fields can change; or rather, if they do change, I would like to know that the previous line is missing. I hope that makes sense. Here is an example:

I would like to compare fileb to filea and learn if any bolded information below has changed.

filea

Code:

  
create d 775   1000/63    11030238 share/
create   775   1000/63    11030238 share/taxes2008 - jan 03.pdf
create   775   1000/63    11030238 share/taxes2009 - jan 03.pdf

fileb

Code:

  
create d 664   1000/63    11030238 share/
same     775   1000/63    11030238 share/taxes2008 - jan 03.pdf
create   775   1000/63    99999999 share/taxes2009 - jan 03.pdf

In this example, the comparison finds the lines with changes we care about, while ignoring changes in the first column, which we do not care about. The desired output is the lines from filea that are no longer present in fileb:

Code:

create d 775   1000/63    11030238 share/
create   775   1000/63    11030238 share/taxes2009 - jan 03.pdf

I hope that explains the intent and a little more about the situation.
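One way the stated intent could be expressed, as a sketch rather than a finished solution: build a comparison key by removing the leading status word (create, same or pool, as listed above) and the spaces around it, then report filea lines whose key never occurs in fileb. The `key` and `seen` names are illustrative, and the data is the filea/fileb example just given.

```shell
tmp=$(mktemp -d)
cat > "$tmp/filea" <<'EOF'
create d 775   1000/63    11030238 share/
create   775   1000/63    11030238 share/taxes2008 - jan 03.pdf
create   775   1000/63    11030238 share/taxes2009 - jan 03.pdf
EOF
cat > "$tmp/fileb" <<'EOF'
create d 664   1000/63    11030238 share/
same     775   1000/63    11030238 share/taxes2008 - jan 03.pdf
create   775   1000/63    99999999 share/taxes2009 - jan 03.pdf
EOF

result=$(awk '
    {
        # Comparison key: the line minus its leading status word and padding.
        key = $0
        sub(/^ *(create|same|pool) */, "", key)
    }
    NR == FNR { seen[key] = 1; next }   # first file (fileb): record every key
    !(key in seen)                      # second file (filea): print unmatched lines
' "$tmp/fileb" "$tmp/filea")

printf '%s\n' "$result"
rm -rf "$tmp"
```

On this example it prints the `share/` directory line and the `taxes2009` line from filea, which matches the desired output above. It assumes the status word is always one of those three values.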

pan64 · 02-23-2016, 12:16 AM

So you don't care about the first 9 characters. In that case you need to remove them, because that awk script compares whitespace-separated fields and cannot really cope with the d, which is sometimes present and sometimes missing.
As a simple solution you can use the cut command:

Code:

cut -c 10- input_file > output_file

and run that awk script on those files. If that works for you, you can construct a shell script to put all those things together, or add that functionality to the awk script.
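Put together, the cut-then-compare idea might look like the sketch below. It uses an abbreviated copy of the FILEA/FILEB samples posted earlier (a few lines from each, spacing preserved). The `.cut` file names are invented, and the awk key is changed from `$2` to `$0`, which is my adaptation: once the status column is gone, the whole trimmed line is what needs comparing.

```shell
tmp=$(mktemp -d)
cat > "$tmp/filea" <<'EOF'
  create d 775   1000/63    11030238 share/
  create   775   1000/63    11030238 share/taxes2008 - jan 03.pdf
  create   775   1000/63          23 share/recipes.txt
  create   775   1000/63    10754698 share/taxes2015 - jan 02.pdf
EOF
cat > "$tmp/fileb" <<'EOF'
  create d2775 1000/1016        4096 share/limited/docs
  create d 775   1000/63    11030238 share/
  same     775   1000/63    11030238 share/taxes2008 - jan 03.pdf
  same     775   1000/63    10754698 share/taxes2015 - jan 02.pdf
EOF

# Drop the first 9 characters (the status column) from both logs.
cut -c 10- "$tmp/filea" > "$tmp/filea.cut"
cut -c 10- "$tmp/fileb" > "$tmp/fileb.cut"

# Whole trimmed lines are the keys now, so index on $0 instead of $2.
result=$(awk 'NR==FNR {c[$0]++; next} !($0 in c)' "$tmp/fileb.cut" "$tmp/filea.cut")
printf '%s\n' "$result"
rm -rf "$tmp"
```

The only line reported is the trimmed `share/recipes.txt` entry. The trade-off of this approach is that the output no longer carries the status column, since it was cut away before the comparison.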

hydrurga · 02-23-2016, 04:37 AM

Given that the file permissions start at the same character position (offset 10) each time, and you want to compare everything else in the line starting at that point, the original awk can be modified to work on a substring of each line:

Code:

awk 'NR==FNR{c[substr($0,10)]++;next};c[substr($0,10)] == 0' fileb filea

Let us know how it gets on.
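For readers following along, here is the same idea written out with comments and run on an abbreviated copy of the FILEA/FILEB samples from earlier in the thread. It is a sketch: the `key` variable is my addition, and the membership test uses `in` (as MadeInGermany suggested) rather than `== 0`, which should give the same result here.

```shell
tmp=$(mktemp -d)
cat > "$tmp/filea" <<'EOF'
  create d 775   1000/63    11030238 share/
  create   775   1000/63    11030238 share/taxes2008 - jan 03.pdf
  create   775   1000/63          23 share/recipes.txt
  create   775   1000/63    10754698 share/taxes2015 - jan 02.pdf
EOF
cat > "$tmp/fileb" <<'EOF'
  create d2775 1000/1016        4096 share/limited/docs
  create d 775   1000/63    11030238 share/
  same     775   1000/63    11030238 share/taxes2008 - jan 03.pdf
  same     775   1000/63    10754698 share/taxes2015 - jan 02.pdf
EOF

result=$(awk '
    { key = substr($0, 10) }        # everything from character 10 to the end of the line
    NR == FNR { c[key]++; next }    # first file (fileb): count each key
    !(key in c)                     # second file (filea): print lines with unseen keys
' "$tmp/fileb" "$tmp/filea")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Because awk prints `$0` rather than the key, the missing `share/recipes.txt` line comes out in full, status column included.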

pan64 · 02-23-2016, 04:58 AM

do you mean substr($0, 10, 3) ?

hydrurga · 02-23-2016, 05:06 AM

Quote:

Originally Posted by pan64

do you mean substr($0, 10, 3) ?

No, because we're comparing the rest of the line, not just the file attributes.
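A quick sketch of the difference, using qombi's filea/fileb example as posted (no leading spaces there, so the permissions sit at characters 10 to 12). The variable names are illustrative.

```shell
tmp=$(mktemp -d)
cat > "$tmp/filea" <<'EOF'
create d 775   1000/63    11030238 share/
create   775   1000/63    11030238 share/taxes2008 - jan 03.pdf
create   775   1000/63    11030238 share/taxes2009 - jan 03.pdf
EOF
cat > "$tmp/fileb" <<'EOF'
create d 664   1000/63    11030238 share/
same     775   1000/63    11030238 share/taxes2008 - jan 03.pdf
create   775   1000/63    99999999 share/taxes2009 - jan 03.pdf
EOF

# Three-character key: only the permissions are compared.
perms_only=$(awk 'NR==FNR {c[substr($0,10,3)]++; next} !(substr($0,10,3) in c)' "$tmp/fileb" "$tmp/filea")

# Open-ended key: every column from character 10 onward is compared.
rest_of_line=$(awk 'NR==FNR {c[substr($0,10)]++; next} !(substr($0,10) in c)' "$tmp/fileb" "$tmp/filea")

printf 'permissions only: [%s]\n' "$perms_only"
printf 'rest of line:\n%s\n' "$rest_of_line"
rm -rf "$tmp"
```

With the three-character key nothing is reported at all: the size change goes unnoticed, and even the 775 to 664 change on `share/` is hidden because other fileb lines still carry 775. With the open-ended key both changed lines from filea are reported.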

qombi · 02-23-2016, 08:18 AM

Quote:

Originally Posted by hydrurga

Given that the file permissions start at the same character position (offset 10) each time, and you want to compare everything else in the line starting at that point, the original awk can be modified to work on a substring of each line:

Code:

awk 'NR==FNR{c[substr($0,10)]++;next};c[substr($0,10)] == 0' fileb filea

Let us know how it gets on.

Works beautifully! Thanks to everyone who replied. I appreciate the help. My goal is to understand awk command syntax so I know why this works and to be able to use this command in the future.