LinuxQuestions.org
Old 02-20-2016, 08:53 PM   #1
qombi
Member
 
Registered: Jan 2016
Posts: 34

Rep: Reputation: Disabled
Bash Script Help Request


I am new at writing bash scripts. Would anyone please assist me in writing one? I would like to compare two files and determine if lines are not present in one of them but are in the other. The lines are not sequential and I would like to ignore the first word in each line.

Example:

FileA:
Code:
created test1
created test2
created test3
created test4
FileB:
Code:
same test1
same test2
created test5
If I execute command:

Code:
grep -Fvf FileB FileA
Since the first word in each line is different, the output from the command is:

Code:
created test1
created test2
created test3
created test4
The results I would like would be:

Code:
created test3
created test4
Any way to ignore the first word in each line in the file being compared to, in this example FileA?

Thanks!

Last edited by qombi; 02-20-2016 at 08:58 PM.
 
Old 02-20-2016, 09:35 PM   #2
hydrurga
LQ Guru
 
Registered: Nov 2008
Location: Pictland
Distribution: Linux Mint 21 MATE
Posts: 8,048
Blog Entries: 5

Rep: Reputation: 2925
A quick search on the interwebs for "linux diff specific column" came up with this answered question:

http://unix.stackexchange.com/questi...lumn-in-a-file

With a little tweaking, the top answer solves your query, but I'll let you work out what that tweaking is. ;-)
 
Old 02-20-2016, 09:50 PM   #3
Habitual
LQ Veteran
 
Registered: Jan 2011
Location: Abingdon, VA
Distribution: Catalina
Posts: 9,374
Blog Entries: 37

Rep: Reputation: Disabled
https://duckduckgo.com/?q=files+diff...xquestions.org

Ask if you have issues applying any solution from that search.
 
Old 02-21-2016, 10:10 AM   #4
qombi
Member
 
Registered: Jan 2016
Posts: 34

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by hydrurga
A quick search on the interwebs for "linux diff specific column" came up with this answered question:

http://unix.stackexchange.com/questi...lumn-in-a-file

With a little tweaking, the top answer solves your query, but I'll let you work out what that tweaking is. ;-)
Thank you, this script looks like it may work for what I need. Please forgive me, though, when I say I do not understand how it works, even after reading the man page for awk and the example given in the link. I am definitely not a programmer. I assume I want to exclude the first column from my original example and compare all other columns. Also, the lines are not necessarily sequential. I would like to look for any identical lines matching any in the other document. How would you do that? I did not come straight to these forums without Google searches, I promise! I didn't even originally know what an intelligent search would be for the task at hand!

Also, the columns in the text file I want to compare have a space before them. Will that affect what is considered a column? Thanks for the help.

Code:
$ awk 'NR==FNR{c[$2]++;next};c[$2] == 0' filea fileb
This looks like it worked for my original example, but not on the real two files I want to compare. Does anyone know of a "for dummies" website that explains what each part of the command does? I would like to understand it, but it is over my head.
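
For anyone following along, here is a minimal annotated sketch of what each part of that one-liner does, run against the two sample files from the first post. The scratch directory and the `result` variable are only for illustration; the files are given in the order FileB FileA so that the output is the set of lines in FileA whose second word never appears in FileB.

```shell
# Recreate the two sample files from the first post in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' 'created test1' 'created test2' 'created test3' 'created test4' > "$workdir/FileA"
printf '%s\n' 'same test1' 'same test2' 'created test5' > "$workdir/FileB"

# NR  = number of lines read so far across all input files
# FNR = number of lines read so far in the current file
# NR == FNR is therefore true only while awk is reading the first file (FileB).
result=$(awk '
    NR == FNR { c[$2]++; next }   # first file: count each second field, then skip to the next line
    c[$2] == 0                    # second file: a bare condition prints the line when it is true
' "$workdir/FileB" "$workdir/FileA")

printf '%s\n' "$result"
rm -r "$workdir"
```

With the sample data this prints `created test3` and `created test4`, which is the result asked for in the first post.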

Last edited by qombi; 02-21-2016 at 10:25 AM.
 
Old 02-21-2016, 12:36 PM   #5
hydrurga
LQ Guru
 
Registered: Nov 2008
Location: Pictland
Distribution: Linux Mint 21 MATE
Posts: 8,048
Blog Entries: 5

Rep: Reputation: 2925
A search for "awk comparing two files" came up with this interesting YouTube video looking at how to use FNR and NR:

https://www.youtube.com/watch?v=hnT4WTz9dR8

Given that your real-life data isn't the same as the example you gave, perhaps you should show some sample real-life data, anonymising anything sensitive?
 
Old 02-21-2016, 12:39 PM   #6
MadeInGermany
Senior Member
 
Registered: Dec 2011
Location: Simplicity
Posts: 2,792

Rep: Reputation: 1201
The columns are (white-)space separated; there MUST be a space.
The first given link tries to explain the code.
It also says to run it twice.
Code:
awk 'NR==FNR {c[$2]++; next}; !($2 in c)' filea fileb
awk 'NR==FNR {c[$2]++; next}; !($2 in c)' fileb filea
I have corrected the lookup in the array c a bit.
The lookup is an implicit if clause, and no { action } means an implicit { print }.
--
Please give an example of input that does not work!
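
A small sketch of the difference between the two lookups, using a made-up input line (`x test9` is not from the thread). Both forms select the same lines in a single pass like the one above; the difference is that merely referencing `c[$2]` creates an empty element in the array, whereas `$2 in c` only tests whether the key exists.

```shell
result=$(printf '%s\n' 'x test9' | awk '
    {
        before = ($2 in c)        # 0: nothing is stored under this key yet
        miss   = (c[$2] == 0)     # 1: an unset element compares equal to 0 ...
        after  = ($2 in c)        # 1: ... but referencing it created the key
        print before, miss, after
    }')
printf '%s\n' "$result"
```

This prints `0 1 1`. The side effect is harmless for this comparison, but it makes the array grow with every looked-up key and would change the outcome of any later `in` test on the same array.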
 
Old 02-21-2016, 12:48 PM   #7
hydrurga
LQ Guru
 
Registered: Nov 2008
Location: Pictland
Distribution: Linux Mint 21 MATE
Posts: 8,048
Blog Entries: 5

Rep: Reputation: 2925
Quote:
Originally Posted by MadeInGermany
The columns are (white-)space separated; there MUST be a space.
The first given link tries to explain the code.
It also says to run it twice.
Code:
awk 'NR==FNR {c[$2]++; next}; !($2 in c)' filea fileb
awk 'NR==FNR {c[$2]++; next}; !($2 in c)' fileb filea
I have corrected the lookup in the array c a bit.
The lookup is an implicit if clause, and no { action } means an implicit { print }.
--
Please give an example of input that does not work!
In the OP's original example, the output implied that the only comparison required was columns in FileA that weren't in FileB. Therefore, the code only needed to be run once, with the parameters in the order FileB FileA.

However, given that there is added complexity to the OP's real-life data which wasn't reflected in the example he/she gave, the real solution will probably prove to be slightly more complex itself.
 
Old 02-22-2016, 06:15 PM   #8
qombi
Member
 
Registered: Jan 2016
Posts: 34

Original Poster
Rep: Reputation: Disabled
I really appreciate your patience with me. I have pasted below the actual sample of data from files being compared. I should have done so to begin with, I apologize! I have posted the results below after executing the awk command.

FILEA

Code:
  create d 775   1000/63    11030238 share/
  create   775   1000/63    11030238 share/taxes2008 - jan 03.pdf
  create   775   1000/63    11030238 share/taxes2009 - jan 10.pdf
  create   775   1000/63    11030134 share/taxes2010 - feb 01.pdf
  create   775   1000/63          23 share/recipes.txt
  create   775   1000/63    11030238 share/taxes2011 - jan 12.pdf
  create   775   1000/63    11030134 share/taxes2012 - jan 04.pdf
  create   775   1000/63    11030238 share/taxes2013 - jan 12.pdf
  create   775   1000/63    11028566 share/taxes2014 - feb 23.pdf
  create   775   1000/63    10754698 share/taxes2015 - jan 02.pdf
FILEB

Code:
  create d2775 1000/1016        4096 share/limited/docs
  create   775 1000/1016      953978 share/limited/docs/HomeInspection12710.mht
  create   775 1000/1016       90869 share/limited/docs/PassportApplicationComplete.pdf
  create   775 1000/1016       60416 share/limited/docs/Rec Letter.doc
  create   775 1000/1016      128525 share/limited/docs/Structural_Eng_Report.pdf
  create d 775   1000/63    11030238 share/
  same     775   1000/63    11030238 share/taxes2008 - jan 03.pdf
  same     775   1000/63    11030238 share/taxes2009 - jan 10.pdf
  same     775   1000/63    11030134 share/taxes2010 - feb 01.pdf
  same     775   1000/63    11030238 share/taxes2011 - jan 12.pdf
  same     775   1000/63    11030134 share/taxes2012 - jan 04.pdf
  same     775   1000/63    11030238 share/taxes2013 - jan 12.pdf
  same     775   1000/63    11028566 share/taxes2014 - feb 23.pdf
  same     775   1000/63    10754698 share/taxes2015 - jan 02.pdf
I would like to know that this line below no longer exists by comparing FILEB to FILEA. The first column I wish to ignore so I do not display lines that changed from create to same. Also notice that each line isn't sequential.

Code:
create   775   1000/63          23 share/recipes.txt
Code:
awk 'NR==FNR{c[$2]++;next};c[$2] == 0' fileb filea

awk 'NR==FNR{c[$2]++;next};c[$2] == 0' filea fileb
  create d2775 1000/1016        4096 share/limited/docs
It worked in my simple previous example but not with the actual data. I am watching the YouTube video from the link provided for awk; it's interesting indeed. Thanks again to everyone who has posted so far.
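
One way to see why the `$2` comparison behaves like this on the real data is to print what awk treats as the second field of each line. The sketch below uses a few lines copied from the sample above; the scratch directory is only for illustration.

```shell
workdir=$(mktemp -d)
cat > "$workdir/filea" <<'EOF'
  create d 775   1000/63    11030238 share/
  create   775   1000/63          23 share/recipes.txt
EOF
cat > "$workdir/fileb" <<'EOF'
  create d2775 1000/1016        4096 share/limited/docs
  create d 775   1000/63    11030238 share/
  same     775   1000/63    11030238 share/taxes2008 - jan 03.pdf
EOF

# Print the file name and what awk considers the second field of each line.
result=$(cd "$workdir" && awk '{ print FILENAME ": " $2 }' filea fileb)
printf '%s\n' "$result"
rm -r "$workdir"
```

The second field is only ever `d`, `775` or `d2775`. Every value found in filea also occurs in fileb, so nothing is printed for `fileb filea`, and in the other direction only the `d2775` line is new.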

Last edited by qombi; 02-22-2016 at 06:35 PM.
 
Old 02-22-2016, 06:44 PM   #9
hydrurga
LQ Guru
 
Registered: Nov 2008
Location: Pictland
Distribution: Linux Mint 21 MATE
Posts: 8,048
Blog Entries: 5

Rep: Reputation: 2925
Before everything else, qombi, the "d" and "d2" could prove difficult. They appear to be part of a column that doesn't always have data, thus meaning for example that in

Code:
  create d 775   1000/63    11030238 share/
  create   775   1000/63    11030238 share/taxes2008 - jan 03.pdf
the 775 in the first line is in the third column, whereas the 775 in the second line is in the second column, unless the field separators are actually tabs which we can't see.

More than that, the first line in FileB appears to have no space between the d2 and the 775.

The first problem with data comparison like this is often to clean the data up and ensure that it is consistent.

We can get around the variable columns issue if the 775 always starts on the same character position on each line - is this the case?

Or, if the only thing that changes between a corresponding line for the same file in FileA and FileB is "create" and "same", then we can carry out a sed before doing the comparison.

Is the filename always unique and the rest of the line static (apart from the create/same), so that we can just compare the filenames and forget the rest? In other words, how much do we really need to check from corresponding lines to be sure that they are the same or that one doesn't exist?
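
A minimal sketch of the "sed before the comparison" idea, assuming the first word is only ever `create` or `same` (any other status word would have to be added to the pattern) and that `sed -E` is available, as it is in GNU and BSD sed. The `strip_status` helper, the `.norm` file names and the scratch directory are made up for illustration.

```shell
workdir=$(mktemp -d)
cat > "$workdir/filea" <<'EOF'
  create d 775   1000/63    11030238 share/
  create   775   1000/63    11030238 share/taxes2008 - jan 03.pdf
  create   775   1000/63          23 share/recipes.txt
EOF
cat > "$workdir/fileb" <<'EOF'
  create d 775   1000/63    11030238 share/
  same     775   1000/63    11030238 share/taxes2008 - jan 03.pdf
EOF

# Strip the leading blanks, the status word, and the blanks that follow it.
strip_status() {
    sed -E 's/^[[:space:]]*(create|same)[[:space:]]+//' "$1"
}
strip_status "$workdir/filea" > "$workdir/filea.norm"
strip_status "$workdir/fileb" > "$workdir/fileb.norm"

# -F fixed strings, -x match whole lines only, -v invert, -f read patterns from a file
result=$(grep -Fvxf "$workdir/fileb.norm" "$workdir/filea.norm")
printf '%s\n' "$result"
rm -r "$workdir"
```

On this sample the only line left over is the one for share/recipes.txt, shown without its status word. The `-x` option matters here: without it, a stripped line from fileb would also match any longer filea line that merely contains it.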
 
Old 02-22-2016, 09:40 PM   #10
qombi
Member
 
Registered: Jan 2016
Posts: 34

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by hydrurga
Before everything else, qombi, the "d" and "d2" could prove difficult. They appear to be part of a column that doesn't always have data, thus meaning for example that in

Code:
  create d 775   1000/63    11030238 share/
  create   775   1000/63    11030238 share/taxes2008 - jan 03.pdf
the 775 in the first line is in the third column, whereas the 775 in the second line is in the second column, unless the field separators are actually tabs which we can't see.

More than that, the first line in FileB appears to have no space between the d2 and the 775.

The first problem with data comparison like this is often to clean the data up and ensure that it is consistent.
Quote:
We can get around the variable columns issue if the 775 always starts on the same character position on each line - is this the case?
The "775" always starts in the same character position; however, the numbers could be different. The 775 is the file's permissions. If the original value did not exist, I would want that line reported as missing. The only column whose differences I do not want to know about is the first. To provide some background information, these are logs generated by backup software. It recreates the directory structure during each backup. Any files that are identical will be listed as "same" or "pool". Any new files will be listed as "create", and directories will always be listed as "create d". So when a new file is added, it is listed as "create" in log file filea; in the next backup, if that same file still exists, it will be listed as "pool" or "same". That change I do not wish to know about, only changes in the following columns.

Quote:
Or, if the only thing that changes between a corresponding line for the same file in FileA and FileB is "create" and "same", then we can carry out a sed before doing the comparison.
If any of the data in the other columns changed, I would like the output to show that the original line no longer exists, so I am aware of the change. The only values that could be in the first column are create, pool and same, which I would like to ignore if the rest of the line is equal.

Quote:
Is the filename always unique and the rest of the line static (apart from the create/same), so that we can just compare the filenames and forget the rest? In other words, how much do we really need to check from corresponding lines to be sure that they are the same or that one doesn't exist?
Other fields have the possibility of changing, or I should say if they do change I would like to know the previous line is missing. I hope that makes sense. Here is an example:

I would like to compare fileb to filea and learn if any bolded information below has changed.

filea

Code:
  
create d 775   1000/63    11030238 share/
create   775   1000/63    11030238 share/taxes2008 - jan 03.pdf
create   775   1000/63    11030238 share/taxes2009 - jan 03.pdf
fileb

Code:
  
create d 664   1000/63    11030238 share/
same     775   1000/63    11030238 share/taxes2008 - jan 03.pdf
create   775   1000/63    99999999 share/taxes2009 - jan 03.pdf
In this example, after comparison, we find that these lines have changes we care about, while ignoring changes in the first column, which we do not care about. The desired output would be these lines from filea, which are no longer present in fileb:


Code:
create d 775   1000/63    11030238 share/
create   775   1000/63    11030238 share/taxes2009 - jan 03.pdf
I hope that explains the intent and a little more about the situation.

Last edited by qombi; 02-22-2016 at 09:46 PM.
 
Old 02-23-2016, 12:16 AM   #11
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,842

Rep: Reputation: 7308
So you don't care about the first 9 characters. In that case you need to remove them, because that awk script compares a single whitespace-separated field and cannot really handle that d (whether it is present or missing).
As a simple solution you can use the command cut:
Code:
cut -c 10- input_file > output_file
and run that awk script on those files. If that works for you, you can construct a shell script to put all those things together, or add that functionality to the awk script itself.
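
A minimal sketch of how the pieces could be put together, assuming the first 9 characters really can be discarded on every line. The `.cut` file names and the scratch directory are made up, and the awk step here compares the whole remaining line (`$0`) rather than the `$2` field used earlier in the thread, since after the cut it is the rest of the line that matters.

```shell
workdir=$(mktemp -d)
cat > "$workdir/filea" <<'EOF'
  create d 775   1000/63    11030238 share/
  create   775   1000/63    11030238 share/taxes2008 - jan 03.pdf
  create   775   1000/63          23 share/recipes.txt
EOF
cat > "$workdir/fileb" <<'EOF'
  create d2775 1000/1016        4096 share/limited/docs
  create d 775   1000/63    11030238 share/
  same     775   1000/63    11030238 share/taxes2008 - jan 03.pdf
EOF

# Drop characters 1-9 (the status word) from both logs.
cut -c 10- "$workdir/filea" > "$workdir/filea.cut"
cut -c 10- "$workdir/fileb" > "$workdir/fileb.cut"

# Remember every trimmed line of fileb, then print the trimmed lines of filea not seen there.
result=$(awk 'NR == FNR { seen[$0]; next } !($0 in seen)' \
    "$workdir/fileb.cut" "$workdir/filea.cut")
printf '%s\n' "$result"
rm -r "$workdir"
```

On this sample only the share/recipes.txt line remains. The output has lost its first 9 characters, which is the price of trimming the files before comparing them.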
 
Old 02-23-2016, 04:37 AM   #12
hydrurga
LQ Guru
 
Registered: Nov 2008
Location: Pictland
Distribution: Linux Mint 21 MATE
Posts: 8,048
Blog Entries: 5

Rep: Reputation: 2925
Given that the file permissions start at the same character position (offset 10) each time, and you want to compare everything else in the line starting at that point, the original awk can be modified to work on a substring of each line:

Code:
awk 'NR==FNR{c[substr($0,10)]++;next};c[substr($0,10)] == 0' fileb filea
Let us know how it gets on.
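
For reference, a sketch of the same command spread over several lines with comments, run on a few lines copied from the sample data (leading spaces included). The `key` variable and the scratch directory are only for illustration. Note that awk's `substr` counts characters from 1, and that with two arguments it returns everything from that position to the end of the line.

```shell
workdir=$(mktemp -d)
cat > "$workdir/filea" <<'EOF'
  create d 775   1000/63    11030238 share/
  create   775   1000/63    11030238 share/taxes2008 - jan 03.pdf
  create   775   1000/63          23 share/recipes.txt
EOF
cat > "$workdir/fileb" <<'EOF'
  create d2775 1000/1016        4096 share/limited/docs
  create d 775   1000/63    11030238 share/
  same     775   1000/63    11030238 share/taxes2008 - jan 03.pdf
EOF

result=$(awk '
    { key = substr($0, 10) }          # everything from character 10 to the end of the line
    NR == FNR { c[key]++; next }      # first file (fileb): remember each key
    c[key] == 0                       # second file (filea): print lines whose key is new
' "$workdir/fileb" "$workdir/filea")
printf '%s\n' "$result"
rm -r "$workdir"
```

Only the share/recipes.txt line is printed, and because the key is used just for the lookup, the line appears in full, first column included.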
 
Old 02-23-2016, 04:58 AM   #13
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,842

Rep: Reputation: 7308
do you mean substr($0, 10, 3) ?
 
Old 02-23-2016, 05:06 AM   #14
hydrurga
LQ Guru
 
Registered: Nov 2008
Location: Pictland
Distribution: Linux Mint 21 MATE
Posts: 8,048
Blog Entries: 5

Rep: Reputation: 2925
Quote:
Originally Posted by pan64
do you mean substr($0, 10, 3) ?
No, because we're comparing the rest of the line, not just the file attributes.
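
A quick sketch of the difference between the two calls, on one line copied from the posted sample (including its two leading spaces). The variable names are made up for illustration.

```shell
line='  create   775   1000/63          23 share/recipes.txt'

# Two arguments: from character 10 to the end of the line.
rest=$(printf '%s\n' "$line" | awk '{ print substr($0, 10) }')

# Three arguments: at most 3 characters starting at character 10.
three=$(printf '%s\n' "$line" | awk '{ print substr($0, 10, 3) }')

printf '[%s]\n[%s]\n' "$rest" "$three"
```

With a three-character key, every file that shares those three characters would collapse into a single array entry, so a missing line such as the recipes.txt one could not be detected.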
 
Old 02-23-2016, 08:18 AM   #15
qombi
Member
 
Registered: Jan 2016
Posts: 34

Original Poster
Rep: Reputation: Disabled

Quote:
Originally Posted by hydrurga
Given that the file permissions start at the same character position (offset 10) each time, and you want to compare everything else in the line starting at that point, the original awk can be modified to work on a substring of each line:

Code:
awk 'NR==FNR{c[substr($0,10)]++;next};c[substr($0,10)] == 0' fileb filea
Let us know how it gets on.
Works beautifully! Thanks to everyone who replied. I appreciate the help. My goal is to understand awk command syntax so I know why this works and to be able to use this command in the future.
 
  

