compare two files

haydar68 · 08-16-2008, 10:35 AM

Hi,

I want to compare 2 files and get a new output that contains the differences. Each file contains 5 fields (matricule, first name, last name, age, profession).

file1 is the original file. file2 should be synchronized with file1: I want to look for any changes in file1 and apply them to file2.

#cat file1
10000;john;Trad;40;teacher
10001;georges;Hold;34;physician
10002;Catherina;Rick;36;doctor
10003;marc;bob;46;techician

#cat file2
10000;john;Trad;40;teacher
10001;georges;Hold;40;physician
10003;marc;Robert;46;programmer
10004;Maria;Roch;39;nurse

I use this script:

awk 'NR==FNR {f1[$0]=$0}
NR!=FNR {f2[$0]=$0}
END {
for(i in f1) if(!(i in f2)) print "Only in f1: " f1[i]
for(i in f2) if(!(i in f1)) print "Only in f2: " f2[i]
}' file1 file2

I get this result:

==============
Only in f1: 10001;georges;Hold;34;physician
Only in f1: 10003;marc;bob;46;techician
Only in f1: 10002;Catherina;Rick;36;doctor
Only in f2: 10004;Maria;Roch;39;nurse
Only in f2: 10003;marc;Robert;46;programmer
Only in f2: 10001;georges;Hold;40;physician
==============

But this is not the result I was hoping for.
I want to get a result like this:

===========
matricule:10001
change: modified
age:40

matricule:10002
change: deleted

matricule:10003
change: modified
lastname: Robert
profession: programmer

matricule:10004
change: added
firstname:Maria
lastname:Roch
age:39
profession:nurse
==========

Can someone help me to get this result with awk?

Thanks,

Haydar

CRC123 · 08-16-2008, 10:44 AM

Try writing a script with the 'diff' command. It takes two files as input and then reports the differences between them, if there are any. Read the man page for it.

haydar68 · 08-16-2008, 10:46 AM

Quote:

Originally Posted by CRC123

Try writing a script with the 'diff' command. It takes two files as input and then reports the differences between them, if there are any. Read the man page for it.

Thanks for your suggestion. I know how to use diff, but I need to use awk: it is a simple command that runs faster than diff/grep and can produce the result that I need.

jschiwal · 08-16-2008, 10:51 AM

This seems very contrived and makes me think this is a homework question. The sample you posted doesn't look in individual records at all, so it doesn't seem that you even wrote it yourself. If the first file is being read then 'NR==FNR' will be true. The logic in the END section tests if the records saved in the array differ. You need to change what you do if they differ and test which fields differ in that case.
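
A minimal sketch of that last point, not the original poster's script: if the first file is stored in an array indexed by the matricule in $1 rather than by the whole line, a record that exists in both files with different contents can be classified as modified instead of being reported once per file. The helper name classify and the scratch directory are assumptions made for illustration.

```shell
# Scratch copies of the sample data from the first post.
dir=$(mktemp -d)
printf '%s\n' '10000;john;Trad;40;teacher' '10001;georges;Hold;34;physician' \
    '10002;Catherina;Rick;36;doctor' '10003;marc;bob;46;techician' > "$dir/file1"
printf '%s\n' '10000;john;Trad;40;teacher' '10001;georges;Hold;40;physician' \
    '10003;marc;Robert;46;programmer' '10004;Maria;Roch;39;nurse' > "$dir/file2"

# classify OLD NEW: hypothetical helper, prints "key: status" for each changed record.
classify() {
    awk -F';' '
    NR == FNR { old[$1] = $0; next }   # first file: save each record under its matricule
    {                                  # second file: compare against the saved record
        seen[$1] = 1
        if (!($1 in old))       print $1 ": added"
        else if (old[$1] != $0) print $1 ": modified"
    }
    END {                              # keys that never appeared in the second file
        for (k in old) if (!(k in seen)) print k ": deleted"
    }' "$1" "$2"
}

classify "$dir/file1" "$dir/file2"
```

On the sample data this reports 10001 and 10003 as modified, 10004 as added and 10002 as deleted. It does not yet say which fields changed.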

pixellany · 08-16-2008, 11:16 AM

Quote:

file2 should be synchronized with file1:I want to look for any change on file1 and want to apply these changes on file2.

This suggests that you could just copy file1 to file2.
Perhaps what you meant to say is that, if file1 has data for a particular field which is different than the corresponding field in file2 (if it exists), then that field in file2 should be updated.

And, yes, why does it have to be AWK?

haydar68 · 08-16-2008, 11:34 AM

Quote:

Originally Posted by pixellany

This suggests that you could just copy file1 to file2.
Perhaps what you meant to say is that, if file1 has data for a particular field which is different than the corresponding field in file2 (if it exists), then that field in file2 should be updated.

And, yes, why does it have to be AWK?

Hi Pixellany,

I agree that I could simply copy file1 to file2.

But my goal is to track the changes that were made in file1. I did not find any link that explains clearly how to manipulate 2 files and their fields using awk.

Thanks for your comments,

Haydar

jiml8 · 08-16-2008, 11:57 AM

The comm command is exactly what is required here.

man comm

jschiwal · 08-16-2008, 12:17 PM

Since a record in one file may be missing from the other file, you may want to create two arrays as you are doing, but use the first field as the index instead of the whole record. Life might be easier if both files are sorted by the first field as well. The sort command can guarantee that, in case the files are not already sorted. Note that the semicolon has to be quoted, otherwise the shell treats it as a command separator:
awk -f commands.awk <(sort -t';' -k1,1 file1) <(sort -t';' -k1,1 file2)

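Since your report is only concerned with the differences, you could use the "comm" command to filter out the common lines. comm takes both files as arguments and compares whole lines, so a plain sort is enough here:
comm -23 <(sort file1) <(sort file2) >temp1
comm -13 <(sort file1) <(sort file2) >temp2
awk -f commands.awk temp1 temp2 >report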

Also, remember that awk arrays are one-dimensional. That means that you can't have a true two-dimensional array of records/fields (a subscript such as a[i,j] is just a single concatenated string key). You will either have to decompose each saved record manually (in the END section logic, for example with split()) instead of using $1, $2, etc., or assign a saved record to $0 so that awk splits it into fields again. In the second case, copy the fields of the file1 record into a temporary array before assigning the corresponding file2 record to $0.
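
A rough sketch of the first option, under stated assumptions: the saved file1 record is decomposed with split() and compared field by field against the current file2 record, and the report shows the file2 values, as in the sample output in the first post. The helper name report_changes and the field labels are made up for illustration. Deleted records are printed last rather than in matricule order, because they are only known in the END block.

```shell
# Scratch copies of the sample data from the first post.
dir=$(mktemp -d)
printf '%s\n' '10000;john;Trad;40;teacher' '10001;georges;Hold;34;physician' \
    '10002;Catherina;Rick;36;doctor' '10003;marc;bob;46;techician' > "$dir/file1"
printf '%s\n' '10000;john;Trad;40;teacher' '10001;georges;Hold;40;physician' \
    '10003;marc;Robert;46;programmer' '10004;Maria;Roch;39;nurse' > "$dir/file2"

# report_changes OLD NEW: hypothetical helper that prints one block per changed record.
report_changes() {
    awk -F';' '
    BEGIN { split("matricule;firstname;lastname;age;profession", name, ";") }
    NR == FNR { old[$1] = $0; next }          # first file: save records by matricule
    {
        seen[$1] = 1
        if (!($1 in old)) {                   # key only in the second file
            printf "matricule:%s\nchange: added\n", $1
            for (i = 2; i <= NF; i++) printf "%s:%s\n", name[i], $i
            print ""
        } else if (old[$1] != $0) {           # same key, different contents
            split(old[$1], f, ";")            # decompose the saved record by hand
            printf "matricule:%s\nchange: modified\n", $1
            for (i = 2; i <= NF; i++)
                if (($i "") != (f[i] ""))     # appending "" forces a string comparison
                    printf "%s:%s\n", name[i], $i
            print ""
        }
    }
    END {                                     # keys only in the first file
        for (k in old) if (!(k in seen)) printf "matricule:%s\nchange: deleted\n\n", k
    }' "$1" "$2"
}

report_changes "$dir/file1" "$dir/file2"
```

With more than one deleted record, the order of the deleted blocks is unspecified, since "for (k in old)" does not iterate in any guaranteed order.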

Awk arrays are associative, so the index can be a word instead of an integer. That may help. The index could be lastname or profession. That will make your awk program easier to read.

Often in Unix/Linux, your best approach is to use small tools like grep, sort and comm, each doing part of the job. Comm only works on sorted files, so that is a given. Working with only entries that differ means that the arrays can be smaller in awk as well.
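
To illustrate what the comm pre-filter leaves for awk to work on, here is a small sketch run on the sample data from the first post. The scratch directory and the intermediate file names s1 and s2 are assumptions; LC_ALL=C is set so that sort and comm agree on the ordering.

```shell
# Scratch copies of the sample data from the first post.
dir=$(mktemp -d)
printf '%s\n' '10000;john;Trad;40;teacher' '10001;georges;Hold;34;physician' \
    '10002;Catherina;Rick;36;doctor' '10003;marc;bob;46;techician' > "$dir/file1"
printf '%s\n' '10000;john;Trad;40;teacher' '10001;georges;Hold;40;physician' \
    '10003;marc;Robert;46;programmer' '10004;Maria;Roch;39;nurse' > "$dir/file2"

# comm needs both inputs sorted with the same collation it uses itself.
LC_ALL=C sort "$dir/file1" > "$dir/s1"
LC_ALL=C sort "$dir/file2" > "$dir/s2"

# -23 keeps lines found only in the first file, -13 lines found only in the second.
LC_ALL=C comm -23 "$dir/s1" "$dir/s2" > "$dir/temp1"
LC_ALL=C comm -13 "$dir/s1" "$dir/s2" > "$dir/temp2"

echo "only in file1:"; cat "$dir/temp1"
echo "only in file2:"; cat "$dir/temp2"
```

The unchanged record 10000 is dropped from both temporary files, so the awk arrays only have to hold the three differing lines from each side.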

estabroo · 08-16-2008, 03:36 PM

cat file1 file2 | perl -e 'while (<>) { ($key,$val) = split(/;/,$_,2); $keep{$key} = $val; }; foreach $key (sort(keys(%keep))) { print "$key;$keep{$key}" }'

Just reverse the cat if you want the file precedence the other way. cat file1 file2 means that items in file2 will take the place of items in file1, which by your example looks like what you wanted.
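
Since the original question asked for awk, here is a rough awk equivalent of the same merge, offered as a sketch rather than a drop-in replacement. The helper name merge_records and the scratch directory are assumptions. As in the Perl version, a later file overrides an earlier one for the same key, and the result is sorted by the first field.

```shell
# Scratch copies of the sample data from the first post.
dir=$(mktemp -d)
printf '%s\n' '10000;john;Trad;40;teacher' '10001;georges;Hold;34;physician' \
    '10002;Catherina;Rick;36;doctor' '10003;marc;bob;46;techician' > "$dir/file1"
printf '%s\n' '10000;john;Trad;40;teacher' '10001;georges;Hold;40;physician' \
    '10003;marc;Robert;46;programmer' '10004;Maria;Roch;39;nurse' > "$dir/file2"

# merge_records FILE...: hypothetical helper; the last record seen for a key wins.
merge_records() {
    awk -F';' '
    { rec[$1] = $0 }                      # later lines overwrite earlier ones per key
    END { for (k in rec) print rec[k] }   # iteration order is unspecified, so sort after
    ' "$@" | LC_ALL=C sort -t';' -k1,1
}

merge_records "$dir/file1" "$dir/file2"
```

Swapping the two arguments gives file1 precedence instead.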