Comparing two Linux files for differences and similarities.

secondchanti · 07-26-2010, 10:41 AM

Dear friends,

I have the following two Linux files.

File 1:

123
456
789
987
654
321

File 2:

123
258
236
456
458
658
987
321
568
963
458
758
854
569

Now I want the following outputs:

1. Numbers that appear in both file 1 and file 2 > output = file 3;
2. Numbers in file 1 but not in file 2 > output = file 4;
3. Numbers in file 2 but not in file 1 > output = file 5;

The sdiff command gives output with symbols such as >, < and |,
so the output file is not clean and ready to print.

I want to be able to print the output files directly.

Please suggest suitable commands or awk commands.

Also, please tell me where I have to write awk programs and how to run them.

Please help me.

RAO

GrapefruiTgirl · 07-26-2010, 10:54 AM

Sounds like all you need is the plain old `diff` command, perhaps with the --GTYPE-group-format=GFMT option.

Get lines in file 1 but not in file 2:

Code:

diff %< file file > output

Get lines in file 2 but not in file 1:

Code:

diff %> file file > output

Get stuff common to both files:

Code:

diff %= file file > output

NOTE: I've never used this option, so it may not work exactly as I've written - try it and see.

There are other ways of getting only one file's different lines, still using diff. `diff` also has lots of options for formatting the output - read the man page for details, and experiment with it.

If you want to use `awk`, either write an awk script (basically a plain text file) that starts with a shebang like #!/usr/bin/awk -f, or write a bash script (again, basically a text file) with a shebang like #!/bin/bash and, within it, send data into `awk` either via a pipe or by telling awk which file to read. Either kind of script can be executed from your terminal once it is made executable (chmod +x).
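To make the awk route concrete, here is a minimal sketch. It assumes the two files are named file1 and file2 and hold one number per line, as in the first post. The script name compare.awk and the `want` variable are made up for this illustration and are not part of awk itself.

```shell
# Work in a scratch directory and recreate the sample files from the first post.
cd "$(mktemp -d)" || exit 1
printf '%s\n' 123 456 789 987 654 321 > file1
printf '%s\n' 123 258 236 456 458 658 987 321 568 963 458 758 854 569 > file2

# An awk program is just a text file. This one (hypothetical name: compare.awk)
# remembers every line of the first file named on the command line, then prints
# the lines of the second file that are (want=1) or are not (want=0) among them.
cat > compare.awk <<'EOF'
NR == FNR { seen[$0] = 1; next }   # still reading the first file
(($0 in seen) == want)             # second file: print when membership matches
EOF

awk -v want=1 -f compare.awk file2 file1 > file3   # in both files
awk -v want=0 -f compare.awk file2 file1 > file4   # in file1 but not in file2
awk -v want=0 -f compare.awk file1 file2 > file5   # in file2 but not in file1
```

The output keeps the order of the file being filtered and does not remove repeated lines, so 458 shows up twice in file5. The NR == FNR test assumes the first file is not empty.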

P.S. - if `diff` alone is not producing precisely the output you want (like if it still has < or > symbols you don't want) then pipe the output through something like `sed` or `tr` to remove unwanted characters.
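As a rough sketch of that clean-up idea: plain `diff` output prefixes lines found only in the first file with "< " and lines found only in the second with "> ", so `sed` can keep just those lines and strip the marker. The file names diff.out, file4 and file5 are only chosen here to match the first post. The diff output is saved to a file rather than piped so that diff's exit status can be checked.

```shell
# Work in a scratch directory and recreate the sample files from the first post.
cd "$(mktemp -d)" || exit 1
printf '%s\n' 123 456 789 987 654 321 > file1
printf '%s\n' 123 258 236 456 458 658 987 321 568 963 458 758 854 569 > file2

# diff exits with status 1 when the files differ, which is expected here;
# any other non-zero status means real trouble.
diff file1 file2 > diff.out || [ $? -eq 1 ]

# sed -n prints only the lines where the substitution succeeded,
# i.e. the marked lines with their marker removed.
sed -n 's/^< //p' diff.out > file4   # lines only in file1
sed -n 's/^> //p' diff.out > file5   # lines only in file2
```

Plain `diff` does not print the lines the files have in common, so this approach covers only two of the three requested outputs.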

EDIT: Added info:

Code:

diff --left-column file1 file2     # show only file1 stuff
diff --left-column file2 file1     # show only file2 stuff

secondchanti · 07-26-2010, 11:43 AM

It is not working. It tells me to look at diff --help.

Please guide me.

Rao

b0uncer · 07-26-2010, 01:30 PM

I couldn't get diff to produce the wanted output quickly enough either, so I wrote a small and ugly Perl script to do the job. Here goes...

Code:

#!/usr/bin/perl

use strict;

# Hashes to store the lines (numbers) from the files.

my (%hash1,
    %hash2,
    %hash3);

# Now read the two files into hashes 1 and 2; hash 3 will be
# filled with values that exist in both hash 1 and 2.

open(FILE, 'file1') or die('Could not open file1');
while (<FILE>) {
	$hash1{$_} = 1;
}
close(FILE);

open(FILE, 'file2') or die('Could not open file2');
while (<FILE>) {
	$hash2{$_} = 1;
}
close(FILE);

# Go through keys in hash 1 (lines of 1st file).
# Write those to OUTPUT1 that don't exist in hash 2 (2nd file).
# Those that do exist in hash 2 are added to hash 3.
# Do the same the other way around, writing to OUTPUT2, after
# which hash3 contains lines that exist in both of the files.
# Then just write the keys of hash 3 to OUTPUT3 and close files.

open(OUTPUT, '>', 'output1') or die('Could not open output1');
for (keys %hash1) {
	# exists() tests for the key itself instead of comparing against undef.
	if (!exists $hash2{$_}) {
		print OUTPUT $_;
	}
	else {
		$hash3{$_} = 1;
	}
}
close(OUTPUT);

open(OUTPUT, '>', 'output2') or die('Could not open output2');
for (keys %hash2) {
	if (!exists $hash1{$_}) {
		print OUTPUT $_;
	}
	else {
		$hash3{$_} = 1;
	}
}
close(OUTPUT);

open(OUTPUT, '>', 'output3') or die('Could not open output3');
for (keys %hash3) {
	print OUTPUT $_;
}
close(OUTPUT);

It doesn't produce sorted output (but you'll figure out how to do that yourself, won't you?), but it does produce three files: output1 contains the items unique to file1, output2 the items unique to file2, and output3 the items that exist in both files. I'm confident that the above code can be made a lot shorter, and that diff probably works faster in the hands of somebody who knows how to use it, but this shows one way too. At least it worked on my test files. Hope it helps a little, if nothing else.

grail · 07-26-2010, 08:16 PM

So it turns out the diff stuff sasha put up does work, although I had to use an extended version (the lines from post #2 did not work for me as is):

Code:

#diff %< file file > output
diff --changed-group-format='%<' --unchanged-group-format='' file1 file2 > output

#diff %> file file > output
diff --changed-group-format='%>' --unchanged-group-format='' file1 file2 > output

#diff %= file file > output
diff --changed-group-format='' --unchanged-group-format='%=' file1 file2 > output

b0uncer · 07-27-2010, 12:37 AM

Right, so I was missing the 2nd option all along. Thanks to grail for updating my knowledge by another piece.

It appears this workstation at work, which runs a different variant of Linux than I have at home, has an older man page for an (also older) diff, and it is somewhat easier to understand on this point (or maybe it's just easier now that I have it working). However, the man page does not define the format for these options in any way, which is odd because running

Code:

diff --help

does; glad this is an older system people don't use much these days. And apparently the diff on the SunOS machine here doesn't work that way at all, so that's another dead end. Luckily they all have Perl.
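For systems where diff lacks the group-format options, one alternative that nobody in this thread used is `comm`, which is specified by POSIX and compares two sorted files line by line. The sketch below is only an illustration under the assumption that sorted output is acceptable. The names file1.sorted and file2.sorted are made up here, and file3 to file5 follow the naming from the first post.

```shell
# Work in a scratch directory and recreate the sample files from the first post.
cd "$(mktemp -d)" || exit 1
printf '%s\n' 123 456 789 987 654 321 > file1
printf '%s\n' 123 258 236 456 458 658 987 321 568 963 458 758 854 569 > file2

# comm expects sorted input; sort into copies so the originals stay untouched.
sort file1 > file1.sorted
sort file2 > file2.sorted

# comm prints three columns: only in first, only in second, in both.
# Each -N option suppresses column N, leaving the one that is wanted.
comm -12 file1.sorted file2.sorted > file3   # lines common to both files
comm -23 file1.sorted file2.sorted > file4   # lines only in file1
comm -13 file1.sorted file2.sorted > file5   # lines only in file2
```

Unlike diff, this does not depend on the order of lines in the original files, and the results come out already sorted. Repeated lines are kept, so 458 still appears twice in file5.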