diff 2 files by content keywords, not lines!

WiseDraco · 12-13-2016, 05:10 AM

Hello!
I have to compare two text files.

They look like this:

WRLV70RPVE
WIBY40UAMN
WALT11EXHM
WALT12EXHM
WELT14ERHM
WTLT20ENHM

Both files have one "keyword" per line, but
some keywords differ between the files and some appear in both.

I need to compare the two files and output which keywords are the same in both
files, which keywords are unique to File 1 compared with File 2, and vice versa.

As I understand it, neither diff nor wdiff can do that: they compare files only line by line, not by words in arbitrary order...?

Turbocapitalist · 12-13-2016, 05:19 AM

The utility comm can do that. Look at the manual page for the different options.

Code:

comm -1 -2 <(sort -u file1) <(sort -u file2)

hydrurga · 12-13-2016, 05:27 AM

What you could do is first sort the two files (with -u, if required, to weed out duplicates), then use comm to produce the report you require.

Edit. :-) Pipped to the post.

WiseDraco · 12-13-2016, 05:32 AM

Quote:

Originally Posted by hydrurga

What you could do is first sort the two files (with -u, if required, to weed out duplicates), then use comm to produce the report you require.

Edit. :-) Pipped to the post.

Sorting does nothing for this task, because the number of keywords differs. However you sort them, the same keywords will always end up on different line numbers in the two files.

Say the sorted beginning of one file was:

AAC
AAT
ABI
ABL
ADO

and the other file was:

AAA
AAB
AAT
ABA
ABC

...

WiseDraco · 12-13-2016, 05:35 AM

Quote:

Originally Posted by Turbocapitalist

The utility comm can do that. Look at the manual page for the different options.

Code:

comm -1 -2 <(sort -u file1) <(sort -u file2)

As I said, this task can't be done by comparing the files line by line, because the positions differ and the contents also partly differ.
However you sort them, there will always be words common to both files that sit on different line numbers...
I need to compare not by position (line number), but by whether a word (code) exists anywhere in the whole file.

hydrurga · 12-13-2016, 05:36 AM

What do line numbers have to do with it? Did you try using comm, and/or Turbocapitalist's neater suggestion using it?

WiseDraco · 12-13-2016, 05:38 AM

Quote:

Originally Posted by hydrurga

What do line numbers have to do with it? Did you try using comm, and/or Turbocapitalist's neater suggestion using it?

NAME
comm - compare two sorted files line by line

Turbocapitalist · 12-13-2016, 05:39 AM

Quote:

Originally Posted by WiseDraco

I need to compare not by position (line number), but by whether a word (code) exists anywhere in the whole file.

That's what comm does. The sort instances are there to generate the unique list for each file. Then comm can tell you which strings are in both files, or just in one or the other, depending on the options given.
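
To make the options concrete, here is a minimal sketch. The sample keywords, the placeholder file names, and the `LC_ALL=C` setting are assumptions for illustration only. It sorts each file into an intermediate file first, which should behave the same as the `<(sort -u file)` form above but does not depend on bash. Each pair of suppression flags leaves exactly one of comm's three columns.

```shell
#!/bin/sh
# Sketch only: sample data and file names are placeholders for illustration.
workdir=$(mktemp -d)
printf '%s\n' AAV AAR ABT ATI > "$workdir/file1"
printf '%s\n' ATI AWO AYY AZZ > "$workdir/file2"

# comm expects both inputs sorted under the same collation,
# so the same locale is forced for sort and comm (an assumption of this sketch).
LC_ALL=C sort -u "$workdir/file1" > "$workdir/sorted1"
LC_ALL=C sort -u "$workdir/file2" > "$workdir/sorted2"

# -12 suppresses columns 1 and 2: keywords present in both files remain.
in_both=$(LC_ALL=C comm -12 "$workdir/sorted1" "$workdir/sorted2")
# -23 suppresses columns 2 and 3: keywords found only in file1 remain.
only_1=$(LC_ALL=C comm -23 "$workdir/sorted1" "$workdir/sorted2")
# -13 suppresses columns 1 and 3: keywords found only in file2 remain.
only_2=$(LC_ALL=C comm -13 "$workdir/sorted1" "$workdir/sorted2")

printf 'both:\n%s\n\nonly file1:\n%s\n\nonly file2:\n%s\n' \
    "$in_both" "$only_1" "$only_2"
rm -rf "$workdir"
```

The line number a keyword sits on never enters into it: comm only walks the two sorted lists and reports which strings match.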

WiseDraco · 12-13-2016, 05:45 AM

Quote:

Originally Posted by Turbocapitalist

That's what comm does. The sort instances are there to generate the unique list for each file. Then comm can tell you which strings are in both files, or just in one or the other, depending on the options given.

So it looks for exact words, regardless of their position in the file?
For example:

file1:

AAV
AAR
ABT
ATI

file2:

ATI
AWO
AYY
AZZ

Does it compare correctly and tell me that the word ATI is in both files?

Do I understand correctly?

I'm trying to understand the output of the example you gave, but there is a lot of text, and I can't quickly see how it works...

Turbocapitalist · 12-13-2016, 05:50 AM

Quote:

Originally Posted by WiseDraco

Does it compare correctly and tell me that the word ATI is in both files?

Yes. The example above with the options -1 and -2 finds only the words which are common to both files.

WiseDraco · 12-13-2016, 05:57 AM

Quote:

Originally Posted by Turbocapitalist

Yes. The example above with the options -1 and -2 finds only the words which are common to both files.

Yes, I tried it. Great, thank you very much!

Is it also possible to get output where the first column holds the words from the first file and the second column the words from the second file, so that a word present in both files appears on the same output line in both columns, while a word present in only one file leaves the other column empty?

I hope my idea is understandable.
If that can be done, it would be super great.

WiseDraco · 12-13-2016, 05:59 AM

Code:

bash-4.3$ comm -12 <(sort -u file1) <(sort -u file2)

ATI
bash-4.3$ 

bash-4.3$ comm -3 <(sort -u file1) <(sort -u file2)
	 
AAR
AAV
ABT
	AWO
	AYY
	AZZ
bash-4.3$ 

It would be great if the output could be given like this:

AAR
AAV
ABT
ATI	ATI
	AWO
	AYY
	AZZ

hydrurga · 12-13-2016, 06:06 AM

Have you tried comm with no -1/2/3 parameters? It's not quite what you want but it's close enough. If you specifically want your desired level of customisation then you'll probably have to write your own Bash script or use a programming language to do the job.
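
For reference, a small sketch of what the default three-column report looks like on the ATI sample from earlier in the thread. The file names are placeholders, and forcing `LC_ALL=C` is an assumption made so that sort and comm agree on ordering.

```shell
#!/bin/sh
# Sketch only: sample data from the ATI example; file names are placeholders.
workdir=$(mktemp -d)
printf '%s\n' AAV AAR ABT ATI > "$workdir/file1"
printf '%s\n' ATI AWO AYY AZZ > "$workdir/file2"

# Sort and de-duplicate each file under the same collation comm will use.
LC_ALL=C sort -u "$workdir/file1" > "$workdir/sorted1"
LC_ALL=C sort -u "$workdir/file2" > "$workdir/sorted2"

# With no -1/-2/-3 flags comm prints all three columns:
#   no leading tab  = only in file1
#   one leading tab = only in file2
#   two leading tabs = in both files
report=$(LC_ALL=C comm "$workdir/sorted1" "$workdir/sorted2")
printf '%s\n' "$report"
rm -rf "$workdir"
```

On this data ATI lands in the third column (two leading tabs), while AWO, AYY and AZZ land in the second (one leading tab).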

WiseDraco · 12-13-2016, 06:08 AM

Quote:

Originally Posted by hydrurga

Have you tried comm with no -1/2/3 parameters? It's not quite what you want but it's close enough. If you specifically want your desired level of customisation then you'll probably have to write your own Bash script or use a programming language to do the job.

Yes, the output without parameters is fine for what I wanted, thank you too!
Thank you both, guys, and have a nice day!

You're super!

Turbocapitalist · 12-13-2016, 06:13 AM

Quote:

Originally Posted by WiseDraco

It would be great if the output could be given like this:

I think that to get that you'd have to write a very short script to check the files separately and then merge and sort the results. It would probably need use of a temporary file. (You can use tempfile to safely generate one.)
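
One possible shape for such a script, as a rough sketch rather than a finished tool. It takes a slightly different route from the one described above: instead of checking the files separately and merging, it post-processes comm's default output with awk. `side_by_side` is a hypothetical helper name, the file names are placeholders, and it uses `mktemp` rather than `tempfile` on the assumption that `mktemp` is what is installed. It also assumes one keyword per line and no tab characters inside a keyword.

```shell
#!/bin/sh
# Sketch only: side_by_side is a hypothetical helper, not an existing tool.
# Assumes one keyword per line and no tab characters inside a keyword.
side_by_side() {
    sorted1=$(mktemp) || return 1
    sorted2=$(mktemp) || { rm -f "$sorted1"; return 1; }
    # comm needs both inputs sorted under the same collation.
    LC_ALL=C sort -u "$1" > "$sorted1"
    LC_ALL=C sort -u "$2" > "$sorted2"
    # comm marks its columns with leading tabs: none = only in file 1,
    # one = only in file 2, two = in both. Splitting on tabs therefore
    # yields 1, 2 or 3 fields, and awk rewrites each case.
    LC_ALL=C comm "$sorted1" "$sorted2" | awk -F '\t' '
        NF == 1 { print $1 }            # only in file 1: left column
        NF == 2 { print "\t" $2 }       # only in file 2: right column
        NF == 3 { print $3 "\t" $3 }    # in both: same line, both columns
    '
    rm -f "$sorted1" "$sorted2"
}

# Usage with the sample data from this thread (file names are placeholders).
workdir=$(mktemp -d)
printf '%s\n' AAV AAR ABT ATI > "$workdir/file1"
printf '%s\n' ATI AWO AYY AZZ > "$workdir/file2"
result=$(side_by_side "$workdir/file1" "$workdir/file2")
printf '%s\n' "$result"
rm -rf "$workdir"
```

On the sample data, ATI appears in both columns on one line, and the keywords unique to the second file are indented by one tab.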