LinuxQuestions.org


sopier 12-17-2011 06:21 AM

Comparing two files and looking for the same line based on field - awk
 
I have two files:

newfile:
Code:

dd7bec79dc95fe49af149a82c9ce092e;GettingStartedWithUbuntu.pdf
13dbbcb2fd46be8d5858a3f69c5d55ab;serverguide.pdf
5a160b28e7b09c32e1cdecb5fdb1f7cc;ubuntu.pdf
8d4cd57f498c17a91e93fd3d9c39192b;ubunturef.pdf

databasefile:
Code:

dd7bec79dc95fe49af149a82c9ce092e;GettingStartedWithUbuntu.pdf
13dbbcb2fd46be8d5858a3f69c5d55ab;serverguide.pdf

Using this command, I can search for duplicate files and remove them:
Code:

awk 'BEGIN {FS=OFS=";"} FNR==NR{arr[$1];next} $1 in arr {print}' newfile databasefile | awk -F ";" '{print $2}' > temp

num=$(wc -w temp | awk '{print $1}')

if [ "$num" != 0 ]; then
    xargs rm < temp
fi

rm temp

The question now is: how can I find the non-duplicate lines between newfile and databasefile?

I already tried this command, but with no luck:

Code:

awk 'BEGIN {FS=OFS=";"} FNR==NR{arr[$1];next} !($1 in arr) {print}' newfile databasefile
Thank you...
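
(For reference: as written, this command loads newfile's md5s into arr and then prints the databasefile lines whose md5 is absent from it. To get the lines of newfile that are missing from databasefile, the file order has to be reversed. A minimal sketch that rebuilds the sample files from the post above:)

```shell
# Recreate the sample files from the post.
printf '%s\n' \
  'dd7bec79dc95fe49af149a82c9ce092e;GettingStartedWithUbuntu.pdf' \
  '13dbbcb2fd46be8d5858a3f69c5d55ab;serverguide.pdf' \
  '5a160b28e7b09c32e1cdecb5fdb1f7cc;ubuntu.pdf' \
  '8d4cd57f498c17a91e93fd3d9c39192b;ubunturef.pdf' > newfile
head -n 2 newfile > databasefile

# Read databasefile's md5 fields first, then print every newfile
# line whose md5 ($1) was not seen there.
awk 'BEGIN {FS=";"} FNR==NR{arr[$1];next} !($1 in arr)' databasefile newfile
```

This prints the ubuntu.pdf and ubunturef.pdf lines, which are the ones unique to newfile.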

theNbomr 12-17-2011 12:45 PM

Read all lines from all files, accumulating a count of each line in an associative array. When all the files have been read, the unique lines will have a count of 1. Apply that test when printing your result:
Code:

awk '{lineCounts[$0]++} END{ for( line in lineCounts ){ if( lineCounts[line] == 1 ){ print line;} } }' newfile databasefile
--- rod.
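
(A quick check of this approach on toy data, with shortened stand-in hashes; since awk's for-in traversal order is unspecified, the output is piped through sort:)

```shell
# Toy data: databasefile repeats the first two lines of newfile.
printf 'a;x.pdf\nb;y.pdf\nc;z.pdf\n' > newfile
printf 'a;x.pdf\nb;y.pdf\n' > databasefile

# Count every whole line across both files; lines seen exactly
# once are the non-duplicates.
awk '{lineCounts[$0]++} END{for(line in lineCounts) if(lineCounts[line]==1) print line}' \
    newfile databasefile | sort
```

Only the c;z.pdf line, which appears in just one file, is printed.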

sopier 12-17-2011 06:02 PM

Hi, theNbomr's command works for this situation because each line is unique. In the real case, though, we sometimes find files with different filenames but the same md5sum, and then it will not work...

I want to look for unique lines based on field $1, not $0.

Any other suggestions? Thank you...

Reuti 12-18-2011 04:06 PM

The join command has an option to print only unmatched lines:
Code:

$ join -v 1 -t ';' -o 1.2 <(sort newfile) <(sort databasefile)
ubuntu.pdf
ubunturef.pdf

If you need the complete line, just leave out the -o option.

sopier 12-18-2011 05:24 PM

Thank you to Reuti and theNbomr...

Both of your solutions work... I have two steps in my code now. The first is to make sure "newfile" contains only lines with unique md5sums, with:

Code:

awk -F ";" 'x !~ $1; {x=$1}' newfile
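
(A caveat on this one-liner, assuming duplicate md5sums may be non-adjacent: it only compares each line's md5 with the line immediately before it, and !~ does a regex match rather than a string comparison. A sketch of the common awk dedup idiom, which catches repeats anywhere in the file by remembering every md5 seen so far:)

```shell
# Toy newfile (shortened hashes) with a non-adjacent repeat of 'a':
printf 'a;x.pdf\nb;z.pdf\na;y.pdf\n' > newfile
# Keep only the first line seen for each md5 field:
awk -F ';' '!seen[$1]++' newfile
```

Here only a;x.pdf and b;z.pdf survive; the later a;y.pdf line is dropped because its md5 was already seen.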
the next step is to compare it with databasefile, either using theNbomr's advice:
Code:

awk '{lineCounts[$0]++} END{ for( line in lineCounts ){ if( lineCounts[line] == 1 ){ print line;} } }' newfile databasefile
or using Reuti's suggestion:
Code:

join -v 1 -t ';' -o 1.2 <(sort newfile) <(sort databasefile)
Finally... thank you all :)

jschiwal 12-18-2011 05:35 PM

Also look at the comm command.
comm -12 <(sort file1) <(sort file2)
will give you a list of common lines.
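
(For the OP's question about non-duplicate lines, the complementary flags apply: -23 suppresses the lines unique to the second file and the common lines, leaving only the lines unique to the first file. A sketch on toy data; temp files are used for the sorted input here, though the thread's <(sort ...) process substitution works the same way under bash:)

```shell
# Toy data (shortened hashes): file2 repeats the first two lines.
printf 'a;x.pdf\nb;y.pdf\nc;z.pdf\n' > file1
printf 'a;x.pdf\nb;y.pdf\n' > file2
sort file1 > file1.sorted
sort file2 > file2.sorted

# Print lines that appear only in file1 (comm needs sorted input).
comm -23 file1.sorted file2.sorted
```

Note that comm, like the whole-line awk count, compares entire lines rather than just the md5 field.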


sopier 12-19-2011 02:38 AM

Thank you jschiwal... :)

Reuti 12-19-2011 03:52 AM

Quote:

Originally Posted by jschiwal (Post 4553244)
Also look at the comm command.
comm -12 <(sort file1) <(sort file2)
will give you a list of common lines.

I must admit that I left out the field on which the match should be made in the join command, as it's the first field by default. AFAICS comm will match complete lines, which might not work for all of the records in the context of the OP.

jschiwal 12-26-2011 02:53 PM

The input files look like lists of hash;filename, probably created by a script or command. I don't think the input will vary between identical runs.

I would create lists of the md5 output and locate duplicates based on the md5sum column:
sort | uniq -w32 -D
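
(A sketch of that pipe on md5sum-style input; note that -w and -D are GNU uniq extensions:)

```shell
# md5sum-style lines: 32-char hash, two spaces, filename.
# Two different filenames share the same hash.
printf '%s\n' \
  'dd7bec79dc95fe49af149a82c9ce092e  a.pdf' \
  '13dbbcb2fd46be8d5858a3f69c5d55ab  c.pdf' \
  'dd7bec79dc95fe49af149a82c9ce092e  b.pdf' > sums

# -w32: compare only the first 32 chars (the hash);
# -D: print every line of each duplicated group.
sort sums | uniq -w32 -D
```

Only the two lines sharing the dd7b... hash are printed, regardless of their filenames.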

