Comparing two files and looking for the same line based on a field - awk
I have two files:

newfile:
Code:
dd7bec79dc95fe49af149a82c9ce092e;GettingStartedWithUbuntu.pdf

databasefile:
Code:
dd7bec79dc95fe49af149a82c9ce092e;GettingStartedWithUbuntu.pdf

This command gives me the filenames whose md5sum (field $1) appears in both files:
Code:
awk 'BEGIN {FS=OFS=";"} FNR==NR{arr[$1];next} $1 in arr {print}' newfile databasefile | awk -F ";" '{print $2}' > temp

Now I want the lines whose md5sum does not appear in the other file. I have already tried this command, but with no luck:
Code:
awk 'BEGIN {FS=OFS=";"} FNR==NR{arr[$1];next} !($1 in arr) {print}' newfile databasefile |
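(Editor's note: one common source of "no luck" with the FNR==NR idiom is file order: the first file named on the command line is the one loaded into the array, and the second is the one filtered against it. Assuming the goal is to print the filenames from newfile whose md5sum does not appear in databasefile, a minimal sketch would pass databasefile first:
Code:
awk -F ";" 'FNR==NR {seen[$1]; next} !($1 in seen) {print $2}' databasefile newfile > temp
Here seen[$1] records every md5sum from databasefile; when awk then reads newfile, any line whose $1 was never recorded has its filename printed.)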
Read all lines from all files, accumulating a count for each line in an associative array. When all the files have been read, the unique lines will have a count of 1. Use that test before printing your result:
Code:
awk '{lineCounts[$0]++} END{ for( line in lineCounts ){ if( lineCounts[line] == 1 ){ print line;} } }' newfile databasefile |
Hi, theNbomr's command works for this situation because each line is unique. In the real case we sometimes find files with different filenames but the same md5sum, and then it will not work...
I want to look for unique lines based on field $1, not $0. Any other suggestion? Thank you... |
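(Editor's note: the same counting idea can be keyed on field $1 instead of the whole line. This is only a sketch; it reports any md5sum that occurs exactly once across both files, whichever file it came from:
Code:
awk -F ";" '{count[$1]++; line[$1]=$0} END{for(k in count) if(count[k]==1) print line[k]}' newfile databasefile
The count array tallies each md5sum while line remembers one full record for it; the END block prints the records whose md5sum was seen only once.)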
The join command has an option to print only unmatched lines:
Code:
$ join -v 1 -t ';' -o 1.2 <(sort newfile) <(sort databasefile) |
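(Editor's note: a quick illustration with made-up entries, assuming newfile contains one md5sum that databasefile lacks. -v 1 prints the unpairable lines from the first file and -o 1.2 outputs only their second field:
Code:
$ cat newfile
dd7bec79dc95fe49af149a82c9ce092e;GettingStartedWithUbuntu.pdf
ffffffffffffffffffffffffffffffff;OnlyInNewfile.pdf
$ cat databasefile
dd7bec79dc95fe49af149a82c9ce092e;GettingStartedWithUbuntu.pdf
$ join -v 1 -t ';' -o 1.2 <(sort newfile) <(sort databasefile)
OnlyInNewfile.pdf
Note that join needs its inputs sorted on the join field, which is why both files go through sort first.)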
Thank you to Reuti and theNbomr...
Both of your solutions work... I have two steps now in my code. The first is to make sure "newfile" contains only unique md5sums, with:
Code:
awk -F ";" 'x !~ $1; {x=$1}' newfile
and the second is either of these:
Code:
awk '{lineCounts[$0]++} END{ for( line in lineCounts ){ if( lineCounts[line] == 1 ){ print line;} } }' newfile databasefile
Code:
join -v 1 -t ';' -o 1.2 <(sort newfile) <(sort databasefile) |
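(Editor's note: the first step above, 'x !~ $1; {x=$1}', only compares each md5sum with the one on the previous line, and does so as a regular expression match, so it only drops consecutive duplicates. A sketch of a dedup that keeps the first occurrence of each md5sum regardless of line order:
Code:
awk -F ";" '!seen[$1]++' newfile
The seen array counts occurrences of each md5sum; the line is printed only when its count was still zero, i.e. the first time that md5sum appears.)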
Also look at the comm command.
comm -12 <(sort file1) <(sort file2) will give you a list of common lines. |
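(Editor's note: for the "only in newfile" case the flags would be -23 rather than -12. comm's column 1 holds lines unique to the first file, column 2 lines unique to the second, column 3 common lines, and each numbered flag suppresses that column. Like join, comm needs sorted input and compares whole lines, not just the md5sum field:
Code:
comm -23 <(sort newfile) <(sort databasefile)
This prints the lines that appear in newfile but not in databasefile.)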
Thank you jschiwal... :)
|
The input files look like lists of hash;filename, probably created by a script or command, so I don't think the lines will vary for identical inputs.
I will create lists of the md5 output to locate duplicates based on the md5sum column:
Code:
sort | uniq -w32 -D
|
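(Editor's note: a rough illustration with hypothetical filenames. -w32 makes uniq compare only the first 32 characters of each line, i.e. the md5 hash, and -D prints every line that belongs to a group of duplicates:
Code:
$ md5sum *.pdf | sort | uniq -w32 -D
dd7bec79dc95fe49af149a82c9ce092e  GettingStartedWithUbuntu.pdf
dd7bec79dc95fe49af149a82c9ce092e  copy-of-the-guide.pdf
Sorting first is needed because uniq only looks at adjacent lines.)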