Old 12-17-2011, 06:21 AM   #1
sopier
Member
 
Registered: Dec 2011
Location: Jogja, Indonesia
Distribution: Ubuntu
Posts: 33

Rep: Reputation: Disabled
Comparing two files and looking for the same line based on field - awk


I have two files:

newfile:
Code:
dd7bec79dc95fe49af149a82c9ce092e;GettingStartedWithUbuntu.pdf
13dbbcb2fd46be8d5858a3f69c5d55ab;serverguide.pdf
5a160b28e7b09c32e1cdecb5fdb1f7cc;ubuntu.pdf
8d4cd57f498c17a91e93fd3d9c39192b;ubunturef.pdf
databasefile:
Code:
dd7bec79dc95fe49af149a82c9ce092e;GettingStartedWithUbuntu.pdf
13dbbcb2fd46be8d5858a3f69c5d55ab;serverguide.pdf
Using this command, I can search for duplicate files and remove them:
Code:
# print lines of databasefile whose md5sum (field 1) also appears in newfile,
# then keep only the file name (field 2)
awk 'BEGIN {FS=OFS=";"} FNR==NR{arr[$1];next} $1 in arr {print}' newfile databasefile | awk -F ";" '{print $2}' > temp

num=$(wc -w < temp)

if [ "$num" -gt 0 ]; then
    xargs rm < temp
fi

rm temp
The question now is: how can I find the non-duplicate lines between newfile and databasefile?

I already tried this command, but with no luck:

Code:
awk 'BEGIN {FS=OFS=";"} FNR==NR{arr[$1];next} !($1 in arr) {print}' newfile databasefile
Thank you...
 
Old 12-17-2011, 12:45 PM   #2
theNbomr
LQ 5k Club
 
Registered: Aug 2005
Distribution: OpenSuse, Fedora, Redhat, Debian
Posts: 5,399
Blog Entries: 2

Rep: Reputation: 908
Read all lines from all files, accumulating counts of lines found in an associative array. When all the files have been read, the unique lines will have a count of '1'. You must use this test before printing your result:
Code:
awk '{lineCounts[$0]++} END{ for( line in lineCounts ){ if( lineCounts[line] == 1 ){ print line;} } }' newfile databasefile
--- rod.
 
Old 12-17-2011, 06:02 PM   #3
sopier
Member
 
Registered: Dec 2011
Location: Jogja, Indonesia
Distribution: Ubuntu
Posts: 33

Original Poster
Rep: Reputation: Disabled
Hi, theNbomr's command works for this situation because each line is unique. In the real case, we sometimes find files with different file names but the same md5sum, and then theNbomr's command will not work...

I want to look for unique lines based on field $1, not on $0.

Any other suggestion? Thank you...

Last edited by sopier; 12-18-2011 at 04:53 AM. Reason: code need improvement
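
A minimal sketch of how theNbomr's counting approach could be keyed on the md5 field ($1) instead of the whole line; the file names are the ones from the thread, everything else is an untested illustration:
Code:
# count how often each md5sum (field 1) occurs across both files,
# remembering one full line per md5sum
awk -F ";" '{count[$1]++; line[$1] = $0}
     END { for (sum in count) if (count[sum] == 1) print line[sum] }' newfile databasefile
Storing $2 instead of $0 in line[$1] would print just the file names.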
 
Old 12-18-2011, 04:06 PM   #4
Reuti
Senior Member
 
Registered: Dec 2004
Location: Marburg, Germany
Distribution: openSUSE 15.2
Posts: 1,339

Rep: Reputation: 260
The join command has an option to print only unmatched lines:
Code:
$ join -v 1 -t ';' -o 1.2 <(sort newfile) <(sort databasefile)
ubuntu.pdf
ubunturef.pdf
If you need the complete line, just leave out the -o option.
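
For illustration, the unmatched entries could also be captured and, say, appended to the database; the file name newentries and the append step are assumptions, not part of Reuti's suggestion:
Code:
# entries present only in newfile (md5sum not found in databasefile)
join -v 1 -t ';' <(sort newfile) <(sort databasefile) > newentries

# hypothetical follow-up: add them to the database
cat newentries >> databasefile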
 
Old 12-18-2011, 05:24 PM   #5
sopier
Member
 
Registered: Dec 2011
Location: Jogja, Indonesia
Distribution: Ubuntu
Posts: 33

Original Poster
Rep: Reputation: Disabled

Thank you to Reuti and theNbomr...

Both of your solutions work... I have two steps now in my code. The first is to make sure "newfile" contains unique md5sums only, with:

Code:
awk -F ";" 'x !~ $1; {x=$1}' newfile
The next step is to compare them with databasefile, either using theNbomr's advice:
Code:
awk '{lineCounts[$0]++} END{ for( line in lineCounts ){ if( lineCounts[line] == 1 ){ print line;} } }' newfile databasefile
or, using Reuti's suggestion:
Code:
join -v 1 -t ';' -o 1.2 <(sort newfile) <(sort databasefile)
Finally... thank you all
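
A minimal end-to-end sketch of those two steps; the intermediate file name uniq_newfile is made up, and the first step uses an associative-array filter instead of the adjacent-line check so the input does not have to be sorted:
Code:
#!/bin/bash
# step 1: keep only the first entry per md5sum in newfile
awk -F ";" '!seen[$1]++' newfile > uniq_newfile

# step 2: print the entries whose md5sum is not already in databasefile
join -v 1 -t ';' -o 1.2 <(sort uniq_newfile) <(sort databasefile)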
 
Old 12-18-2011, 05:35 PM   #6
jschiwal
LQ Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 682
Also look at the comm command.
comm -12 <(sort file1) <(sort file2)
will give you a list of common lines.
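
For the non-duplicate direction asked about earlier in the thread, comm's other suppression flags would apply; a small sketch, with the caveat that comm compares whole md5;filename lines:
Code:
# lines that appear only in newfile (suppress the column for file 2 and the common column)
comm -23 <(sort newfile) <(sort databasefile)

# lines that appear only in databasefile
comm -13 <(sort newfile) <(sort databasefile)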

 
Old 12-19-2011, 02:38 AM   #7
sopier
Member
 
Registered: Dec 2011
Location: Jogja, Indonesia
Distribution: Ubuntu
Posts: 33

Original Poster
Rep: Reputation: Disabled
Thank you jschiwal...
 
Old 12-19-2011, 03:52 AM   #8
Reuti
Senior Member
 
Registered: Dec 2004
Location: Marburg, Germany
Distribution: openSUSE 15.2
Posts: 1,339

Rep: Reputation: 260
Quote:
Originally Posted by jschiwal View Post
Also look at the comm command.
comm -12 <(sort file1) <(sort file2)
will give you a list of common lines.
I must admit that I left out the field on which the match should be made in the join command, as it’s the first by default. AFAICS comm will match complete lines, which might not work for all of the records in the context of the OP.
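
A small illustration of that difference, using two invented records that share an md5sum but differ in file name:
Code:
$ cat fileA
0123456789abcdef0123456789abcdef;report.pdf
$ cat fileB
0123456789abcdef0123456789abcdef;report-copy.pdf

$ comm -12 fileA fileB        # whole-line comparison: no common lines
$ join -t ';' fileA fileB     # field-1 comparison: the md5sums match
0123456789abcdef0123456789abcdef;report.pdf;report-copy.pdf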
 
Old 12-26-2011, 02:53 PM   #9
jschiwal
LQ Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 682
The input files look like lists of hash;filename, probably created by a script or command. I don't think the input will vary between identical inputs.

I would create lists of the md5 output and locate duplicates based on the md5sum column:
sort | uniq -w32 -D
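
A concrete sketch of that approach, assuming the two lists are simply concatenated first; -w32 restricts the comparison to the 32-character md5sum and -D prints every line whose md5sum occurs more than once:
Code:
# entries whose md5sum appears more than once across both files
cat newfile databasefile | sort | uniq -w32 -D

# with -u instead: entries whose md5sum is unique across both files
cat newfile databasefile | sort | uniq -w32 -u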
 
  



Tags: awk, join


