12-17-2011, 06:21 AM | #1
Member
Registered: Dec 2011
Location: Jogja, Indonesia
Distribution: Ubuntu
Posts: 33
Comparing two files and looking for the same line based on field - awk
I have two files:
newfile:
Code:
dd7bec79dc95fe49af149a82c9ce092e;GettingStartedWithUbuntu.pdf
13dbbcb2fd46be8d5858a3f69c5d55ab;serverguide.pdf
5a160b28e7b09c32e1cdecb5fdb1f7cc;ubuntu.pdf
8d4cd57f498c17a91e93fd3d9c39192b;ubunturef.pdf
databasefile:
Code:
dd7bec79dc95fe49af149a82c9ce092e;GettingStartedWithUbuntu.pdf
13dbbcb2fd46be8d5858a3f69c5d55ab;serverguide.pdf
Using this command, I can search for duplicate files and remove them:
Code:
awk 'BEGIN {FS=OFS=";"} FNR==NR{arr[$1];next} $1 in arr {print}' newfile databasefile | awk -F ";" '{print $2}' > temp
num=$(wc -w temp | awk '{print $1}')
if [ "$num" != 0 ]; then
    xargs rm < temp
fi
rm temp
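For anyone puzzled by the FNR==NR part, here is the same pipeline spelled out with comments (just a readability sketch, same behaviour):
Code:
# While the first file (newfile) is being read, FNR==NR is true, so each
# md5sum is stored as a key in arr and the line is skipped with next.
# For the second file (databasefile), lines whose md5sum ($1) is already
# a key in arr are printed, i.e. the entries that appear in both lists.
awk 'BEGIN {FS=OFS=";"}
     FNR==NR {arr[$1]; next}
     $1 in arr {print}' newfile databasefile |
awk -F ';' '{print $2}' > temp    # keep only the filename column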
The question now is: how can I find the non-duplicate lines between newfile and databasefile?
I already tried this command, but with no luck:
Code:
awk 'BEGIN {FS=OFS=";"} FNR==NR{arr[$1];next} !($1 in arr) {print}' newfile databasefile
Thank you...
12-17-2011, 12:45 PM | #2
LQ 5k Club
Registered: Aug 2005
Distribution: OpenSuse, Fedora, Redhat, Debian
Posts: 5,399
Read all lines from all files, accumulating a count of each line in an associative array. When all the files have been read, the unique lines will have a count of '1'; apply that test before printing your result:
Code:
awk '{lineCounts[$0]++} END{ for( line in lineCounts ){ if( lineCounts[line] == 1 ){ print line;} } }' newfile databasefile
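The same one-liner reformatted with comments, purely for readability:
Code:
awk '
    { lineCounts[$0]++ }                 # count every occurrence of each whole line
    END {
        for (line in lineCounts)
            if (lineCounts[line] == 1)   # seen exactly once across both files
                print line
    }
' newfile databasefile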
--- rod.
12-17-2011, 06:02 PM | #3
Member
Registered: Dec 2011
Location: Jogja, Indonesia
Distribution: Ubuntu
Posts: 33
Original Poster
Hi, theNbomr's command works for this situation because each line is unique. In the real case, though, we sometimes find files with different filenames but the same md5sum, and then it will not work...
I want to look for unique lines based on field $1, not on $0.
Any other suggestion? Thank you...
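In case it helps to show what I mean, a rough sketch of the direction I am thinking of (untested), keying the count on the md5sum field and keeping one representative line per hash:
Code:
awk -F ';' '
    { count[$1]++; line[$1] = $0 }        # count each md5sum, remember one line for it
    END {
        for (h in count)
            if (count[h] == 1)            # hash seen only once across both files
                print line[h]
    }
' newfile databasefile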
Last edited by sopier; 12-18-2011 at 04:53 AM.
Reason: code need improvement
12-18-2011, 04:06 PM | #4
Senior Member
Registered: Dec 2004
Location: Marburg, Germany
Distribution: openSUSE 15.2
Posts: 1,339
The join command has an option to print only unmatched lines:
Code:
$ join -v 1 -t ';' -o 1.2 <(sort newfile) <(sort databasefile)
ubuntu.pdf
ubunturef.pdf
If you need the complete line, just leave out the -o option.
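For reference, what the options are doing here:
Code:
# -v 1   : print only the lines of file 1 (newfile) that have no match in file 2
# -t ';' : use ';' as the field separator
# -o 1.2 : output only field 2 of file 1 (the filename)
# both inputs must be sorted on the join field, hence the <(sort ...)
join -v 1 -t ';' -o 1.2 <(sort newfile) <(sort databasefile)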
12-18-2011, 05:24 PM | #5
Member
Registered: Dec 2011
Location: Jogja, Indonesia
Distribution: Ubuntu
Posts: 33
Original Poster
Thank you to Reuti and theNbomr...
Both of your solutions work... My code now has two steps. The first is to make sure "newfile" contains only unique md5sums:
Code:
awk -F ";" 'x !~ $1; {x=$1}' newfile
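This form only catches duplicates on adjacent lines, so it assumes newfile is sorted by md5sum; an alternative that does not need sorting would be something like:
Code:
awk -F ';' '!seen[$1]++' newfile    # print a line only the first time its md5sum is seen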
The next step is to compare it with databasefile, either using theNbomr's advice:
Code:
awk '{lineCounts[$0]++} END{ for( line in lineCounts ){ if( lineCounts[line] == 1 ){ print line;} } }' newfile databasefile
or using Reuti's suggestion:
Code:
join -v 1 -t ';' -o 1.2 <(sort newfile) <(sort databasefile)
Finally... thank you all 
12-18-2011, 05:35 PM | #6
LQ Guru
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733
Also look at the comm command.
Code:
comm -12 <(sort file1) <(sort file2)
will give you a list of common lines.
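For the non-duplicate lines the question asked about, comm can also suppress the common column and show only the lines unique to one file (like -12, this compares whole lines rather than just the md5 field):
Code:
comm -23 <(sort newfile) <(sort databasefile)   # lines only in newfile
comm -13 <(sort newfile) <(sort databasefile)   # lines only in databasefile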
12-19-2011, 02:38 AM | #7
Member
Registered: Dec 2011
Location: Jogja, Indonesia
Distribution: Ubuntu
Posts: 33
Original Poster
Thank you jschiwal... 
12-19-2011, 03:52 AM | #8
Senior Member
Registered: Dec 2004
Location: Marburg, Germany
Distribution: openSUSE 15.2
Posts: 1,339
Quote:
Originally Posted by jschiwal
Also look at the comm command.
comm -12 <(sort file1) <(sort file2)
will give you a list of common lines.
I must admit that I left out the field on which the match should be made in the join command, as it’s the first by default. AFAICS comm will match complete lines, which might not work for all of the records in the context of the OP.
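Spelled out, the same command with the join field named explicitly would look something like:
Code:
# -1 1 -2 1 : join on field 1 of each file (the md5sum); this is the default anyway
join -1 1 -2 1 -v 1 -t ';' -o 1.2 <(sort newfile) <(sort databasefile)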
12-26-2011, 02:53 PM | #9
LQ Guru
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733
The input files look like lists of hash;filename, probably created by a script or command, so I don't think the lines will vary between identical inputs.
I would create lists of the md5sum output and locate duplicates based on the md5sum column:
Code:
sort | uniq -w32 -D
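As a rough sketch (GNU uniq: -w32 compares only the first 32 characters, i.e. the hash; -D prints every member of each duplicate group; -u prints only lines whose hash appears once):
Code:
sort newfile databasefile | uniq -w32 -D    # entries that share an md5sum
sort newfile databasefile | uniq -w32 -u    # entries whose md5sum is unique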