Programming
This forum is for all programming questions. The question does not have to be directly related to Linux, and any language is fair game.
I hope someone can help me with this... I have two large LDIF files (more than 10 GB each) that contain user information. I would like to compare these two large files and print out the uids that are duplicated between them.
I ran the script once, as-is... It finds uids, but later, when I do an ldapsearch for those ids, they exist in one DB but not in the other. Is it really finding duplicates? Not sure if it is working at the moment... Could you please help?
Thanks once again
Quote:
Originally Posted by cmontr
I appreciate the assistance, I will run it and let you know later today.
If one of these LDIF files is for a live ldap server, you could try using ldapsearch to extract the "uid=*" values instead of using grep.
If you want to do a lot of searches, you might consider migrating the LDIF files into a Sleepycat, MySQL or SQLite database and using SQL commands to report the info you want.
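To make the SQLite suggestion concrete, here is a rough sketch; the file names, table names, and the two-line stand-in LDIF files are all made up for illustration:

```shell
# Extract the uid values from each (stand-in) LDIF file.
printf 'uid: alice\nuid: bob\n'  > file1.ldif
printf 'uid: bob\nuid: carol\n' > file2.ldif
grep '^uid:' file1.ldif | awk '{print $2}' > uids1.txt
grep '^uid:' file2.ldif | awk '{print $2}' > uids2.txt

# Load both lists into SQLite and report uids present in both files.
rm -f dupes.db
sqlite3 dupes.db <<'EOF'
CREATE TABLE u1(uid TEXT);
CREATE TABLE u2(uid TEXT);
.import uids1.txt u1
.import uids2.txt u2
SELECT u1.uid FROM u1 JOIN u2 USING (uid);
EOF
```

With the stand-in data above, the join reports only bob, the uid present in both tables.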
Using gawk you can do something like
Code:
/^uid/ { uids[$2]++
}
END { for ( i in uids )
if ( uids[i] > 1 )
print i
}
This assumes that there are no duplicate uids in the same file. Otherwise you can try this
Code:
/^uid/ { if ( ARGIND == 2 )
uids[$2]++
else
uids[$2] = 1
}
END { for ( i in uids )
if ( uids[i] > 1 )
print i
}
where the array element is incremented only while reading the second file. The ARGIND built-in variable is a GNU awk extension. You can run this program using the -f option to gawk and passing the two files as arguments. While reading the first file, ARGIND will be set to 1; while processing the second file, ARGIND will be set to 2. For compatibility with other awk implementations, use the FILENAME built-in variable instead.
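Picking up the FILENAME remark, here is a sketch of a portable variant of the second script (the data files are stand-ins; the extra `$2 in uids` test also avoids counting a uid that is duplicated within the second file alone):

```shell
# A portable version using FILENAME instead of GNU awk's ARGIND.
cat > dupes_portable.awk <<'EOF'
FILENAME != prev { nfile++; prev = FILENAME }   # a new input file started
/^uid/ { if (nfile == 2) { if ($2 in uids) uids[$2]++ }
         else uids[$2] = 1 }
END    { for (i in uids) if (uids[i] > 1) print i }
EOF
printf 'uid: alice\nuid: bob\n'  > file1.ldif   # stand-in data
printf 'uid: bob\nuid: carol\n' > file2.ldif
awk -f dupes_portable.awk file1.ldif file2.ldif
```

With the stand-in data this prints bob, the only uid seen in both files.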
Quote:
But later, when I do an ldapsearch for those ids, they exist in one DB but not in the other. Is it really finding duplicates? Not sure if it is working at the moment... Could you please help?
It *should* work ... of course, if the uid's are like
User123 and USER123 it won't pick them up as dupes ...
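One way to cover that case, assuming it is safe to treat these uids case-insensitively, is to fold the uid to lower case before counting; the .ldif files below are stand-ins:

```shell
printf 'uid: User123\n' > f1.ldif    # stand-in data, mixed case
printf 'uid: USER123\n' > f2.ldif
awk '/^uid/ { uids[tolower($2)]++ }
     END    { for (i in uids) if (uids[i] > 1) print i }' f1.ldif f2.ldif
```

This prints user123, so the two spellings are now counted as the same uid.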
If one of these LDIF files is for a live ldap server, you could try using ldapsearch to extract the "uid=*" values instead of using grep.
If you want to do a lot of searches, you might consider migrating the LDIF files into a Sleepycat, MySQL or SQLite database and using SQL commands to report the info you want.
Could you give me an example of running the awk script against those 2 files, please?
Thanks again
Quote:
Originally Posted by colucix
Using gawk you can do something like
Code:
/^uid/ { uids[$2]++
}
END { for ( i in uids )
if ( uids[i] > 1 )
print i
}
This assumes that there are no duplicate uids in the same file. Otherwise you can try this
Code:
/^uid/ { if ( ARGIND == 2 )
uids[$2]++
else
uids[$2] = 1
}
END { for ( i in uids )
if ( uids[i] > 1 )
print i
}
where the array element is incremented only while reading the second file. The ARGIND built-in variable is a GNU awk extension. You can run this program using the -f option to gawk and passing the two files as arguments. While reading the first file, ARGIND will be set to 1; while processing the second file, ARGIND will be set to 2. For compatibility with other awk implementations, use the FILENAME built-in variable instead.
Mmm, I am not sure what you meant really; I'd appreciate it if you could explain. The purpose is to print the duplicates in those 2 large files. Then I should be able to ldapsearch those uids and see if they exist in both databases. Please let me know.
Quote:
Originally Posted by Tinkster
It *should* work ... of course, if the uid's are like
User123 and USER123 it won't pick them up as dupes ...
Could you give me an example of running the awk script against those 2 files, please?
Thanks again
You asked for awk, so I thought you knew how to run it. Anyway, either store the script in a file, e.g. dupes.awk, and launch it with the two files as arguments
Code:
gawk -f dupes.awk file1 file2
or type it directly on the command line
Code:
gawk '
/^uid/ { uids[$2]++
}
END { for ( i in uids )
if ( uids[i] > 1 )
print i
}' file1 file2
The first method is better, since you will probably need to make some modifications to the awk script. Mine is only an example of how gawk can be used to solve this problem, given the conditions I explained in my previous post. Cheers!
Thanks for the help, but I am still having an issue with this:
When I used "grep '^uid: ' FILE1 | uniq >UIDS1"
the following error was received.
grep: illegal option -- f
Usage: grep -hblcnsviw pattern file
and when I used this "comm -12 UIDS1 UIDS2 >DuplicateUids"
I got ids printed from UIDS1, but when I did an ldapsearch for a random sample of them, they were not duplicates. They didn't exist in the other DB.
I am kind of stuck at the moment. Any ideas would be very much appreciated.
It looks like your grep doesn't support the "-f" option.
Before you used the "comm" program, did you use the grep lines that piped the output through sort? The two files need to be sorted. You could do something like
comm -12 <(grep '^uid:' file1 | sort) <(grep '^uid:' file2 | sort)
The comm command prints out 3 columns. 1st: unique to file 1; 2nd: unique to file2; 3rd: common to both.
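To illustrate with two tiny, already-sorted stand-in lists:

```shell
printf 'alice\nbob\n' > s1.txt      # stand-in sorted list 1
printf 'bob\ncarol\n' > s2.txt      # stand-in sorted list 2
comm s1.txt s2.txt       # all three columns, tab-indented
comm -12 s1.txt s2.txt   # suppress columns 1 and 2: lines common to both
```

The last command prints only bob here, the entry common to both lists.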
I don't use LDAP, so I'm at a bit of a disadvantage. Are the LDIF files something that is imported and incorporated into whatever backend the LDAP server uses? If so, are you sure that they are current?
Are you comparing the uids of users on two different LDAP servers? If so, why not run ldapsearch on both to produce the two lists? That way you will start with smaller files to work with that you know are current.
Quote:
I got ids printed from UIDS1, but when I did an ldapsearch for a random sample of them, they were not duplicates. They didn't exist in the other DB.
Take a couple of these examples and search on the corresponding ldif file that you started with.
This is something that should be verified.
Last edited by jschiwal; 05-24-2008 at 05:07 PM.
Reason: fixed another typo
I ran the script once, as-is... It finds uids, but later, when I do an ldapsearch for those ids, they exist in one DB but not in the other. Is it really finding duplicates? Not sure if it is working at the moment... Could you please help?
Thanks once again
Hmmm... it should have worked, I guess, unless sort and/or uniq has its limits... or perhaps you forgot to add the option '-d' to uniq?
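A sketch of that route with stand-in data; uniq only collapses adjacent lines, so the sort is essential, and this assumes no uid is duplicated within a single file:

```shell
printf 'uid: alice\nuid: bob\n'  > file1.ldif   # stand-in data
printf 'uid: bob\nuid: carol\n' > file2.ldif
# A uid common to both files appears exactly twice in the merged,
# sorted stream, so uniq -d prints it once.
grep -h '^uid:' file1.ldif file2.ldif | sort | uniq -d
```

With the stand-in data this prints "uid: bob", the entry present in both files.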
I appreciate all the assistance, but it looks like the server doesn't have gawk. I tried with nawk but it is bailing out as well.
Code:
nawk >>> ' <<<
nawk: bailing out at source line 6
usacpicnto01: oracle(vols1) /bu1/ora
Could you please show me where I am making a mistake? All I did was replace FILE1 and FILE2 with the actual file names.
This is a Solaris server sparc SUNW,Sun-Fire-480R
Thanks millions
Quote:
Originally Posted by colucix
You asked for awk, I thought you knew how to run it. Anyway, either store the script in a file, e.g. dupes.awk, then launch this script giving the two files as arguments
Code:
gawk -f dupes.awk file1 file2
or directly from the command line digit
Code:
gawk '
/^uid/ { uids[$2]++
}
END { for ( i in uids )
if ( uids[i] > 1 )
print i
}' file1 file2
Better the first method, since you probably need to do some modifications to the awk script. Mine is only an example on how gawk can be used to solve this problem, given the conditions I explained in my previous post. Cheers!
What have you tried exactly? I tested the following
Code:
/^uid/ { uids[$2]++
}
END { for ( i in uids )
if ( uids[i] > 1 )
print i
}
and it works with both awk and nawk on Solaris SPARC 5.8. Please update your profile with your distribution, or say which system the issue applies to, especially for a non-Linux OS. This will bring more pertinent answers.
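For what it's worth, the stray quote in the earlier error (`nawk >>> ' <<<`) suggests the shell quoting got mangled rather than the script itself failing. The complete invocation should look like this, with the whole program inside one pair of single quotes (stand-in data; plain awk is used here, and nawk takes the same form):

```shell
printf 'uid: alice\nuid: bob\n'  > file1.ldif   # stand-in data
printf 'uid: bob\nuid: carol\n' > file2.ldif
awk '/^uid/ { uids[$2]++ }
     END    { for ( i in uids ) if ( uids[i] > 1 ) print i }' file1.ldif file2.ldif
```

With the stand-in data this prints bob, the uid that occurs in both files.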