Programming: This forum is for all programming questions. The question does not have to be directly related to Linux, and any language is fair game.
05-22-2008, 07:46 PM  #1
Member | Registered: Sep 2007 | Posts: 175
Shell Script / Awk help for a challenge
I hope someone can help me with this... I have two large ldif files (more than 10 GB each) that contain user information. I would like to match these two large files and print out the uids that appear in both.
In the ldif files, user information looks like:
dn: uid=USER123,ou=cnet,o=cbc.com
uid: USER123
cbcdomain: cbc.net
I appreciate the help.
05-22-2008, 10:49 PM  #2
Senior Member | Registered: Oct 2005 | Distribution: Gentoo, Slackware, LFS | Posts: 2,248
my solution:
(a) extract the ids from the file
Code:
grep '^uid: ' FILE1 FILE2 | cut -f 2 -d ' ' > userids
(b) sort the file (this may take a large amount of time and memory)
Code:
sort userids > userids.sorted
(c) extract the duplicate entries:
Code:
uniq -d userids.sorted > userids.dups
You can do everything in one shot, but it might require a large amount of memory, time and CPU (and could bog down your PC):
Code:
grep '^uid: ' FILE1 FILE2 | cut -f 2 -d ' ' | sort | uniq -d > userids.dups
Edit: By the way, for the step-by-step process you can also compress the intermediate files to save disk space:
Code:
grep '^uid: ' FILE1 FILE2 | cut -f 2 -d ' ' | gzip -c -9 > userids.gz
zcat userids.gz | sort | gzip -c -9 > userids.sorted.gz
zcat userids.sorted.gz | uniq -d | gzip -c -9 > userids.dups.gz
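For reference, here is how the one-shot pipeline behaves on a tiny sample (file names and contents are invented for illustration). Note that uniq -d flags an id duplicated anywhere in the combined stream, including twice within one file:

```shell
# Two tiny sample LDIF files (names and contents invented)
cat > /tmp/sample1.ldif <<'EOF'
dn: uid=USER123,ou=cnet,o=cbc.com
uid: USER123
dn: uid=USER456,ou=cnet,o=cbc.com
uid: USER456
EOF
cat > /tmp/sample2.ldif <<'EOF'
dn: uid=USER123,ou=cnet,o=cbc.com
uid: USER123
dn: uid=USER789,ou=cnet,o=cbc.com
uid: USER789
EOF
# Same one-shot pipeline as above; only USER123 appears in both files
grep '^uid: ' /tmp/sample1.ldif /tmp/sample2.ldif | cut -f 2 -d ' ' | sort | uniq -d
```

(With two file arguments, grep prefixes each line with the file name, but since cut splits on the space before the id, field 2 is still the bare uid.)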
Last edited by konsolebox; 05-22-2008 at 10:54 PM.
05-23-2008, 10:48 AM  #3
Member | Registered: Sep 2007 | Posts: 175 | Original Poster
I appreciate the assistance. I will run it and let you know later today.
Quote:
Originally Posted by konsolebox
my solution:
(a) extract the ids from the file
Code:
grep '^uid: ' FILE1 FILE2 | cut -f 2 -d ' ' > userids
(b) sort the file (this may take a large amount of time and memory)
Code:
sort userids > userids.sorted
(c) extract the duplicate entries:
Code:
uniq -d userids.sorted > userids.dups
You can do everything in one shot, but it might require a large amount of memory, time and CPU (and could bog down your PC):
Code:
grep '^uid: ' FILE1 FILE2 | cut -f 2 -d ' ' | sort | uniq -d > userids.dups
Edit: By the way, for the step-by-step process you can also compress the intermediate files to save disk space:
Code:
grep '^uid: ' FILE1 FILE2 | cut -f 2 -d ' ' | gzip -c -9 > userids.gz
zcat userids.gz | sort | gzip -c -9 > userids.sorted.gz
zcat userids.sorted.gz | uniq -d | gzip -c -9 > userids.dups.gz
05-23-2008, 03:31 PM  #4
Member | Registered: Sep 2007 | Posts: 175 | Original Poster
Hi again,
I ran the script as is... It finds uids, but later, when I do an ldapsearch for those ids, each exists in one db but not in the other. Is this really finding the duplicates? I'm not sure it is working at the moment... Could you please help?
Thanks once again
Quote:
Originally Posted by cmontr
I appreciate the assistance. I will run it and let you know later today.
05-23-2008, 04:55 PM  #5
LQ Guru | Registered: Aug 2001 | Location: Fargo, ND | Distribution: SuSE AMD64 | Posts: 15,733
I don't know if you will have memory issues. You could try
grep '^uid: ' FILE1 | uniq >UIDS1
grep '^uid: ' FILE2 | uniq >UIDS2
grep -f UIDS1 FILE2
or
grep '^uid: ' FILE1 | sort | uniq >UIDS1
grep '^uid: ' FILE2 | sort | uniq >UIDS2
comm -12 UIDS1 UIDS2 >DuplicateUids
If one of these LDIF files is for a live ldap server, you could try using ldapsearch to extract the "uid=*" values instead of using grep.
If you want to do a lot of searches, you might consider migrating the ldif files into a Sleepycat, MySQL or SQLite database and using SQL commands to report the info you want.
Last edited by jschiwal; 05-23-2008 at 04:56 PM.
05-23-2008, 05:32 PM  #6
LQ Guru | Registered: Sep 2003 | Location: Bologna | Distribution: CentOS 6.5 OpenSuSE 12.3 | Posts: 10,509
Using gawk you can do something like
Code:
/^uid/ { uids[$2]++ }
END {
    for ( i in uids )
        if ( uids[i] > 1 )
            print i
}
This assumes that there are no duplicate uids in the same file. Otherwise you can try this
Code:
/^uid/ {
    if ( ARGIND == 2 )
        uids[$2]++
    else
        uids[$2] = 1
}
END {
    for ( i in uids )
        if ( uids[i] > 1 )
            print i
}
where the array element is incremented only while reading the second file. Note that the ARGIND built-in variable is a GNU awk extension. You can run this program using the -f option to gawk, passing the two files as arguments: while reading the first file ARGIND will be set to 1, and while processing the second file ARGIND will be set to 2. For portability, use the FILENAME built-in variable instead.
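A portable sketch of the same idea using FILENAME in place of ARGIND (sample files and names are invented; this should work with nawk or any POSIX awk, not just gawk):

```shell
# Tiny sample files (contents invented for illustration)
cat > /tmp/first.ldif <<'EOF'
uid: USER123
uid: USER456
EOF
cat > /tmp/second.ldif <<'EOF'
uid: USER123
uid: USER789
EOF
# FILENAME holds the name of the file currently being read;
# ARGV[1] is the first file argument, so the test distinguishes the two passes
awk '
/^uid:/ {
    if (FILENAME == ARGV[1])
        seen[$2] = 1                  # collect uids from the first file
    else if ($2 in seen && !done[$2]++)
        print $2                      # uid present in both files
}' /tmp/first.ldif /tmp/second.ldif
```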
05-23-2008, 08:31 PM  #7
Moderator | Registered: Apr 2002 | Location: earth | Distribution: slackware by choice, others too :} ... android. | Posts: 23,067
Quote:
It finds uids, but later, when I do an ldapsearch for those ids, each exists in one db but not in the other. Is this really finding the duplicates? I'm not sure it is working at the moment... Could you please help?
|
It *should* work ... of course, if the uids are like
User123 and USER123 it won't pick them up as dupes ...
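One way to catch such near-duplicates is to lower-case the ids before comparing, e.g. with tr (a small sketch; the sample input is invented):

```shell
# Normalize case before looking for duplicates, so User123 and USER123 match.
# cut isolates the id, tr folds it to lower case, then the usual sort | uniq -d.
printf 'uid: User123\nuid: USER123\nuid: other456\n' \
    | cut -f 2 -d ' ' \
    | tr '[:upper:]' '[:lower:]' \
    | sort | uniq -d
```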
Last edited by Tinkster; 05-23-2008 at 08:32 PM.
05-24-2008, 01:44 AM  #8
Member | Registered: Sep 2007 | Posts: 175 | Original Poster
Thanks for the help, but I am still having an issue with this:
When I used "grep '^uid: ' FILE1 | uniq >UIDS1", the following error was received:
grep: illegal option -- f
Usage: grep -hblcnsviw pattern file
And when I used "comm -12 UIDS1 UIDS2 >DuplicateUids", I got ids printed from UIDS1, but when I did an ldapsearch for a random sample of those, they were not duplicates; they didn't exist in the other DB.
I am kind of stuck at the moment. Any ideas would be very much appreciated.
Quote:
Originally Posted by jschiwal
I don't know if you will have memory issues. You could try
grep '^uid: ' FILE1 | uniq >UIDS1
grep '^uid: ' FILE2 | uniq >UIDS2
grep -f UIDS1 FILE2
or
grep '^uid: ' FILE1 | sort | uniq >UIDS1
grep '^uid: ' FILE2 | sort | uniq >UIDS2
comm -12 UIDS1 UIDS2 >DuplicateUids
If one of these LDIF files is for a live ldap server, you could try using ldapsearch to extract the "uid=*" values instead of using grep.
If you want to do a lot of searches, you might consider migrating the ldif files into a Sleepycat, MySQL or SQLite database and using SQL commands to report the info you want.
05-24-2008, 01:46 AM  #9
Member | Registered: Sep 2007 | Posts: 175 | Original Poster
Could you give me an example of running the awk program against those 2 files, please?
Thanks again
Quote:
Originally Posted by colucix
Using gawk you can do something like
Code:
/^uid/ { uids[$2]++ }
END {
    for ( i in uids )
        if ( uids[i] > 1 )
            print i
}
This assumes that there are no duplicate uids in the same file. Otherwise you can try this
Code:
/^uid/ {
    if ( ARGIND == 2 )
        uids[$2]++
    else
        uids[$2] = 1
}
END {
    for ( i in uids )
        if ( uids[i] > 1 )
            print i
}
where the array element is incremented only while reading the second file. Note that the ARGIND built-in variable is a GNU awk extension. You can run this program using the -f option to gawk, passing the two files as arguments: while reading the first file ARGIND will be set to 1, and while processing the second file ARGIND will be set to 2. For portability, use the FILENAME built-in variable instead.
05-24-2008, 01:48 AM  #10
Member | Registered: Sep 2007 | Posts: 175 | Original Poster
Mmm, I am not sure what you meant, really; I'd appreciate it if you could explain. The purpose is to print the duplicates in those 2 large files. Then I should be able to ldapsearch those uids and see if they exist in both databases. Please let me know.
Quote:
Originally Posted by Tinkster
It *should* work ... of course, if the uids are like
User123 and USER123 it won't pick them up as dupes ...
05-24-2008, 03:04 AM  #11
LQ Guru | Registered: Sep 2003 | Location: Bologna | Distribution: CentOS 6.5 OpenSuSE 12.3 | Posts: 10,509
Quote:
Originally Posted by cmontr
Could you give me an example of running the awk program against those 2 files, please?
Thanks again
|
You asked for awk, so I thought you knew how to run it. Anyway, either store the script in a file, e.g. dupes.awk, then launch it giving the two files as arguments
Code:
gawk -f dupes.awk file1 file2
or type it directly on the command line:
Code:
gawk '
/^uid/ { uids[$2]++ }
END {
    for ( i in uids )
        if ( uids[i] > 1 )
            print i
}' file1 file2
The first method is better, since you will probably need to make some modifications to the awk script. Mine is only an example of how gawk can be used to solve this problem, given the conditions I explained in my previous post. Cheers!
05-24-2008, 03:42 AM  #12
LQ Guru | Registered: Aug 2001 | Location: Fargo, ND | Distribution: SuSE AMD64 | Posts: 15,733
Quote:
Originally Posted by cmontr
Thanks for the help, but I am still having an issue with this:
When I used "grep '^uid: ' FILE1 | uniq >UIDS1", the following error was received:
grep: illegal option -- f
Usage: grep -hblcnsviw pattern file
And when I used "comm -12 UIDS1 UIDS2 >DuplicateUids", I got ids printed from UIDS1, but when I did an ldapsearch for a random sample of those, they were not duplicates; they didn't exist in the other DB.
I am kind of stuck at the moment. Any ideas would be very much appreciated.
|
It looks like your grep doesn't support the "-f" option.
Before you used the "comm" program, did you use the grep lines that piped the output through sort? The two files need to be sorted. You could do something like
comm -12 <(grep '^uid:' file1 | sort) <(grep '^uid:' file2 | sort)
The comm command prints out 3 columns: the 1st holds lines unique to file 1, the 2nd lines unique to file 2, and the 3rd lines common to both.
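On a small sample the process-substitution form behaves like this (requires bash; file names and contents are invented for illustration):

```shell
# comm needs sorted input; bash process substitution sorts each stream on the fly
cat > /tmp/uids_a <<'EOF'
uid: USER123
uid: USER456
EOF
cat > /tmp/uids_b <<'EOF'
uid: USER123
uid: USER789
EOF
# -1 and -2 suppress the lines unique to each file, leaving only the common ones
comm -12 <(grep '^uid:' /tmp/uids_a | sort) <(grep '^uid:' /tmp/uids_b | sort)
```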
I don't use ldap, so I'm at a bit of a disadvantage. Are the ldif files something that is imported and incorporated into whatever backend the ldap server uses? If so, are you sure they are current?
Are you comparing the UIDs of users on two different ldap servers? If so, why not run ldapsearch on both to produce the two lists? That way you will start with smaller files to work with that you know are current.
Quote:
I got ids printed from UIDS1, but when I did an ldapsearch for a random sample of those, they were not duplicates; they didn't exist in the other DB.
|
Take a couple of these examples and search the corresponding ldif file that you started with.
This is something that should be verified.
Last edited by jschiwal; 05-24-2008 at 05:07 PM.
Reason: fixed another typo
05-24-2008, 06:54 AM  #13
Senior Member | Registered: Oct 2005 | Distribution: Gentoo, Slackware, LFS | Posts: 2,248
Quote:
Originally Posted by cmontr
Hi again,
I ran the script as is... It finds uids, but later, when I do an ldapsearch for those ids, each exists in one db but not in the other. Is this really finding the duplicates? I'm not sure it is working at the moment... Could you please help?
Thanks once again
|
Hmmm... it should have worked, I guess, unless sort and/or uniq has its limits... or perhaps you forgot to add the option '-d' to uniq?
05-24-2008, 12:00 PM  #14
Member | Registered: Sep 2007 | Posts: 175 | Original Poster
I appreciate all the assistance, but it looks like the server doesn't have gawk. I tried with nawk but it is bailing out as well.
nawk >>> ' <<<
nawk: bailing out at source line 6
usacpicnto01: oracle(vols1) /bu1/ora
Could you please show me where I am making a mistake? All I did was replace FILE1 and FILE2 with the actual file names.
This is a Solaris server sparc SUNW,Sun-Fire-480R
Thanks millions
Quote:
Originally Posted by colucix
You asked for awk, so I thought you knew how to run it. Anyway, either store the script in a file, e.g. dupes.awk, then launch it giving the two files as arguments
Code:
gawk -f dupes.awk file1 file2
or type it directly on the command line:
Code:
gawk '
/^uid/ { uids[$2]++ }
END {
    for ( i in uids )
        if ( uids[i] > 1 )
            print i
}' file1 file2
The first method is better, since you will probably need to make some modifications to the awk script. Mine is only an example of how gawk can be used to solve this problem, given the conditions I explained in my previous post. Cheers!
05-24-2008, 12:36 PM  #15
LQ Guru | Registered: Sep 2003 | Location: Bologna | Distribution: CentOS 6.5 OpenSuSE 12.3 | Posts: 10,509
What have you tried exactly? I tested the following
Code:
/^uid/ { uids[$2]++ }
END {
    for ( i in uids )
        if ( uids[i] > 1 )
            print i
}
and it works with either awk or nawk on Solaris SPARC 5.8. Please update your profile with your distribution, or tell us what system the issue applies to, especially for a non-Linux OS. That will bring more pertinent answers.