LinuxQuestions.org
Old 05-22-2008, 07:46 PM   #1
cmontr
Member
 
Registered: Sep 2007
Posts: 175

Rep: Reputation: 15
Shell Script / Awk help for a challenge


I hope someone can help me with this... I have two large LDIF files (more than 10GB) that contain user information. I would like to match these two large files and print out the duplicate uids that appear in both.

In the LDIF files, a user entry looks like:

dn: uid=USER123,ou=cnet,o=cbc.com
uid: USER123
cbcdomain: cbc.net


I appreciate the help.
 
Old 05-22-2008, 10:49 PM   #2
konsolebox
Senior Member
 
Registered: Oct 2005
Distribution: Gentoo, Slackware, LFS
Posts: 2,248
Blog Entries: 8

Rep: Reputation: 235
my solution:
(a) extract the ids from the file
Code:
grep '^uid: ' FILE1 FILE2 | cut -f 2 -d ' ' > userids
(b) sort the file (this may require a great amount of time and memory)
Code:
sort userids > userids.sorted
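If temp space becomes the bottleneck, GNU sort can be pointed at a roomier temp directory (a sketch; it assumes GNU sort is available, and /var/tmp is just an example path):
Code:
sort -T /var/tmp userids > userids.sorted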
(c) extract the duplicate entries:
Code:
uniq -d userids.sorted > userids.dups
you can do everything in one shot, but it might require a great amount of memory, time and CPU usage (and probably also hang your PC):
Code:
grep '^uid: ' FILE1 FILE2 | cut -f 2 -d ' ' | sort | uniq -d > userids.dups
edit: btw, for the step-by-step process you can also compress the files to save disk space:
Code:
grep '^uid: ' FILE1 FILE2 | cut -f 2 -d ' ' | gzip -c -9 > userids.gz
zcat userids.gz | sort | gzip -c -9 > userids.sorted.gz
zcat userids.sorted.gz | uniq -d | gzip -c -9 > userids.dups.gz

Last edited by konsolebox; 05-22-2008 at 10:54 PM.
 
Old 05-23-2008, 10:48 AM   #3
cmontr
Member
 
Registered: Sep 2007
Posts: 175

Original Poster
Rep: Reputation: 15
I appreciate the assistance. I will run it and let you know later today.

Quote:
Originally Posted by konsolebox View Post
my solution: (a) extract the ids from the file ...
 
Old 05-23-2008, 03:31 PM   #4
cmontr
Member
 
Registered: Sep 2007
Posts: 175

Original Poster
Rep: Reputation: 15
Hi again,

I ran the script as-is once... it finds uids. But later, when I do an ldapsearch for those ids, they exist in one db but not in the other. Is this really finding the duplicates? Not sure if it is working at the moment... Could you please help?

Thanks once again

Quote:
Originally Posted by cmontr View Post
I appreciate the assistance. I will run it and let you know later today.
 
Old 05-23-2008, 04:55 PM   #5
jschiwal
LQ Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 682
I don't know if you will have memory issues. You could try
Code:
grep '^uid: ' FILE1 | uniq >UIDS1
grep '^uid: ' FILE2 | uniq >UIDS2
grep -f UIDS1 FILE2

or

Code:
grep '^uid: ' FILE1 | sort | uniq >UIDS1
grep '^uid: ' FILE2 | sort | uniq >UIDS2
comm -12 UIDS1 UIDS2 >DuplicateUids

If one of these LDIF files is for a live ldap server, you could try using ldapsearch to extract the "uid=*" values instead of using grep.
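
For example (a sketch assuming the OpenLDAP client tools; the base DN comes from the sample entry, and bind options will depend on your directory):
Code:
ldapsearch -x -LLL -b "o=cbc.com" "(uid=*)" uid | grep '^uid: ' | cut -d ' ' -f 2 > UIDS1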

If you want to do a lot of searches, you might consider migrating the LDIF files into a Sleepycat (Berkeley DB), MySQL or SQLite database and using SQL commands to report the info you want.
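
If you go the SQLite route, a minimal sketch (the file and table names are made up; the uid lists are extracted first with the grep from above):
Code:
grep '^uid: ' FILE1 | cut -d ' ' -f 2 > uids1.txt
grep '^uid: ' FILE2 | cut -d ' ' -f 2 > uids2.txt
sqlite3 uids.db <<'EOF'
CREATE TABLE u1 (uid TEXT);
CREATE TABLE u2 (uid TEXT);
.import uids1.txt u1
.import uids2.txt u2
-- uids present in both files
SELECT DISTINCT u1.uid FROM u1 JOIN u2 ON u1.uid = u2.uid;
EOF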

Last edited by jschiwal; 05-23-2008 at 04:56 PM.
 
Old 05-23-2008, 05:32 PM   #6
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1983
Using gawk you can do something like
Code:
/^uid/ { uids[$2]++
}

END { for ( i in uids )
         if ( uids[i] > 1 )
              print i
}
This assumes that there are no duplicate uids in the same file. Otherwise you can try this
Code:
/^uid/ { if ( ARGIND == 2 )
              uids[$2]++ 
         else
              uids[$2] = 1
}

END { for ( i in uids )
         if ( uids[i] > 1 )
              print i
}
where the array element is incremented only while reading the second file. The ARGIND built-in variable is a GNU awk extension. You can run this program using the -f option to gawk and passing the two files as arguments. While reading the first file, ARGIND will be set to 1; while processing the second file, ARGIND will be set to 2. For compatibility with other awks, use the FILENAME built-in variable instead; a sketch follows.
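
For example, a portable variant using FILENAME and ARGV (a sketch for nawk/gawk; it assumes the two files have distinct names, and it prints only uids present in both files):
Code:
# dupes.awk - mark uids seen in the first file, report those also in the second
/^uid/ {
    if (FILENAME == ARGV[1])
        uids[$2] = 1
    else if ($2 in uids)
        dups[$2] = 1
}
END { for (i in dups) print i }
Run it as: nawk -f dupes.awk file1 file2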
 
Old 05-23-2008, 08:31 PM   #7
Tinkster
Moderator
 
Registered: Apr 2002
Location: earth
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067
Blog Entries: 11

Rep: Reputation: 928
Quote:
But later, when I do an ldapsearch for those ids, they exist in one db but not in the other. Is this really finding the duplicates? Not sure if it is working at the moment... Could you please help?
It *should* work ... of course, if the uids are like
User123 and USER123, it won't pick them up as dupes ...
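
For instance, a case-insensitive variant of the earlier pipeline (a sketch; tr 'A-Z' 'a-z' is used for portability to older systems):
Code:
grep '^uid: ' FILE1 FILE2 | cut -f 2 -d ' ' | tr 'A-Z' 'a-z' | sort | uniq -d > userids.dups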

Last edited by Tinkster; 05-23-2008 at 08:32 PM.
 
Old 05-24-2008, 01:44 AM   #8
cmontr
Member
 
Registered: Sep 2007
Posts: 175

Original Poster
Rep: Reputation: 15
Thanks for the help, but I am still having an issue with this:
when I used "grep '^uid: ' FILE1 | uniq >UIDS1"

the following error was received:

grep: illegal option -- f
Usage: grep -hblcnsviw pattern file

and when I used "comm -12 UIDS1 UIDS2 >DuplicateUids"

I got ids printed from UIDS1, but when I did an ldapsearch for a random sample of those, they were not duplicates. They didn't exist in the other DB.

I am kind of stuck at the moment. Any idea would be very much appreciated.


Quote:
Originally Posted by jschiwal View Post
I don't know if you will have memory issues. You could try ...
 
Old 05-24-2008, 01:46 AM   #9
cmontr
Member
 
Registered: Sep 2007
Posts: 175

Original Poster
Rep: Reputation: 15
Could you give me an example of running the awk script against those two files, please?

Thanks again


Quote:
Originally Posted by colucix View Post
Using gawk you can do something like ...
 
Old 05-24-2008, 01:48 AM   #10
cmontr
Member
 
Registered: Sep 2007
Posts: 175

Original Poster
Rep: Reputation: 15
Mmm, I am not sure what you mean really; I'd appreciate it if you could explain. The purpose is to print the duplicates in those 2 large files. Then I should be able to run ldapsearch on those uids and see if they exist in both databases. Please let me know.


Quote:
Originally Posted by Tinkster View Post
It *should* work ... of course, if the uids are like
User123 and USER123, it won't pick them up as dupes ...
 
Old 05-24-2008, 03:04 AM   #11
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1983
Quote:
Originally Posted by cmontr View Post
Could you give me an example of running the awk script against those two files, please?

Thanks again
You asked for awk; I thought you knew how to run it. Anyway, either store the script in a file, e.g. dupes.awk, and launch it giving the two files as arguments
Code:
gawk -f dupes.awk file1 file2
or type it directly on the command line:
Code:
gawk '
/^uid/ { uids[$2]++
}
END { for ( i in uids )
         if ( uids[i] > 1 )
              print i
}' file1 file2
The first method is better, since you will probably need to make some modifications to the awk script. Mine is only an example of how gawk can be used to solve this problem, given the conditions I explained in my previous post. Cheers!
 
Old 05-24-2008, 03:42 AM   #12
jschiwal
LQ Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 682
Quote:
Originally Posted by cmontr View Post
Thanks for the help, but I am still having an issue with this ...
It looks like your grep doesn't support the "-f" option.
Before you used the "comm" program, did you use the grep lines that piped the output through sort? The two files need to be sorted. You could do something like
Code:
comm -12 <(grep '^uid:' file1 | sort) <(grep '^uid:' file2 | sort)
(note that the <( ) process substitution requires a shell such as bash or ksh93)

The comm command prints out 3 columns: the 1st has lines unique to file 1; the 2nd, lines unique to file 2; the 3rd, lines common to both.
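
A toy illustration (the files and their contents are made up):
Code:
$ printf 'a\nb\nc\n' > f1
$ printf 'b\nc\nd\n' > f2
$ comm -12 f1 f2     # suppress columns 1 and 2: only lines common to both remain
b
c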

I don't use ldap, so I'm at a bit of a disadvantage. Are the ldif files something that is imported and incorporated into whatever backend the ldap server uses? If so, are you sure that they are current?
Are you comparing the UIDs of users on two different ldap servers? If so, why not run ldapsearch on both to produce the two lists? That way you will start with smaller files to work with that you know are current.
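
A sketch of that workflow (the host names and base DN are assumptions, and OpenLDAP client flags are assumed):
Code:
ldapsearch -x -LLL -h server1 -b "o=cbc.com" "(uid=*)" uid | grep '^uid: ' | cut -d ' ' -f 2 | sort > UIDS1
ldapsearch -x -LLL -h server2 -b "o=cbc.com" "(uid=*)" uid | grep '^uid: ' | cut -d ' ' -f 2 | sort > UIDS2
comm -12 UIDS1 UIDS2 > DuplicateUids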

Quote:
I got ids printed from UIDS1 but when I did ldapsearch for random of those were not duplicated ones. It didnt exist in the other DB.
Take a couple of these examples and search for them in the corresponding ldif file that you started with.
This is something that should be verified.

Last edited by jschiwal; 05-24-2008 at 05:07 PM. Reason: fixed another typo
 
Old 05-24-2008, 06:54 AM   #13
konsolebox
Senior Member
 
Registered: Oct 2005
Distribution: Gentoo, Slackware, LFS
Posts: 2,248
Blog Entries: 8

Rep: Reputation: 235
Quote:
Originally Posted by cmontr View Post
I ran the script as-is once... it finds uids. But later, when I do an ldapsearch for those ids, they exist in one db but not in the other. Is this really finding the duplicates? ...
Hmmm... it should have worked, I guess, unless sort and/or uniq have their limits... or perhaps you forgot to add the '-d' option to uniq?
 
Old 05-24-2008, 12:00 PM   #14
cmontr
Member
 
Registered: Sep 2007
Posts: 175

Original Poster
Rep: Reputation: 15
I appreciate all the assistance, but it looks like the server doesn't have gawk. I tried with nawk but it is bailing out as well.

nawk >>> ' <<<
nawk: bailing out at source line 6
usacpicnto01: oracle(vols1) /bu1/ora

Could you please show me where I am making a mistake? All I did was replace FILE1 and FILE2 with the actual file names.

This is a Solaris server (SPARC, SUNW,Sun-Fire-480R).

Thanks a million

Quote:
Originally Posted by colucix View Post
You asked for awk; I thought you knew how to run it. ...
 
Old 05-24-2008, 12:36 PM   #15
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1983
What have you tried exactly? I tested the following
Code:
/^uid/ { uids[$2]++
}
END { for ( i in uids )
         if ( uids[i] > 1 )
              print i
}
and it works with either awk or nawk on Solaris SPARC 5.8. Please update your profile with your distribution, or tell us what system the issue applies to, especially for a non-Linux OS. This will bring more pertinent answers.
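
If the inline version keeps bailing out, the shell may be mangling the quoting; a sketch of the file-based route instead (the script saved as dupes.awk, file names are placeholders):
Code:
nawk -f dupes.awk FILE1 FILE2 > duplicate.uids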
 
  

