LinuxQuestions.org
Old 05-22-2008, 07:46 PM   #1
cmontr
Member
 
Registered: Sep 2007
Posts: 175

Rep: Reputation: 15
Shell Script / Awk help for a challenge


I hope someone can help me with this... I have two large LDIF files (more than 10GB) that contain user information. I would like to match these two large files and print out the duplicate uids that appear in both.

In the LDIF files, a user entry looks like:

dn: uid=USER123,ou=cnet,o=cbc.com
uid: USER123
cbcdomain: cbc.net


I appreciate the help.
 
Old 05-22-2008, 10:49 PM   #2
konsolebox
Senior Member
 
Registered: Oct 2005
Distribution: Gentoo, Slackware, LFS
Posts: 2,248
Blog Entries: 8

Rep: Reputation: 235
my solution:
(a) extract the ids from the file
Code:
grep '^uid: ' FILE1 FILE2 | cut -f 2 -d ' ' > userids
(b) sort the file (this may require a great amount of time and memory)
Code:
sort userids > userids.sorted
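If temp space becomes the bottleneck, GNU sort can be pointed at a roomier temp directory (a sketch; it assumes GNU sort is available, and /var/tmp is just an example path):
Code:
sort -T /var/tmp userids > userids.sorted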
(c) extract the duplicate entries:
Code:
uniq -d userids.sorted > userids.dups
you can do everything in one shot, but it might require a great amount of memory, time and CPU usage (and probably also hang your PC):
Code:
grep '^uid: ' FILE1 FILE2 | cut -f 2 -d ' ' | sort | uniq -d > userids.dups
edit: btw, for the step-by-step process you can also compress the files to save disk space:
Code:
grep '^uid: ' FILE1 FILE2 | cut -f 2 -d ' ' | gzip -c -9 > userids.gz
zcat userids.gz | sort | gzip -c -9 > userids.sorted.gz
zcat userids.sorted.gz | uniq -d | gzip -c -9 > userids.dups.gz

Last edited by konsolebox; 05-22-2008 at 10:54 PM.
 
Old 05-23-2008, 10:48 AM   #3
cmontr
Member
 
Registered: Sep 2007
Posts: 175

Original Poster
Rep: Reputation: 15
I appreciate the assistance. I will run it and let you know later today.

Quote:
Originally Posted by konsolebox View Post
my solution: (a) extract the ids from the file ...
 
Old 05-23-2008, 03:31 PM   #4
cmontr
Member
 
Registered: Sep 2007
Posts: 175

Original Poster
Rep: Reputation: 15
Hi again,

I ran the script as-is once... it finds uids. But later, when I do an ldapsearch for those ids, they exist in one db but not in the other. Is this really finding the duplicates? Not sure if it is working at the moment... Could you please help?

Thanks once again

Quote:
Originally Posted by cmontr View Post
I appreciate the assistance. I will run it and let you know later today.
 
Old 05-23-2008, 04:55 PM   #5
jschiwal
LQ Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 682
I don't know if you will have memory issues. You could try
Code:
grep '^uid: ' FILE1 | uniq >UIDS1
grep '^uid: ' FILE2 | uniq >UIDS2
grep -f UIDS1 FILE2

or

Code:
grep '^uid: ' FILE1 | sort | uniq >UIDS1
grep '^uid: ' FILE2 | sort | uniq >UIDS2
comm -12 UIDS1 UIDS2 >DuplicateUids

If one of these LDIF files is for a live ldap server, you could try using ldapsearch to extract the "uid=*" values instead of using grep.
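
For example (a sketch assuming the OpenLDAP client tools; the base DN comes from the sample entry, and bind options will depend on your directory):
Code:
ldapsearch -x -LLL -b "o=cbc.com" "(uid=*)" uid | grep '^uid: ' | cut -d ' ' -f 2 > UIDS1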

If you want to do a lot of searches, you might consider migrating the LDIF files into a Sleepycat (Berkeley DB), MySQL or SQLite database and using SQL commands to report the info you want.
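
If you go the SQLite route, a minimal sketch (the file and table names are made up; the uid lists are extracted first with the grep from above):
Code:
grep '^uid: ' FILE1 | cut -d ' ' -f 2 > uids1.txt
grep '^uid: ' FILE2 | cut -d ' ' -f 2 > uids2.txt
sqlite3 uids.db <<'EOF'
CREATE TABLE u1 (uid TEXT);
CREATE TABLE u2 (uid TEXT);
.import uids1.txt u1
.import uids2.txt u2
-- uids present in both files
SELECT DISTINCT u1.uid FROM u1 JOIN u2 ON u1.uid = u2.uid;
EOF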

Last edited by jschiwal; 05-23-2008 at 04:56 PM.
 
Old 05-23-2008, 05:32 PM   #6
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1983
Using gawk you can do something like
Code:
/^uid/ { uids[$2]++
}

END { for ( i in uids )
         if ( uids[i] > 1 )
              print i
}
This assumes that there are no duplicate uids in the same file. Otherwise you can try this
Code:
/^uid/ { if ( ARGIND == 2 )
              uids[$2]++ 
         else
              uids[$2] = 1
}

END { for ( i in uids )
         if ( uids[i] > 1 )
              print i
}
where the array element is incremented only while reading the second file. The ARGIND built-in variable is a GNU awk extension. You can run this program using the -f option to gawk and passing the two files as arguments. While reading the first file, ARGIND will be set to 1; while processing the second file, ARGIND will be set to 2. For compatibility with other awks, use the FILENAME built-in variable instead; a sketch follows.
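
For example, a portable variant using FILENAME and ARGV (a sketch for nawk/gawk; it assumes the two files have distinct names, and it prints only uids present in both files):
Code:
# dupes.awk - mark uids seen in the first file, report those also in the second
/^uid/ {
    if (FILENAME == ARGV[1])
        uids[$2] = 1
    else if ($2 in uids)
        dups[$2] = 1
}
END { for (i in dups) print i }
Run it as: nawk -f dupes.awk file1 file2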
 
Old 05-23-2008, 08:31 PM   #7
Tinkster
Moderator
 
Registered: Apr 2002
Location: earth
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067
Blog Entries: 11

Rep: Reputation: 928
Quote:
But later, when I do an ldapsearch for those ids, they exist in one db but not in the other. Is this really finding the duplicates? Not sure if it is working at the moment... Could you please help?
It *should* work ... of course, if the uids are like
User123 and USER123, it won't pick them up as dupes ...
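
For instance, a case-insensitive variant of the earlier pipeline (a sketch; tr 'A-Z' 'a-z' is used for portability to older systems):
Code:
grep '^uid: ' FILE1 FILE2 | cut -f 2 -d ' ' | tr 'A-Z' 'a-z' | sort | uniq -d > userids.dups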

Last edited by Tinkster; 05-23-2008 at 08:32 PM.
 
Old 05-24-2008, 01:44 AM   #8
cmontr
Member
 
Registered: Sep 2007
Posts: 175

Original Poster
Rep: Reputation: 15
Thanks for the help, but I am still having an issue with this:
when I used "grep '^uid: ' FILE1 | uniq >UIDS1"

the following error was received:

grep: illegal option -- f
Usage: grep -hblcnsviw pattern file

and when I used "comm -12 UIDS1 UIDS2 >DuplicateUids"

I got ids printed from UIDS1, but when I did an ldapsearch for a random sample of those, they were not duplicates. They didn't exist in the other DB.

I am kind of stuck at the moment. Any idea would be very much appreciated.


Quote:
Originally Posted by jschiwal View Post
I don't know if you will have memory issues. You could try ...
 
Old 05-24-2008, 01:46 AM   #9
cmontr
Member
 
Registered: Sep 2007
Posts: 175

Original Poster
Rep: Reputation: 15
Could you give me an example of running the awk script against those two files, please?

Thanks again


Quote:
Originally Posted by colucix View Post
Using gawk you can do something like ...
 
Old 05-24-2008, 01:48 AM   #10
cmontr
Member
 
Registered: Sep 2007
Posts: 175

Original Poster
Rep: Reputation: 15
Mmm, I am not sure what you mean really; I'd appreciate it if you could explain. The purpose is to print the duplicates in those 2 large files. Then I should be able to run ldapsearch on those uids and see if they exist in both databases. Please let me know.


Quote:
Originally Posted by Tinkster View Post
It *should* work ... of course, if the uids are like
User123 and USER123, it won't pick them up as dupes ...
 
Old 05-24-2008, 03:04 AM   #11
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1983
Quote:
Originally Posted by cmontr View Post
Could you give me an example of running the awk script against those two files, please?

Thanks again
You asked for awk; I thought you knew how to run it. Anyway, either store the script in a file, e.g. dupes.awk, and launch it giving the two files as arguments
Code:
gawk -f dupes.awk file1 file2
or type it directly on the command line:
Code:
gawk '
/^uid/ { uids[$2]++
}
END { for ( i in uids )
         if ( uids[i] > 1 )
              print i
}' file1 file2
The first method is better, since you will probably need to make some modifications to the awk script. Mine is only an example of how gawk can be used to solve this problem, given the conditions I explained in my previous post. Cheers!
 
Old 05-24-2008, 03:42 AM   #12
jschiwal
LQ Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 682
Quote:
Originally Posted by cmontr View Post
Thanks for the help, but I am still having an issue with this ...
It looks like your grep doesn't support the "-f" option.
Before you used the "comm" program, did you use the grep lines that piped the output through sort? The two files need to be sorted. You could do something like
Code:
comm -12 <(grep '^uid:' file1 | sort) <(grep '^uid:' file2 | sort)
(note that the <( ) process substitution requires a shell such as bash or ksh93)

The comm command prints out 3 columns: the 1st has lines unique to file 1; the 2nd, lines unique to file 2; the 3rd, lines common to both.
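
A toy illustration (the files and their contents are made up):
Code:
$ printf 'a\nb\nc\n' > f1
$ printf 'b\nc\nd\n' > f2
$ comm -12 f1 f2     # suppress columns 1 and 2: only lines common to both remain
b
c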

I don't use ldap, so I'm at a bit of a disadvantage. Are the ldif files something that is imported and incorporated into whatever backend the ldap server uses? If so, are you sure that they are current?
Are you comparing the UIDs of users on two different ldap servers? If so, why not run ldapsearch on both to produce the two lists? That way you will start with smaller files to work with that you know are current.
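
A sketch of that workflow (the host names and base DN are assumptions, and OpenLDAP client flags are assumed):
Code:
ldapsearch -x -LLL -h server1 -b "o=cbc.com" "(uid=*)" uid | grep '^uid: ' | cut -d ' ' -f 2 | sort > UIDS1
ldapsearch -x -LLL -h server2 -b "o=cbc.com" "(uid=*)" uid | grep '^uid: ' | cut -d ' ' -f 2 | sort > UIDS2
comm -12 UIDS1 UIDS2 > DuplicateUids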

Quote:
I got ids printed from UIDS1 but when I did ldapsearch for random of those were not duplicated ones. It didnt exist in the other DB.
Take a couple of these examples and search for them in the corresponding ldif file that you started with.
This is something that should be verified.

Last edited by jschiwal; 05-24-2008 at 05:07 PM. Reason: fixed another typo
 
Old 05-24-2008, 06:54 AM   #13
konsolebox
Senior Member
 
Registered: Oct 2005
Distribution: Gentoo, Slackware, LFS
Posts: 2,248
Blog Entries: 8

Rep: Reputation: 235
Quote:
Originally Posted by cmontr View Post
I ran the script as-is once... it finds uids. But later, when I do an ldapsearch for those ids, they exist in one db but not in the other. Is this really finding the duplicates? ...
Hmmm... it should have worked, I guess, unless sort and/or uniq have their limits... or perhaps you forgot to add the '-d' option to uniq?
 
Old 05-24-2008, 12:00 PM   #14
cmontr
Member
 
Registered: Sep 2007
Posts: 175

Original Poster
Rep: Reputation: 15
I appreciate all the assistance, but it looks like the server doesn't have gawk. I tried with nawk but it is bailing out as well.

nawk >>> ' <<<
nawk: bailing out at source line 6
usacpicnto01: oracle(vols1) /bu1/ora

Could you please show me where I am making a mistake? All I did was replace FILE1 and FILE2 with the actual file names.

This is a Solaris server (SPARC, SUNW,Sun-Fire-480R).

Thanks a million

Quote:
Originally Posted by colucix View Post
You asked for awk; I thought you knew how to run it. ...
 
Old 05-24-2008, 12:36 PM   #15
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1983
What have you tried exactly? I tested the following
Code:
/^uid/ { uids[$2]++
}
END { for ( i in uids )
         if ( uids[i] > 1 )
              print i
}
and it works with either awk or nawk on Solaris SPARC 5.8. Please update your profile with your distribution, or tell us what system the issue applies to, especially for a non-Linux OS. This will bring more pertinent answers.
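
If the inline version keeps bailing out, the shell may be mangling the quoting; a sketch of the file-based route instead (the script saved as dupes.awk, file names are placeholders):
Code:
nawk -f dupes.awk FILE1 FILE2 > duplicate.uids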
 
  

