LinuxQuestions.org
Old 05-24-2008, 02:35 PM   #16
cmontr
Member
 
Registered: Sep 2007
Posts: 175

Original Poster
Rep: Reputation: 15

Hi again,

I am actually running this:


awk -f dupes1.awk FILE1.ldif FILE2.ldif > DUPS

It has been running for over an hour. I checked the process and it is doing something, but I am not sure what. The output file DUPS is still at 0 bytes.


Quote:
Originally Posted by colucix
What have you tried exactly? I tested the following:
Code:
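# Count how many times each uid value appears across all input files
/^uid/ { uids[$2]++ }

# After all input has been read, print the uids seen more than once
END {
    for (i in uids)
        if (uids[i] > 1)
            print i
}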
and it works with both awk and nawk on Solaris 5.8 (SPARC). Please update your profile with your distribution, or tell us which system the issue applies to, especially for a non-Linux OS. This will lead to more pertinent answers.
 
Old 05-24-2008, 02:55 PM   #17
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1978
Of course: the output is only written in the END block, after both files have been read completely. As noted in previous posts, processing such large files requires a lot of memory and a lot of CPU time. Just be patient...
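
A minimal sketch of one way to get results before the end of the run, assuming each uid line has the form "uid: VALUE" as in LDIF: instead of counting everything and printing in END, print a uid the moment it is seen for the second time. The function name stream_dupes and the sample data are made up for illustration. Memory use still grows with the number of distinct uids, and output redirected to a file may be buffered, so DUPS can stay at 0 bytes for a while even with this approach.

```shell
# Hypothetical helper: print each duplicated uid once, as soon as it repeats.
stream_dupes() {
    # ++count[$2] == 2 is true exactly once per uid: on its second appearance.
    awk '$1 == "uid:" && ++count[$2] == 2 { print $2 }' "$@"
}

# Tiny demo with made-up data
tmp=$(mktemp -d)
printf 'uid: USER123\nuid: USER124\n' > "$tmp/a.ldif"
printf 'uid: USER125\nuid: USER123\n' > "$tmp/b.ldif"
stream_dupes "$tmp/a.ldif" "$tmp/b.ldif"    # prints USER123
rm -rf "$tmp"
```

With the real files this would be called as: stream_dupes FILE1.ldif FILE2.ldif > DUPS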
 
Old 05-24-2008, 02:59 PM   #18
cmontr
Member
 
Registered: Sep 2007
Posts: 175

Original Poster
Rep: Reputation: 15
OK, great, thanks again. I will keep you posted on the results.

Quote:
Originally Posted by colucix
Of course: the output is only written in the END block, after both files have been read completely. As noted in previous posts, processing such large files requires a lot of memory and a lot of CPU time. Just be patient...
 
Old 05-25-2008, 04:36 AM   #19
angrybanana
Member
 
Registered: Oct 2003
Distribution: Archlinux
Posts: 147

Rep: Reputation: 21
Assumes files don't contain duplicates within themselves.
Code:
awk '$1 =="uid:" && !($2 in seen) {seen[$2];print}' file1 file2
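
For anyone unsure what the one-liner does, here is the same logic spread over several lines with comments. This is only a sketch: the function name first_seen_uids and the sample data are made up for illustration. Note that it prints each uid line the first time its value appears, so the result is the merged list with repeats dropped rather than a list of the duplicates themselves.

```shell
# Hypothetical wrapper around the one-liner, expanded for readability.
first_seen_uids() {
    awk '
        $1 == "uid:" {              # only look at "uid: VALUE" lines
            if (!($2 in seen)) {    # value not met before in any file
                seen[$2]            # referencing the element creates it
                print               # print the whole uid line
            }
        }
    ' "$@"
}

# Tiny demo with made-up data
tmp=$(mktemp -d)
printf 'uid: USER123\nuid: USER124\n' > "$tmp/file1"
printf 'uid: USER123\nuid: USER125\n' > "$tmp/file2"
first_seen_uids "$tmp/file1" "$tmp/file2"
# prints:
# uid: USER123
# uid: USER124
# uid: USER125
rm -rf "$tmp"
```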
 
Old 05-29-2008, 01:12 PM   #20
cmontr
Member
 
Registered: Sep 2007
Posts: 175

Original Poster
Rep: Reputation: 15
It looks like I was not able to process this successfully. I think the uid values in the two large files may differ in letter case, and the comparison is case sensitive. The awk script also did not work, so I am kind of stuck. If you can help, I'd really appreciate it.

Quote:
Originally Posted by konsolebox
my solution:
(a) extract the ids from the file
Code:
grep '^uid: ' FILE1 FILE2 | cut -f 2 -d ' ' > userids
(b) sort the file (this may require a great amount of time and memory)
Code:
sort userids > userids.sorted
(c) extract the duplicate entries:
Code:
uniq -d userids.sorted > userids.dups
You can do everything in one shot, but it might require a great amount of memory, time and CPU usage (and could even hang your PC):
Code:
grep '^uid: ' FILE1 FILE2 | cut -f 2 -d ' ' | sort | uniq -d > userids.dups
Edit: by the way, for the step-by-step process you can also compress the intermediate files to save disk space:
Code:
grep '^uid: ' FILE1 FILE2 | cut -f 2 -d ' ' | gzip -c -9 > userids.gz
zcat userids.gz | sort | gzip -c -9 > userids.sorted.gz
zcat userids.sorted.gz | uniq -d | gzip -c -9 > userids.dups.gz
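
Since letter case was raised as a possible cause of missed matches, here is a hedged variant of the same pipeline that folds every uid to lower case before sorting. It is a sketch, not something posted in the thread: the function name dup_uids_nocase and the sample data are made up, and LC_ALL=C is an optional assumption that often makes sort faster by comparing plain bytes.

```shell
# Hypothetical helper: list uids that occur more than once, ignoring case.
dup_uids_nocase() {
    cat "$@" |
        grep '^uid: ' |                  # keep only the uid lines
        cut -d ' ' -f 2 |                # take the value after "uid: "
        tr '[:upper:]' '[:lower:]' |     # fold case so USER1 equals user1
        LC_ALL=C sort |                  # bring equal values together
        uniq -d                          # print one copy of each repeated value
}

# Tiny demo with made-up data
tmp=$(mktemp -d)
printf 'uid: User123\nuid: USER124\n' > "$tmp/FILE1"
printf 'uid: USER123\nuid: user999\n' > "$tmp/FILE2"
dup_uids_nocase "$tmp/FILE1" "$tmp/FILE2"    # prints user123
rm -rf "$tmp"
```

The output is in lower case, so the original spelling of each uid is lost; that is the price of folding before the sort.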
 
Old 05-29-2008, 01:12 PM   #21
cmontr
Member
 
Registered: Sep 2007
Posts: 175

Original Poster
Rep: Reputation: 15
Sorry, but I am unable to run this. Can you please explain further?

Quote:
Originally Posted by angrybanana
Assumes files don't contain duplicates within themselves.
Code:
awk '$1 =="uid:" && !($2 in seen) {seen[$2];print}' file1 file2
 
Old 05-29-2008, 01:14 PM   #22
cmontr
Member
 
Registered: Sep 2007
Posts: 175

Original Poster
Rep: Reputation: 15
Hi again, I have tried this a few times on the two large files I have. I waited a long time, but it did not work. There is enough memory and CPU capacity. Please help. Thanks again.


Quote:
Originally Posted by cmontr
Hi again,

I am actually running this:


awk -f dupes1.awk FILE1.ldif FILE2.ldif > DUPS

It has been running for over an hour. I checked the process and it is doing something, but I am not sure what. The output file DUPS is still at 0 bytes.
 
Old 05-29-2008, 06:34 PM   #23
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Centos 6.10, Centos 7.3
Posts: 17,537

Rep: Reputation: 2420
This works for me:

Code:
ldif1.txt
---------

dn: uid=USER123,ou=cnet,o=cbc.com
uid: USER123
cbcdomain: cbc.net


dn: uid=USER124,ou=cnet,o=cbc.com
uid: USER124
cbcdomain: cbc.net


ldif2.txt
---------
dn: uid=USER123,ou=cnet,o=cbc.com
uid: USER123
cbcdomain: cbc.net


dn: uid=USER125,ou=cnet,o=cbc.com
uid: USER125
cbcdomain: cbc.net

dn: uid=USER124,ou=cnet,o=cbc.com
uid: USER124
cbcdomain: cbc.net

dn: uid=USER126,ou=cnet,o=cbc.com
uid: USER126
cbcdomain: cbc.net


Perl
----

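#!/usr/bin/perl -w
use strict;

my (
    $f1, $f1_rec, %uids, $uid, $f2, $f2_rec
    );

# The two files to compare are given on the command line
die "Usage: $0 FILE1 FILE2\n" unless @ARGV == 2;
$f1 = $ARGV[0];
$f2 = $ARGV[1];

# Pass 1: remember every uid found in the first file
open(F1,"<", "$f1") or die "Unable to open $f1: $!\n";
while ( defined ( $f1_rec = <F1> ) )
{
    chomp($f1_rec);   # drop the newline so it is not stored with the uid
    if( $f1_rec =~ /^uid: \S/ )
    {
        # Get uid & store it
        $uid = (split(/ /, $f1_rec))[1];
        $uids{$uid} = 1;
    }
}
close(F1) or die "Unable to close $f1: $!\n";

# Pass 2: report every uid in the second file that was seen in the first
open(F2,"<", "$f2") or die "Unable to open $f2: $!\n";
while ( defined ( $f2_rec = <F2> ) )
{
    chomp($f2_rec);   # same newline handling as in pass 1
    if( $f2_rec =~ /^uid: \S/ )
    {
        # Get uid & check it
        $uid = (split(/ /, $f2_rec))[1];
        if( exists($uids{$uid}) )
        {
            print "Dupe $uid\n";
        }
    }
}
close(F2) or die "Unable to close $f2: $!\n";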
It only stores the uids from the first file in memory, so it should cope with large files.
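
If even one file's worth of uids is too much to hold in memory, a sort-based comparison is another option. The sketch below is an assumption-laden illustration rather than part of the script above: the function name common_uids is made up, and it relies on sort implementations typically spilling to temporary files instead of keeping everything in RAM. It extracts the uids of each file, sorts them with duplicates removed, and lets comm -12 print only the values present in both lists.

```shell
# Hypothetical helper: print the uids that appear in both LDIF files.
common_uids() {
    work=$(mktemp -d) || return 1
    # One sorted, de-duplicated uid list per input file
    awk '$1 == "uid:" { print $2 }' "$1" | LC_ALL=C sort -u > "$work/a"
    awk '$1 == "uid:" { print $2 }' "$2" | LC_ALL=C sort -u > "$work/b"
    # -12 suppresses the lines unique to either list, leaving the common ones
    LC_ALL=C comm -12 "$work/a" "$work/b"
    rm -rf "$work"
}

# Demo using the uids from the sample data above
tmp=$(mktemp -d)
printf 'uid: USER123\nuid: USER124\n' > "$tmp/ldif1.txt"
printf 'uid: USER123\nuid: USER125\nuid: USER124\nuid: USER126\n' > "$tmp/ldif2.txt"
common_uids "$tmp/ldif1.txt" "$tmp/ldif2.txt"
# prints:
# USER123
# USER124
rm -rf "$tmp"
```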
 
Old 06-02-2008, 11:56 AM   #24
cmontr
Member
 
Registered: Sep 2007
Posts: 175

Original Poster
Rep: Reputation: 15
Could you please show me how to modify this Perl script? If the first file is named AA and the second file is named BB, where do I replace the F's or f's? Sorry, I am a little confused. Thanks once again.

Quote:
Originally Posted by chrism01
This works for me:




It only stores the uids from the first file in memory, so it should cope with large files.
 
Old 06-02-2008, 04:20 PM   #25
osor
HCL Maintainer
 
Registered: Jan 2006
Distribution: (H)LFS, Gentoo
Posts: 2,450

Rep: Reputation: 76
Quote:
Originally Posted by cmontr
Could you please show me how to modify this Perl script? If the first file is named AA and the second file is named BB, where do I replace the F's or f's?
The filenames are just the first and second arguments passed to the script.
 
Old 06-02-2008, 06:32 PM   #26
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Centos 6.10, Centos 7.3
Posts: 17,537

Rep: Reputation: 2420
As osor says, they are just variables that hold the filenames from the command line, e.g.

./myperl.pl AA BB

Sounds like you need to bookmark and read this: http://perldoc.perl.org/5.8.8/
 
Old 06-02-2008, 09:57 PM   #27
cmontr
Member
 
Registered: Sep 2007
Posts: 175

Original Poster
Rep: Reputation: 15
Hi Chris - Thanks for the link as well. I am running the Perl script on these two really large files, each of which holds about 15 GB of data. It has been doing something, but I am not sure what is happening. Hopefully it won't hurt the server. So far the output file shows 0 bytes. It has been about 2 hours so far. Any ideas would be very much appreciated, as usual.

Quote:
Originally Posted by chrism01
As osor says, they are just variables that hold the filenames from the command line, e.g.

./myperl.pl AA BB

Sounds like you need to bookmark and read this: http://perldoc.perl.org/5.8.8/
 
Old 06-02-2008, 10:49 PM   #28
cmontr
Member
 
Registered: Sep 2007
Posts: 175

Original Poster
Rep: Reputation: 15
It printed its output after approximately 4 hours, but there were far fewer results than I was expecting. Do you think this has anything to do with case sensitivity? What I mean is: does the script compare the IDs case-sensitively?



Quote:
Originally Posted by cmontr
Hi Chris - Thanks for the link as well. I am running the Perl script on these two really large files, each of which holds about 15 GB of data. It has been doing something, but I am not sure what is happening. Hopefully it won't hurt the server. So far the output file shows 0 bytes. It has been about 2 hours so far. Any ideas would be very much appreciated, as usual.
 
Old 06-03-2008, 12:04 AM   #29
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Centos 6.10, Centos 7.3
Posts: 17,537

Rep: Reputation: 2420
Perl, like Unix (and Unix-based tools), is naturally case-sensitive. If you want to ignore case:

Where you store the uids in the 1st loop, change the line to

# Make uid lowercase, i.e. lc($var)
$uids{lc($uid)} = 1;


When we do the compare in the 2nd loop, change it to

# Do a lowercase compare
if( exists($uids{lc($uid)}) )

FYI, if you prefer uppercase, the function is uc()

http://perldoc.perl.org/index-functions.html
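
For comparison, roughly the same two-pass idea with case folding can be sketched in awk, where tolower() plays the role of lc(). This is an illustration under assumptions, not a replacement for the Perl script: the function name dupes_nocase and the sample data are made up, it needs an awk that provides tolower() (nawk, gawk and POSIX awk do; some very old awk implementations may not), and the FNR == NR test assumes the first file is not empty.

```shell
# Hypothetical helper: report uids of file 2 that also occur in file 1,
# comparing without regard to case.
dupes_nocase() {
    awk '
        $1 != "uid:"  { next }                  # ignore every non-uid line
                      { key = tolower($2) }     # fold the uid to lower case
        FNR == NR     { seen[key]; next }       # first file: remember the uid
        (key in seen) { print "Dupe " $2 }      # second file: report a match
    ' "$1" "$2"
}

# Tiny demo with made-up data
tmp=$(mktemp -d)
printf 'uid: user123\nuid: user124\n' > "$tmp/AA"
printf 'dn: uid=USER123,ou=cnet,o=cbc.com\nuid: USER123\nuid: USER125\n' > "$tmp/BB"
dupes_nocase "$tmp/AA" "$tmp/BB"    # prints Dupe USER123
rm -rf "$tmp"
```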
 
Old 06-03-2008, 08:59 AM   #30
AnanthaP
Member
 
Registered: Jul 2004
Location: Chennai, India
Distribution: UBUNTU 5.10 since Jul-18,2006 on Intel 820 DC
Posts: 875

Rep: Reputation: 208
What's the "challenge" portion about? Why should it be in awk?

End
 
  

