[SOLVED] Trying to compare two files and output the result to a third file.
Now, these two files are similar but they are not ordered. My goal is to take the first column from the first file (i.e. "asdfkjsdlfkjsdf") and read through the second file to find that exact same string. Once there is a match, take both second columns (i.e. the numbers) and output the tag together with both numbers.
I think you need to know more about what these files are.
So I'm trying to sort out some DNA data I have; the first stage is to work out which sequences appear to be common to both files, and how many repeats or "reads" occur.
The first field is the sequence (or tag in my code), the second is the
number of reads.
Data files 1 and 2 will therefore both look like this:
CAGCTCACTGCA 123
ACGTGCCCCCTT 847
etc... etc...
So if the tag currently being read from file 1 also exists in file 2, I grab both read numbers and output them to a new file (i.e. file 3), like so:
CAGCTCACTGCA 123 765
I've been writing this code and I have no idea why it doesn't work:
The usual includes and file opening...
This is the bit I can't get to work:
Code:
// Read file1
while( !feof(file) ){
    // Get the tag sequence and the read number
    fscanf(file, "%s", tag_1);
    // Validate the tag is a sequence and not the reads
    if( tag_1[0]=='A' || tag_1[0]=='C' || tag_1[0]=='G' || tag_1[0]=='T' ){
        // Read file2
        fscanf(file2, "%s", tag_2);
        // Validate that tag_2 is a sequence and not the reads
        if( tag_2[0]=='A' || tag_2[0]=='C' || tag_2[0]=='G' || tag_2[0]=='T' ){
            // Now compare tag_1 with tag_2 to see if they match
            if( strcmp(tag_1, tag_2)==0 ){
                printf("match!: %s", tag_1);
            }
        }
    }
}
Note this is by no means finished; I'm working in stages, but this is as far as I've got.
Maybe I'm not understanding the problem, but here is my suggestion (your second post is the one that confused me).
Assuming each sequence within a file is unique (it may not, and likely will not, be unique across files), read in each file (file A and file B) separately and store each one in its own hashmap (hash A and hash B) or other similar data type, with key = sequence and value = reads. Then, for each key in hash A, look up that key in hash B. If it exists, write the key along with its values from both hash A and hash B to a third file.
If you also need the reverse direction (i.e. find occurrences of sequences from hash B in hash A), then instead of writing to a third file as above, use a third hash, hash C. Store the value that combines the two values at this key, i.e. hashC@key = hashA@key + " " + hashB@key. Then do the same process for the keys of hash B, and check whether the combined value (hashA@key + " " + hashB@key) already exists in hash C: if it does, you have already determined the combined value and don't have to add it; if it doesn't, it means that hashB@key occurs only once, in file B. At the end, write hash C out to a file if you need it in a file.
Again, maybe I misunderstood the problem completely.
Your files are lists of key-value pairs, which is a perfect fit for the use of associative arrays, or in Perl-speak, hashes. You can read each file into a hash, where the first element of each record is the key, and the second element is the value. Repeat for the second file, using a separate hash. At this point you have some different possibilities with slightly different logic, depending on whether you want the intersection or the union (or possibly some other relation) of the two files. Iterating over the keys of one or each hash, you can print the values of one or both hashes.
Code:
#!/usr/bin/perl -w
#
# LQSweetChris.pl
#
# Usage: LQSweetChris.pl file1.dat file2.dat
#
use strict;
my %file1;
my %file2;
open(FILE1,$ARGV[0]) || die "Cannot open $ARGV[0] : $!\n";
open(FILE2,$ARGV[1]) || die "Cannot open $ARGV[1] : $!\n";
while( my $rec = <FILE1> ){
    my($key,$value) = split /\s+/, $rec;
    $file1{$key} = $value;
}
while( my $rec = <FILE2> ){
    my($key,$value) = split /\s+/, $rec;
    $file2{$key} = $value;
}
foreach my $key ( sort keys %file1 ){
    #
    # Print intersection of files...
    #
    if( exists $file2{ $key } ){
        print "$key ",$file1{$key}, " ", $file2{$key},"\n";
    }
}
exit 0;
Sorry, I missed the requirement that it be done in C. No problem, though; you just have to write code to implement the functionality of associative arrays. Perhaps there is an existing library/API for doing this.
It might be easier to do the file pre-processing with Perl, and then read the processed files with C, which would be fairly trivial.
--- rod.
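One existing option along these lines is the POSIX hash-table API in <search.h>, namely hcreate()/hsearch(). A minimal sketch, assuming the two-column "tag reads" format above, with the table size and tag-length limit picked arbitrarily and error handling kept to a minimum:
Code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <search.h>     /* hcreate(), hsearch(): POSIX hash table */

#define MAX_TAG 64      /* assumed upper bound on tag length */

int main(int argc, char *argv[])
{
    char tag[MAX_TAG];
    long reads;

    if (argc != 3) {
        fprintf(stderr, "usage: %s file1 file2\n", argv[0]);
        return 1;
    }
    FILE *f1 = fopen(argv[1], "r");
    FILE *f2 = fopen(argv[2], "r");
    if (!f1 || !f2) { perror("fopen"); return 1; }

    /* Index file 2 first: key = tag, value = read count kept as a string. */
    hcreate(1000000);                       /* table size is a guess */
    while (fscanf(f2, "%63s %ld", tag, &reads) == 2) {
        ENTRY e;
        char buf[32];
        snprintf(buf, sizeof buf, "%ld", reads);
        e.key  = strdup(tag);               /* hsearch keeps the pointer, so copy */
        e.data = strdup(buf);
        hsearch(e, ENTER);
    }

    /* Stream file 1 and print the intersection: tag, reads1, reads2. */
    while (fscanf(f1, "%63s %ld", tag, &reads) == 2) {
        ENTRY query = { .key = tag, .data = NULL };
        ENTRY *hit = hsearch(query, FIND);
        if (hit)
            printf("%s\t%ld\t%s\n", tag, reads, (char *)hit->data);
    }
    return 0;
}
Note that hsearch() keeps a single, fixed-size table per process (it cannot grow after hcreate()), so for very large tag sets a resizable hash library, or the C++ containers mentioned further down the thread, would be a safer choice.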
Thanks theNbomr and aspire1, you have both solved my problem. It seems that the Perl script is much faster at doing this than the C program; I wonder why that would be?
The code in full for both versions of the program:
Perl Version (By theNbomr):
Code:
#!/usr/bin/perl -w
# Usage: compare file1.txt file2.txt
use strict;
my %file1;
my %file2;
open(FILE1,$ARGV[0]) || die "Cannot open $ARGV[0] : $!\n";
open(FILE2,$ARGV[1]) || die "Cannot open $ARGV[1] : $!\n";
while( my $rec = <FILE1> ){
    my($key,$value) = split /\s+/, $rec;
    $file1{$key} = $value;
}
while( my $rec = <FILE2> ){
    my($key,$value) = split /\s+/, $rec;
    $file2{$key} = $value;
}
foreach my $key ( sort keys %file1 ){
    # Print intersection of files...
    if( exists $file2{ $key } ){
        print "$key\t ",$file1{$key}, "\t", $file2{$key},"\n";
    }
}
exit 0;
it seems that the perl script is much faster at doing this than the C program. I wonder why that would be?
Because the C program uses the entirely lame "N squared" approach of rewinding the second file for every line of the first file.
For serious amounts of data, you need some kind of associative container or priority queue. C doesn't have that built-in, you need to either find it or write it. Most modern programming languages (C++ for example) have associative containers and priority queues, etc. built in.
If the output should be in the same sequence as the first input file, you would want to swallow or index the second input file with some kind of associative container, then use that container as you process each line of the first file.
If the output has no reason to be in the same sequence as the first input file, it's even faster to swallow or index both input files into priority queues (which can be more than twice as fast as a similar-capacity associative container), then match the lines as they are pulled from the two priority queues.
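For illustration only (this is not code from the thread), plain C can approximate that sorted-stream idea without a real priority queue: load both files into arrays, sort each with qsort(), and do a single merge pass, printing a line whenever the tags match. The buffer sizes and helper names below are arbitrary:
Code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_TAG 64                 /* assumed upper bound on tag length */

struct rec { char tag[MAX_TAG]; long reads; };

static int cmp_rec(const void *a, const void *b)
{
    return strcmp(((const struct rec *)a)->tag,
                  ((const struct rec *)b)->tag);
}

/* Read one "tag reads" file into a growable array and sort it by tag. */
static struct rec *load_sorted(const char *path, size_t *n)
{
    FILE *f = fopen(path, "r");
    size_t cap = 1024, used = 0;
    struct rec *v = malloc(cap * sizeof *v);

    if (!f || !v) { perror(path); exit(1); }
    while (fscanf(f, "%63s %ld", v[used].tag, &v[used].reads) == 2) {
        if (++used == cap) {
            cap *= 2;
            v = realloc(v, cap * sizeof *v);
            if (!v) { perror("realloc"); exit(1); }
        }
    }
    fclose(f);
    qsort(v, used, sizeof *v, cmp_rec);
    *n = used;
    return v;
}

int main(int argc, char *argv[])
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s file1 file2\n", argv[0]);
        return 1;
    }
    size_t n1, n2, i = 0, j = 0;
    struct rec *a = load_sorted(argv[1], &n1);
    struct rec *b = load_sorted(argv[2], &n2);

    /* One merge pass: advance whichever side sorts lower, print on a match. */
    while (i < n1 && j < n2) {
        int c = strcmp(a[i].tag, b[j].tag);
        if (c == 0) {
            printf("%s\t%ld\t%ld\n", a[i].tag, a[i].reads, b[j].reads);
            i++, j++;
        } else if (c < 0) {
            i++;
        } else {
            j++;
        }
    }
    return 0;
}
The two loads plus the sorts are O(n log n) and the merge is linear, which is what makes this so much cheaper than rewinding file 2 for every line of file 1.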
All that can be done in C. In C++ most of the work is already done for you with almost no performance compromises vs. the best C code you could write. In Perl even more of the work is already done for you, with greater performance compromises, but as you just discovered: a fundamentally bad algorithm in a fast language is still slower than a better algorithm in a slower language.
a fundamentally bad algorithm in a fast language is still slower than a better algorithm in a slower language.
I wanted to add that I hate being on that side of this question.
I prefer the fast language (C or C++) and I prefer writing the better algorithm even if it is more coding because you chose the "fast" language. If you will be running this code on enough data, at some point the extra effort of choosing C or C++ over perl (and still choosing the better algorithm) would pay back.
I've written priority queues and associative containers in C (because if you don't write it you can't use it). I've written priority queues and associative containers in C++, because I can do a slightly better job than the people who wrote the standard template library ones and I've used them in places where even slightly better matters.
But for almost anyone else, I really question the choice of C for something where C++ is clearly better (and Perl is easier without being significantly worse).
Edit: Also I skimmed the thread too quickly earlier and missed the detail that a hashed container is good enough. I imagined a need for a sequenced associative container that wasn't described at all. So depending on exactly what you want (intersection vs. union vs. more complex handling of mismatch) priority queues might not be as good as the hashed container. Of course all smart container approaches are far better than the N squared approach as soon as you have any significant amount of data. Any comparison between various smart container approaches or between languages with smart containers is a subtle difference compared to their difference from the N squared approach.
Yup, O(n^2) ain't a good measure by anyone's books. 'Twas a "quick answer", and the approach taken should depend upon the work put in for the expected results. That should be taken for granted.