LinuxQuestions.org
Download your favorite Linux distribution at LQ ISO.
Go Back   LinuxQuestions.org > Forums > Non-*NIX Forums > Programming
User Name
Password
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Notices

Reply
 
Search this Thread
Old 07-29-2009, 03:33 PM   #1
chutsu
Member
 
Registered: Nov 2003
Location: UK
Distribution: Debian Lenny
Posts: 255

Rep: Reputation: 31
Trying to compare two files and output it into a third file.


Ok. So basically I have two files in the form of:

File1:
asdfkjsdlfkjsdf 1232
afasdfklsdjfksf 12312
sdflsadsdffdsfs 32323

File2:
asdfkjsdlfkjsdf 1232
afasdfklsdjfksf 12312
sdflsadsdffdsfs 32323

Now these two files are similar and they are not ordered, my quest is
to get the first column from the first file (ie "asdfkjsdlfkjsdf") and
read the second file to find the same exact phrase.

Once you get a match obtain both second columns (ie The numbers) and
output as follows:

File 3:
asdfdsfdsfdssa 1232 133
asdfdsfdsfdssa 1232 133
asdfdsfdsfdssa 1232 133

I need to code this in C, as I might need to do some complex numerical analysis on it.

Can someone help me, I have no idea how to approach this.
Thanks
Chris
 
Old 07-29-2009, 06:09 PM   #2
elprawn
Member
 
Registered: Feb 2005
Distribution: Gentoo 2008
Posts: 138

Rep: Reputation: 15
Are the two input files the same length?
 
Old 07-29-2009, 06:21 PM   #3
chutsu
Member
 
Registered: Nov 2003
Location: UK
Distribution: Debian Lenny
Posts: 255

Original Poster
Rep: Reputation: 31
No the two are not the same length

I think you need know more about what these files are.
So I'm trying to sort out some DNA data I got, the first stage is to
compare which sequences appear to be common, and how many repeats or
"reads" occur.
The first field is the sequence (or tag in my code), the second is the
number of reads.
The data file 1 and 2 will therefore look like:
CAGCTCACTGCA 123
ACGTGCCCCCTT 847
etc... etc...

so if the tag on file 1 currently reading exists on file 2, I will grab both the read numbers and output it to a new file (ie file 3), like so

CAGCTCACTGCA 123 765


I've been writing this code and I have no idea why it doesn't work:

the usual inclues and opening file...
This is the bit I can't get it to work
Code:
// Read file1
while(!feof(file)){

    // Get the tag sequence and the read number
    fscanf(file, "%s", tag_1);

    // Validate the tag is a sequence and not the reads
    if(tag_1[0]=='A'||tag_1[0]=='C'||tag_1[0]=='G'||tag_1[0]=='T'){

        // Read file2
        fscanf(file2, "%s", tag_2);

        // Validate the tag2 is a sequence and not the reads
        if(tag_2[0]=='A'||tag_2[0]=='C'||tag_2[0]=='G'||tag_2[0]=='T'){

            // Now compare tag1 with tag2 to see if they match
            if(strcmp(tag_1, tag_2)==0){
                printf("match!: %s", tag_1);
            }
        }
    }

}
note this is by no means finish, I'm working in stages, but this is as
far as I got.

Last edited by chutsu; 07-29-2009 at 06:22 PM.
 
Old 07-29-2009, 06:41 PM   #4
nadroj
Senior Member
 
Registered: Jan 2005
Location: Canada
Distribution: ubuntu
Posts: 2,539

Rep: Reputation: 58
maybe im not understanding the problem, but here is my suggestion (your second post is the one that confused me).

assuming each sequence in a file is unique (may not, and will likely not, be unique across files though), read in each file (file A and file B) separately and store each one into its own hashmap (hash A and hash B) or other similar datatype, key=sequence, value=reads. then for each key in hash A, obtain the value of this key in hash B. if it exists, write this key with the value of this key at both hash A and hash B into a third file.

if you also need to do the same (ie find occurrences of sequences from hash B in hash A), then instead of writing to a third file as above use a third hash, hash C. save the value that combines the two values of the two hashes at this key in hash C, ie: hashC@key = hashA@key + " " + hashB@key. then do the same process (check hash B for values in hash A), and see if the value (hashA@key + " " + hashB@key) exists, if it does then you already determined the 'new'/combined value and dont have to add it. if it doesnt exist, it means that hashB@key occurs once, in file B. at the end, write hash C to a file if you need it in a file.

again, maybe i misunderstood the problem completely.

Last edited by nadroj; 07-29-2009 at 06:43 PM.
 
Old 07-30-2009, 08:38 AM   #5
theNbomr
LQ 5k Club
 
Registered: Aug 2005
Distribution: OpenSuse, Fedora, Redhat, Debian
Posts: 5,395
Blog Entries: 2

Rep: Reputation: 903Reputation: 903Reputation: 903Reputation: 903Reputation: 903Reputation: 903Reputation: 903Reputation: 903
Your files are lists of key-value pairs, which is a perfect fit for the use of associative arrays, or in Perl-speak, hashes. You can read each file into a hash, where the first element of each record is the key, and the second element is the value. Repeat for the second file, using a separate hash. At this point you have some different possibilities with slightly different logic, depending on whether you want the intersection or the union (or possibly other relation) of the two files. Iterating over the the keys of one or each hash, you can print the values of one or both hashes.
Code:
#! /bin/perl -w
#
#   LQSweetChris.pl
#
#   Usage: LQSweetChris.pl file1.dat file2.dat
#
use strict;
my %file1;
my %file2;

    open(FILE1,$ARGV[0]) || die "Cannot open $ARGV[0] : $!\n";
    open(FILE2,$ARGV[1]) || die "Cannot open $ARGV[1] : $!\n";
    
    while( my $rec = <FILE1> ){
        my($key,$value) = split /\s+/, $rec;
        $file1{$key} = $value;
    }

    while( my $rec = <FILE2> ){
        my($key,$value) = split /\s+/, $rec;
        $file2{$key} = $value;
    }
    
    foreach my $key ( sort keys %file1 ){
        #
        #   Print intersection of files...
        #
        if( exists $file2{ $key } ){
            print "$key ",$file1{$key}, " ", $file2{$key},"\n";
        }
    }
    exit 0;

Last edited by theNbomr; 07-30-2009 at 08:41 AM.
 
Old 07-30-2009, 11:33 AM   #6
chutsu
Member
 
Registered: Nov 2003
Location: UK
Distribution: Debian Lenny
Posts: 255

Original Poster
Rep: Reputation: 31
Wow.. thanks..

I know for every language they have their advantages, but I was wondering how would you achieve the same with C though.
Just out of curiosity

Thanks a million
Chris
 
Old 07-30-2009, 11:58 AM   #7
theNbomr
LQ 5k Club
 
Registered: Aug 2005
Distribution: OpenSuse, Fedora, Redhat, Debian
Posts: 5,395
Blog Entries: 2

Rep: Reputation: 903Reputation: 903Reputation: 903Reputation: 903Reputation: 903Reputation: 903Reputation: 903Reputation: 903
Sorry, I missed the requirement that it be done in C. No problem, though; you just have to write code to implement the functionality of associative arrays. Perhaps there is an existing library/API for doing this.
It might be easier to do the file pre-processing with Perl, and then read the processed files with C, which would be fairly trivial.
--- rod.
 
Old 07-30-2009, 03:55 PM   #8
aspire1
Member
 
Registered: Dec 2008
Distribution: Ubuntu
Posts: 61

Rep: Reputation: 22
As a quick attempt and if I've understood correctly, this should do what you want, all be it in probably not the most efficient manner:

Code:
while( 1 )
{
	fscanf(file1, "%s %d", line1, &num1);
		
	if( feof( file1) ){
		break;
	}		
		
	while( 1 )
	{
			
	   fscanf(file2, "%s %d", line2, &num2);
		
	   if( feof( file2) )
	   {
		break;
	   }
			
			
	  //Write to file3 here??
	  if( strcmp( line1, line2) == 0)
	  {
	   	 printf("%s %d %d\n", line1, num1, num2);
	   }
				
	 }
		
	rewind( file2);
}
 
Old 07-30-2009, 04:15 PM   #9
chutsu
Member
 
Registered: Nov 2003
Location: UK
Distribution: Debian Lenny
Posts: 255

Original Poster
Rep: Reputation: 31
Thanks theNbomr and aspire1, you have both solved my problems, it seems that the perl script is much faster at doing this than the C program. I wonder why that would be?

the code in full for both versions of the program are:

Perl Version (By theNbomr):
Code:
#!/usr/bin/perl -w
# Usage: compare file1.txt file2.txt
use strict;
my %file1;
my %file2;

    open(FILE1,$ARGV[0]) || die "Cannot open $ARGV[0] : $!\n";
    open(FILE2,$ARGV[1]) || die "Cannot open $ARGV[1] : $!\n";
    
    while( my $rec = <FILE1> ){
        my($key,$value) = split /\s+/, $rec;
        $file1{$key} = $value;
    }

    while( my $rec = <FILE2> ){
        my($key,$value) = split /\s+/, $rec;
        $file2{$key} = $value;
    }
    
    foreach my $key ( sort keys %file1 ){
        #   Print intersection of files...
        if( exists $file2{ $key } ){
            print "$key\t ",$file1{$key}, "\t", $file2{$key},"\n";
        }
    }
    exit 0;
The C Version (By aspire1):
Code:
#include <stdio.h>
#include <string.h>

struct data{
	char tag[21];
	int reads;
};

int main(int argc, char * argv[])
{
	
	char *file_path="../../data/clustered_tags/clustered_tags_DB2.txt";
	char *file_path2="../../data/clustered_tags/clustered_tags_SC3.txt";
	struct data file1;
	struct data file2;
	FILE *datafile1;	
	FILE *datafile2;

	
	// Opening file
	datafile1 = fopen( file_path, "r" );
	datafile2 = fopen( file_path2, "r" ); 
		 
	if(datafile1==NULL || datafile2==NULL) {
		printf("Error: can't open file.\n");
		return 1;
	}
	else {
	    printf("File opened!\n");
	}
	
	while(1){
		fscanf(datafile1, "%s %d", file1.tag, &file1.reads);
		if(feof(datafile1))
			break;
		while( 1 ){
			fscanf(datafile2, "%s %d", file2.tag, &file2.reads);
	   		if(feof(datafile2))
				break;
			//Write to file3 here??
			if( strcmp( file1.tag, file2.tag) == 0)
				printf("%s\t%d\t%d\n", file1.tag, file1.reads, file2.reads);
		}
		rewind( datafile2);
	}
		
    fclose(datafile1);
    fclose(datafile2);
    return 0;
}

Last edited by chutsu; 07-30-2009 at 04:28 PM.
 
Old 07-30-2009, 04:34 PM   #10
johnsfine
Guru
 
Registered: Dec 2007
Distribution: Centos
Posts: 5,076

Rep: Reputation: 1110Reputation: 1110Reputation: 1110Reputation: 1110Reputation: 1110Reputation: 1110Reputation: 1110Reputation: 1110Reputation: 1110
Quote:
Originally Posted by SweetChris View Post
it seems that the perl script is much faster at doing this than the C program. I wonder why that would be?
Because the C program uses the entirely lame "N squared" approach of rewinding the second file for every line of the first file.

For serious amounts of data, you need some kind of associative container or priority queue. C doesn't have that built-in, you need to either find it or write it. Most modern programming languages (C++ for example) have associative containers and priority queues, etc. built in.

If the output should be in the same sequence as the first input file, you would want to swallow or index the second input file with some kind of associative container, then use that container as you process each line of the first file.

If the output has no reason to be in the same sequence as the first input file, it's even faster to swallow or index both input files to priority queues (which can be more than twice as fast as a similar capacity associative containers) then match the lines as they are pulled from the two priority queues.

All that can be done in C. In C++ most of the work is already done for you with almost no performance compromises vs. the best C code you could write. In Perl even more of the work is already done for you, with greater performance compromises, but as you just discovered: a fundamentally bad algorithm in a fast language is still slower than a better algorithm in a slower language.

Last edited by johnsfine; 07-30-2009 at 04:37 PM.
 
Old 07-30-2009, 04:49 PM   #11
johnsfine
Guru
 
Registered: Dec 2007
Distribution: Centos
Posts: 5,076

Rep: Reputation: 1110Reputation: 1110Reputation: 1110Reputation: 1110Reputation: 1110Reputation: 1110Reputation: 1110Reputation: 1110Reputation: 1110
Quote:
Originally Posted by johnsfine View Post
a fundamentally bad algorithm in a fast language is still slower than a better algorithm in a slower language.
I wanted to add that I hate being on that side of this question.

I prefer the fast language (C or C++) and I prefer writing the better algorithm even if it is more coding because you chose the "fast" language. If you will be running this code on enough data, at some point the extra effort of choosing C or C++ over perl (and still choosing the better algorithm) would pay back.

I've written priority queues and associative containers in C (because if you don't write it you can't use it). I've written priority queues and associative containers in C++, because I can do a slightly better job than the people who wrote the standard template library ones and I've used them in places where even slightly better matters.

But for almost anyone else, I really question the choice of C for something where C++ is clearly better (and Perl is easier without being significantly worse).

Edit: Also I skimmed the thread too quickly earlier and missed the detail that a hashed container is good enough. I imagined a need for a sequenced associative container that wasn't described at all. So depending on exactly what you want (intersection vs. union vs. more complex handling of mismatch) priority queues might not be as good as the hashed container. Of course all smart container approaches are far better than the N squared approach as soon as you have any significant amount of data. Any comparison between various smart container approaches or between languages with smart containers is a subtle difference compared to their difference from the N squared approach.

Last edited by johnsfine; 07-30-2009 at 05:00 PM.
 
Old 07-31-2009, 06:55 PM   #12
aspire1
Member
 
Registered: Dec 2008
Distribution: Ubuntu
Posts: 61

Rep: Reputation: 22
Yup, O(n2) ain't a good measure by anyone's books. 'Twas a "quick answer" and the approach taken should be dependent upon work put in for expected results. That should be taken for granted
 
  


Reply


Thread Tools Search this Thread
Search this Thread:

Advanced Search

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
How to compare files in 2 dirs and write results on a file fpp666 Programming 9 07-15-2009 12:27 PM
Bash: Insert output of a commando into an array and compare the values tengblad Programming 2 04-07-2009 03:36 PM
Combine output of multiple files in one CSV file say_hi_ravi Programming 4 07-17-2008 03:04 AM
Split a large file and get the names of output files using Perl Sherlock Programming 25 02-02-2007 12:43 PM
How to compare the output of a command with a string? daYz Programming 2 12-31-2006 09:36 AM


All times are GMT -5. The time now is 07:27 PM.

Main Menu
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
identi.ca: @linuxquestions
Facebook: linuxquestions Google+: linuxquestions
Open Source Consulting | Domain Registration