[SOLVED] Trying to compare two files and output the result to a third file.
Now, these two files are similar but they are not ordered. My goal is to take the first column from the first file (i.e. "asdfkjsdlfkjsdf") and read through the second file to find that exact same string. Once there is a match, take both second columns (i.e. the numbers) and output the tag together with both numbers.
I think you need to know more about what these files are.
So I'm trying to sort out some DNA data I have; the first stage is to work out which sequences appear to be common to both files, and how many repeats or "reads" occur.
The first field is the sequence (or tag in my code), the second is the
number of reads.
Data files 1 and 2 will therefore both look like this:
CAGCTCACTGCA 123
ACGTGCCCCCTT 847
etc... etc...
So if the tag currently being read from file 1 also exists in file 2, I grab both read numbers and output them to a new file (i.e. file 3), like so:
CAGCTCACTGCA 123 765
I've been writing this code and I have no idea why it doesn't work:
The usual includes and file opening...
This is the bit I can't get to work:
Code:
// Read file1
while( !feof(file) ){
    // Get the tag sequence and the read number
    fscanf(file, "%s", tag_1);
    // Validate the tag is a sequence and not the reads
    if( tag_1[0]=='A' || tag_1[0]=='C' || tag_1[0]=='G' || tag_1[0]=='T' ){
        // Read file2
        fscanf(file2, "%s", tag_2);
        // Validate that tag_2 is a sequence and not the reads
        if( tag_2[0]=='A' || tag_2[0]=='C' || tag_2[0]=='G' || tag_2[0]=='T' ){
            // Now compare tag_1 with tag_2 to see if they match
            if( strcmp(tag_1, tag_2)==0 ){
                printf("match!: %s", tag_1);
            }
        }
    }
}
Note this is by no means finished; I'm working in stages, but this is as far as I've got.
Maybe I'm not understanding the problem, but here is my suggestion (your second post is the one that confused me).
Assuming each sequence within a file is unique (it may not, and likely will not, be unique across files), read in each file (file A and file B) separately and store each one in its own hashmap (hash A and hash B) or other similar data type, with key = sequence and value = reads. Then, for each key in hash A, look up that key in hash B. If it exists, write the key along with its values from both hash A and hash B to a third file.
If you also need the reverse direction (i.e. find occurrences of sequences from hash B in hash A), then instead of writing to a third file as above, use a third hash, hash C. Store the value that combines the two values at this key, i.e. hashC@key = hashA@key + " " + hashB@key. Then do the same process for the keys of hash B, and check whether the combined value (hashA@key + " " + hashB@key) already exists in hash C: if it does, you have already determined the combined value and don't have to add it; if it doesn't, it means that hashB@key occurs only once, in file B. At the end, write hash C out to a file if you need it in a file.
Again, maybe I misunderstood the problem completely.
Your files are lists of key-value pairs, which is a perfect fit for the use of associative arrays, or in Perl-speak, hashes. You can read each file into a hash, where the first element of each record is the key, and the second element is the value. Repeat for the second file, using a separate hash. At this point you have some different possibilities with slightly different logic, depending on whether you want the intersection or the union (or possibly some other relation) of the two files. Iterating over the keys of one or each hash, you can print the values of one or both hashes.
Code:
#!/usr/bin/perl -w
#
# LQSweetChris.pl
#
# Usage: LQSweetChris.pl file1.dat file2.dat
#
use strict;
my %file1;
my %file2;
open(FILE1,$ARGV[0]) || die "Cannot open $ARGV[0] : $!\n";
open(FILE2,$ARGV[1]) || die "Cannot open $ARGV[1] : $!\n";
while( my $rec = <FILE1> ){
    my($key,$value) = split /\s+/, $rec;
    $file1{$key} = $value;
}
while( my $rec = <FILE2> ){
    my($key,$value) = split /\s+/, $rec;
    $file2{$key} = $value;
}
foreach my $key ( sort keys %file1 ){
    #
    # Print intersection of files...
    #
    if( exists $file2{ $key } ){
        print "$key ",$file1{$key}, " ", $file2{$key},"\n";
    }
}
exit 0;
Sorry, I missed the requirement that it be done in C. No problem, though; you just have to write code to implement the functionality of associative arrays. Perhaps there is an existing library/API for doing this.
It might be easier to do the file pre-processing with Perl, and then read the processed files with C, which would be fairly trivial.
--- rod.
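One existing option along these lines is the POSIX hash-table API in <search.h>, namely hcreate()/hsearch(). A minimal sketch, assuming the two-column "tag reads" format above, with the table size and tag-length limit picked arbitrarily and error handling kept to a minimum:
Code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <search.h>     /* hcreate(), hsearch(): POSIX hash table */

#define MAX_TAG 64      /* assumed upper bound on tag length */

int main(int argc, char *argv[])
{
    char tag[MAX_TAG];
    long reads;

    if (argc != 3) {
        fprintf(stderr, "usage: %s file1 file2\n", argv[0]);
        return 1;
    }
    FILE *f1 = fopen(argv[1], "r");
    FILE *f2 = fopen(argv[2], "r");
    if (!f1 || !f2) { perror("fopen"); return 1; }

    /* Index file 2 first: key = tag, value = read count kept as a string. */
    hcreate(1000000);                       /* table size is a guess */
    while (fscanf(f2, "%63s %ld", tag, &reads) == 2) {
        ENTRY e;
        char buf[32];
        snprintf(buf, sizeof buf, "%ld", reads);
        e.key  = strdup(tag);               /* hsearch keeps the pointer, so copy */
        e.data = strdup(buf);
        hsearch(e, ENTER);
    }

    /* Stream file 1 and print the intersection: tag, reads1, reads2. */
    while (fscanf(f1, "%63s %ld", tag, &reads) == 2) {
        ENTRY query = { .key = tag, .data = NULL };
        ENTRY *hit = hsearch(query, FIND);
        if (hit)
            printf("%s\t%ld\t%s\n", tag, reads, (char *)hit->data);
    }
    return 0;
}
Note that hsearch() keeps a single, fixed-size table per process (it cannot grow after hcreate()), so for very large tag sets a resizable hash library, or the C++ containers mentioned further down the thread, would be a safer choice.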
Thanks theNbomr and aspire1, you have both solved my problem. It seems that the Perl script is much faster at doing this than the C program; I wonder why that would be?
The code in full for both versions of the program:
Perl Version (By theNbomr):
Code:
#!/usr/bin/perl -w
# Usage: compare file1.txt file2.txt
use strict;
my %file1;
my %file2;
open(FILE1,$ARGV[0]) || die "Cannot open $ARGV[0] : $!\n";
open(FILE2,$ARGV[1]) || die "Cannot open $ARGV[1] : $!\n";
while( my $rec = <FILE1> ){
    my($key,$value) = split /\s+/, $rec;
    $file1{$key} = $value;
}
while( my $rec = <FILE2> ){
    my($key,$value) = split /\s+/, $rec;
    $file2{$key} = $value;
}
foreach my $key ( sort keys %file1 ){
    # Print intersection of files...
    if( exists $file2{ $key } ){
        print "$key\t ",$file1{$key}, "\t", $file2{$key},"\n";
    }
}
exit 0;
it seems that the perl script is much faster at doing this than the C program. I wonder why that would be?
Because the C program uses the entirely lame "N squared" approach of rewinding the second file for every line of the first file.
For serious amounts of data, you need some kind of associative container or priority queue. C doesn't have that built-in, you need to either find it or write it. Most modern programming languages (C++ for example) have associative containers and priority queues, etc. built in.
If the output should be in the same sequence as the first input file, you would want to swallow or index the second input file with some kind of associative container, then use that container as you process each line of the first file.
If the output has no reason to be in the same sequence as the first input file, it's even faster to swallow or index both input files into priority queues (which can be more than twice as fast as a similar-capacity associative container), then match the lines as they are pulled from the two priority queues.
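For illustration only (this is not code from the thread), plain C can approximate that sorted-stream idea without a real priority queue: load both files into arrays, sort each with qsort(), and do a single merge pass, printing a line whenever the tags match. The buffer sizes and helper names below are arbitrary:
Code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_TAG 64                 /* assumed upper bound on tag length */

struct rec { char tag[MAX_TAG]; long reads; };

static int cmp_rec(const void *a, const void *b)
{
    return strcmp(((const struct rec *)a)->tag,
                  ((const struct rec *)b)->tag);
}

/* Read one "tag reads" file into a growable array and sort it by tag. */
static struct rec *load_sorted(const char *path, size_t *n)
{
    FILE *f = fopen(path, "r");
    size_t cap = 1024, used = 0;
    struct rec *v = malloc(cap * sizeof *v);

    if (!f || !v) { perror(path); exit(1); }
    while (fscanf(f, "%63s %ld", v[used].tag, &v[used].reads) == 2) {
        if (++used == cap) {
            cap *= 2;
            v = realloc(v, cap * sizeof *v);
            if (!v) { perror("realloc"); exit(1); }
        }
    }
    fclose(f);
    qsort(v, used, sizeof *v, cmp_rec);
    *n = used;
    return v;
}

int main(int argc, char *argv[])
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s file1 file2\n", argv[0]);
        return 1;
    }
    size_t n1, n2, i = 0, j = 0;
    struct rec *a = load_sorted(argv[1], &n1);
    struct rec *b = load_sorted(argv[2], &n2);

    /* One merge pass: advance whichever side sorts lower, print on a match. */
    while (i < n1 && j < n2) {
        int c = strcmp(a[i].tag, b[j].tag);
        if (c == 0) {
            printf("%s\t%ld\t%ld\n", a[i].tag, a[i].reads, b[j].reads);
            i++, j++;
        } else if (c < 0) {
            i++;
        } else {
            j++;
        }
    }
    return 0;
}
The two loads plus the sorts are O(n log n) and the merge is linear, which is what makes this so much cheaper than rewinding file 2 for every line of file 1.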
All that can be done in C. In C++ most of the work is already done for you with almost no performance compromises vs. the best C code you could write. In Perl even more of the work is already done for you, with greater performance compromises, but as you just discovered: a fundamentally bad algorithm in a fast language is still slower than a better algorithm in a slower language.
a fundamentally bad algorithm in a fast language is still slower than a better algorithm in a slower language.
I wanted to add that I hate being on that side of this question.
I prefer the fast language (C or C++) and I prefer writing the better algorithm even if it is more coding because you chose the "fast" language. If you will be running this code on enough data, at some point the extra effort of choosing C or C++ over perl (and still choosing the better algorithm) would pay back.
I've written priority queues and associative containers in C (because if you don't write it you can't use it). I've written priority queues and associative containers in C++, because I can do a slightly better job than the people who wrote the standard template library ones and I've used them in places where even slightly better matters.
But for almost anyone else, I really question the choice of C for something where C++ is clearly better (and Perl is easier without being significantly worse).
Edit: Also I skimmed the thread too quickly earlier and missed the detail that a hashed container is good enough. I imagined a need for a sequenced associative container that wasn't described at all. So depending on exactly what you want (intersection vs. union vs. more complex handling of mismatch) priority queues might not be as good as the hashed container. Of course all smart container approaches are far better than the N squared approach as soon as you have any significant amount of data. Any comparison between various smart container approaches or between languages with smart containers is a subtle difference compared to their difference from the N squared approach.
Yup, O(n^2) ain't a good measure by anyone's books. 'Twas a "quick answer", and the approach taken should depend upon the work put in for the expected results. That should be taken for granted.