LinuxQuestions.org (/questions/)
-   Programming (http://www.linuxquestions.org/questions/programming-9/)
-   -   CSV Merge (http://www.linuxquestions.org/questions/programming-9/csv-merge-606320/)

CorvusE 12-12-2007 11:14 AM

CSV Merge
 
I've found solutions to similar situations online, but I'm going crazy trying to get them to work correctly for my particular need. I've been pursuing an awk solution, since awk seems to be the simplest tool for the job, but I'm open to other approaches as well.

I have two CSV files. One we'll call the master file. It looks like this:

Category,Keyword,Rel01,Rel02,rel03,blank,notes
category01,keyword01,relword01,relword02,relword03,,
category01,keyword02,relword04,relword05,relword06,,
category01,keyword03,relword06,relword02,relword03,,
category02,keyword04,relword02,relword05,relword07,,
category02,keyword01,relword01,relword03,relword08,,


Note: The only unique key is the combination of field 1 and 2. In other words, keywords are not unique, except within a category. There are thousands of records in this file.

I have several files containing subsets of that data, which have been edited by my client. The returned file might look a little like this:

category01,keyword01,relword09,relword10,relword11,,belongs elsewhere?
category01,keyword02,relword04,relword05,relword12,,
category01,keyword07,relword06,relword02,relword03,,removed plural

I need to parse each line of the edited file, find the matching category/keyword combo in the master file, and overwrite it (true, I don't technically need to overwrite if the complete record is the same, but a hammer will work fine for this job). If the combo doesn't exist (because the client changed the keyword), I need to append the line to the master file.

I can handle the sorting of everything once the merge takes place (and the deletion of obsolete words from the master file), so we don't need to worry about that.

Trying to find a solution has been fun (albeit seriously maddening), but it's taking me away from the primary work of adding data to the master file.

Can anyone throw me a lifeline or point me to where a lifeline already exists? Thanks in advance!

matthewg42 12-12-2007 12:35 PM

I wonder if join could be used for this? AFAIK it is limited to a single key field, but I would love to know if there is a way to specify more than one.
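
Something like this might get partway there, though I haven't tested it: glue fields 1 and 2 into a composite key in front of each record and join on that. The file names here are made up, and it assumes neither a tab nor "|" appears in the data.
Code:

export LC_ALL=C    # keep sort and join collation in agreement
# Prefix each record with a composite key built from fields 1 and 2
awk -F, '{ print $1 "|" $2 "\t" $0 }' updated | sort > updated.keyed
awk -F, '{ print $1 "|" $2 "\t" $0 }' master  | sort > master.keyed
# -o 2.2 prints the whole updated record for every key found in both
# files; handling the unmatched lines as well needs -a/-e juggling,
# which is where this stops being simpler than awk or Perl.
join -t "$(printf '\t')" -o 2.2 master.keyed updated.keyed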

Assuming a simpler join or sort based solution cannot be found, it is not a difficult task for the likes of Awk or Perl. Here's my first go at a Perl solution:
Code:

#!/usr/bin/perl
#
# save this in a file called process_returns.pl, chmod 755 that file, and then
# invoke it like this:
#
#  process_returns.pl mfile rfile1 rfile2 ... > new_mfile
#
# where:
# - mfile is the old master file
# - rfile1, rfile2 etc are returns files
# - new_mfile is the new master file (with updates in it)
#
# Author:  M N Gates
# License: Public Domain
#

use strict;
use warnings;


# Save the master file name for later...
my $master_filename = shift || usage(1);

# just in case the user requested command line help...
if ($master_filename =~ /^(-h|--help)$/) { usage(0); }

# This is where we store the updated records
# the key will be the concatenation of field1 and field2
my %updated;

# Read records from the remaining files specified on the command line.
# These are the "returned" files, i.e. the ones with updated records.

while(<>) {
        my($key);
        # This regular expression extracts the first two fields including
        # the comma which separates them
        if ( /^([^,]*,[^,]*),/ ) {
                $key = $1;
        }
        else {
                warn "WARNING: record looks weird, SKIPPING: $_";
                next;
        }

        # this will happen if the key is used more than once in the return files.
        if ( defined($updated{$key}) ) {
                warn "WARNING: duplicate definition of key: $key\n";
        }

        $updated{$key} = $_;
}

# Now read the master file.  For each input line see if the
# key is defined in the updated hash; if it is, print the updated
# record and mark it as used, else just spew the input line.
open(MASTER, "<", $master_filename) || die "cannot open $master_filename for reading: $!\n";
while(<MASTER>) {
        if ( /^([^,]*,[^,]*),/ && defined($updated{$1}) ) {
                # delete returns the removed value, so this prints the
                # updated record and marks its key as consumed in one go
                print delete $updated{$1};
        }
        else {
                print;
        }
}
close(MASTER);

# Anything left in %updated had a key that never appeared in the master
# file (e.g. the client changed a keyword), so append those records now.
print $updated{$_} for sort keys %updated;

sub usage {
        my $r = shift || 0;
        print "Usage:\n    $0 masterfile [rfile1 [rfile2] [...]]\n";
        exit($r);
}
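
PS: if GNU sort is available, the whole overwrite-and-append job can probably be done in one line, at the cost of the output coming back sorted by key (which should be fine, since the sorting is handled separately anyway). With the master in "master" and a single returns file in "updated":
Code:

# -t, -k1,2 compares only the category and keyword fields; -u keeps just
# the first line of each equal-key run, and -s makes "first" mean first
# in input order, so records from the updated file win over the master.
sort -t, -s -k1,2 -u updated master > new_master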


CorvusE 12-12-2007 01:06 PM

I'm subjecting this to some pretty rigorous testing, but at first glance it appears to be exactly what I need.

Were I so equipped, I would offer to bear your children. As it is, I'll have to settle for offering my deepest thanks.

radoulov 12-12-2007 03:37 PM

With Awk:

Code:

awk 'NR == FNR {
        # First file (the returned one): remember each record,
        # indexed by the category/keyword pair.
        rfi[$1,$2] = FNR
        rfd[FNR] = $0
        next
}
{
        # Master file: print the updated record if one exists for this
        # key, otherwise the original line; the bare reference to
        # mf[$1,$2] just records that the key was seen in the master.
        print (($1,$2) in rfi) ? rfd[rfi[$1,$2]] : $0
        mf[$1,$2]
}
END {
        # Append updated records whose key never appeared in the master.
        for (e in rfi)
                if (!(e in mf))
                        print rfd[rfi[e]]
}' FS="," updated master > new_master


I suppose you should update from one returned file
at a time; otherwise you have to decide how to handle
duplicate keys, should they exist.

If every returned file contains a different subset
(so duplicate keys are not possible):

Code:

awk 'FILENAME != "master" {
        rfi[$1,$2] = NR
        rfd[NR] = $0
        next
}
{
        print (($1,$2) in rfi) ? rfd[rfi[$1,$2]] : $0
        mf[$1,$2]
} END {
for (e in rfi)
        if (!(e in mf))
                print rfd[rfi[e]]
}' FS="," updated1 updated2 ... updatedn master > new_master


