LinuxQuestions.org
Old 12-12-2007, 11:14 AM   #1
CorvusE
LQ Newbie
 
Registered: Jul 2005
Posts: 2

Rep: Reputation: 0
CSV Merge


I've found solutions to similar situations online, but I'm going crazy trying to get them to work correctly for my particular need. I've been pursuing an awk solution as it seems to be the simplest tool for my needs, but I'm open to other solutions as well.

I have two CSV files. One we'll call the master file. It looks like this:

Category,Keyword,Rel01,Rel02,rel03,blank,notes
category01,keyword01,relword01,relword02,relword03,,
category01,keyword02,relword04,relword05,relword06,,
category01,keyword03,relword06,relword02,relword03,,
category02,keyword04,relword02,relword05,relword07,,
category02,keyword01,relword01,relword03,relword08,,


Note: The only unique key is the combination of field 1 and 2. In other words, keywords are not unique, except within a category. There are thousands of records in this file.

I have several files containing subsets of that data, which have been edited by my client. The returned file might look a little like this:

category01,keyword01,relword09,relword10,relword11,,belongs elsewhere?
category01,keyword02,relword04,relword05,relword12,,
category01,keyword07,relword06,relword02,relword03,,removed plural

I need to parse each line of the edited file, find the matching category/keyword combination in the master file, and overwrite it (strictly speaking I don't need to overwrite when the record is unchanged, but a hammer will work fine for this job). If no match exists (because the client changed the keyword), I need to append the line to the master file.

I can handle the sorting of everything once the merge takes place (and the deletion of obsolete words from the master file), so we don't need to worry about that.

Trying to find a solution has been fun (albeit seriously maddening), but it's taking me away from the primary work of adding data to the master file.

Can anyone throw me a lifeline or point me to where a lifeline already exists? Thanks in advance!
 
Old 12-12-2007, 12:35 PM   #2
matthewg42
Senior Member
 
Registered: Oct 2003
Location: UK
Distribution: Kubuntu 12.10 (using awesome wm though)
Posts: 3,530

Rep: Reputation: 62
I wonder if join could be used for this? AFAIK it is limited to a single key field, but I would love to know if there is a way to specify more than one.
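join(1) does only take one key field per file, but a common workaround is to synthesize a composite key by gluing the two fields together before joining. A minimal sketch, with hypothetical file names (a.csv, b.csv), a hypothetical mkkey helper, and a `|` glue character assumed not to occur in the data:

```shell
#!/bin/sh
# Sketch: a two-field join via a synthetic composite key (hypothetical data).
cat > a.csv <<'EOF'
category01,keyword01,old1
category01,keyword02,old2
EOF
cat > b.csv <<'EOF'
category01,keyword01,new1
EOF

# Prepend "field1|field2" as a single key column, sorted as join(1) requires.
mkkey() { awk -F, '{ print $1 "|" $2 "," $0 }' "$1" | sort -t, -k1,1; }
mkkey a.csv > a.keyed
mkkey b.csv > b.keyed

# Join on the synthetic key (field 1, the default), then keep only the
# fields of the second file.  A full merge would also need -a 1 to pass
# unmatched rows of the first file through.
join -t, a.keyed b.keyed | cut -d, -f5- > matched.csv
cat matched.csv
```

This only demonstrates the composite-key trick, not the whole overwrite-or-append merge, which still needs the unmatched-row handling noted in the comment.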

Assuming a simpler join or sort based solution cannot be found, it is not a difficult task for the likes of Awk or Perl. Here's my first go at a Perl solution:
Code:
#!/usr/bin/perl
#
# save this in a file called process_returns.pl, chmod 755 that file, and then
# invoke it like this:
#
#   process_returns.pl mfile rfile1 rfile2 ... > new_mfile
#
# where:
# - mfile is the old master file
# - rfile1, rfile2 etc are returns files
# - new_mfile is the new master file (with updates in it)
#
# Author:  M N Gates
# License: Public Domain
#

use strict;
use warnings;


# Save the master file name for later...
my $master_filename = shift || usage(1);

# just in case the user requested command line help...
if ($master_filename =~ /^(-h|--help)$/) { usage(0); }

# This is where we store the updated records
# the key will be the concatenation of field1 and field2
my %updated;

# Read records from the remaining files specified on the command line.
# These are the "returned" files - with updated records.

while(<>) {
        my($key);
        # This regular expression extracts the first two fields including
        # the comma which separates them
        if ( /^([^,]*,[^,]*),/ ) {
                $key = $1;
        }
        else {
                warn "WARNING: record looks weird, SKIPPING: $_";
                next;
        }

        # this will happen if the key is used more than once in the return files.
        if ( defined($updated{$key}) ) {
                warn "WARNING: duplicate definition of key: $key\n";
        }

        $updated{$key} = $_;
}

# Now read the master file.  For each input line see if the
# key is defined in the updated hash, and use the data from there
# if it is, else just spew the input line.
open(MASTER, "<$master_filename") || die "cannot open $master_filename for reading: $!\n";
while(<MASTER>) {
        my($key);
        if ( /^([^,]*,[^,]*),/ ) {
                if ( defined($updated{$1}) ) {
                        print $updated{$1};
                }
                else {
                        print;
                }
        }
        else {
                print;
        }
}

sub usage {
        my $r = shift || 0;
        print "Usage:\n    $0 masterfile [rfile1 [rfile2] [...]]\n";
        exit($r);
}
 
Old 12-12-2007, 01:06 PM   #3
CorvusE
LQ Newbie
 
Registered: Jul 2005
Posts: 2

Original Poster
Rep: Reputation: 0
I'm subjecting this to some pretty rigorous testing, but at first glance it appears to be exactly what I need.

Were I so equipped, I would offer to bear your children. As it is, I'll have to settle for offering my deepest thanks.
 
Old 12-12-2007, 03:37 PM   #4
radoulov
Member
 
Registered: Apr 2007
Location: Milano, Italia/Варна, България
Distribution: Ubuntu, Open SUSE
Posts: 212

Rep: Reputation: 35
With Awk:

Code:
awk 'NR == FNR {
	rfi[$1,$2] = FNR
	rfd[FNR] = $0
	next
}
{
	print (($1,$2) in rfi) ? rfd[rfi[$1,$2]] : $0
	mf[$1,$2]
}
END {
	for (e in rfi)
		if (!(e in mf))
			print rfd[rfi[e]]
}' FS="," updated master > new_master

I suppose you should update from one returned file at a time; otherwise you have to decide how to handle duplicate keys, if they exist.

If every returned file contains a different subset (so duplicate keys are not possible):

Code:
awk 'FILENAME != "master" {
	rfi[$1,$2] = NR
	rfd[NR] = $0
	next
}
{
	print (($1,$2) in rfi) ? rfd[rfi[$1,$2]] : $0
	mf[$1,$2]
}
END {
	for (e in rfi)
		if (!(e in mf))
			print rfd[rfi[e]]
}' FS="," updated1 updated2 ... updatedn master > new_master
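The first variant can be exercised on the sample records from the original post. A sketch with hypothetical file names (master.csv, updated.csv), the CSV header row omitted for brevity, and the ternary print restated as an explicit if/else:

```shell
#!/bin/sh
# Demo: merge the edited subset back into the master, keyed on fields 1+2.
cat > master.csv <<'EOF'
category01,keyword01,relword01,relword02,relword03,,
category01,keyword02,relword04,relword05,relword06,,
category01,keyword03,relword06,relword02,relword03,,
category02,keyword04,relword02,relword05,relword07,,
category02,keyword01,relword01,relword03,relword08,,
EOF
cat > updated.csv <<'EOF'
category01,keyword01,relword09,relword10,relword11,,belongs elsewhere?
category01,keyword02,relword04,relword05,relword12,,
category01,keyword07,relword06,relword02,relword03,,removed plural
EOF

awk -F, '
NR == FNR {                 # first file: the edited ("returned") records
	rfi[$1,$2] = FNR    # which edited record holds this key
	rfd[FNR] = $0
	next
}
{                           # second file: the master
	if (($1,$2) in rfi)
		print rfd[rfi[$1,$2]]   # overwrite with the edited record
	else
		print                   # keep the master record as-is
	mf[$1,$2]                   # note that this key exists in the master
}
END {                       # append edited records whose key is new
	for (e in rfi)
		if (!(e in mf))
			print rfd[rfi[e]]
}' updated.csv master.csv > new_master.csv

cat new_master.csv
```

The two matching rows (keyword01, keyword02 in category01) come out overwritten in place, the untouched master rows pass through unchanged, and the renamed-keyword row (keyword07) is appended at the end.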

Last edited by radoulov; 12-12-2007 at 04:21 PM. Reason: modified ...
 
  

