ProgrammingThis forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.
Notices
Welcome to LinuxQuestions.org, a friendly and active Linux Community.
You are currently viewing LQ as a guest. By joining our community you will have the ability to post topics, receive our newsletter, use the advanced search, subscribe to threads and access many other special features. Registration is quick, simple and absolutely free. Join our community today!
Note that registered members see fewer ads, and ContentLink is completely disabled once you log in.
I've found solutions to similar situations online, but I'm going crazy trying to get them to work correctly for my particular need. I've been pursuing an awk solution as it seems to be the simplest tool for my needs, but I'm open to other solutions as well.
I have two CSV files. One we'll call the master file. It looks like this:
Note: The only unique key is the combination of field 1 and 2. In other words, keywords are not unique, except within a category. There are thousands of records in this file.
I have several files containing subsets of that data, which have been edited by my client. The returned file might look a little like this:
I need parse each line of the edited file, find the matching category/keyword combo in the master file and overwrite it (true, I don't technically need to overwrite if the complete record is the same, but a hammer will work fine for this job). If it doesn't exist (because the client changed the keyword), I need to append the line to the master file.
I can handle the sorting of everything once the merge takes place (and the deletion of obsolete words from the master file), so we don't need to worry about that.
Trying to find a solution has been fun (albeit seriously maddening), but it's taking me away from the primary work of adding data to the master file.
Can anyone throw me a lifeline or point me to where a lifeline already exists? Thanks in advance!
I wonder if join could be used for this? AFAIK it is limited to having a single keyfield, but I would love to know if there is a way to specify more than one key field.
Assuming a simpler join or sort based solution cannot be found, it is not a difficult task for the likes of Awk or Perl. Here's my first go at a Perl solution:
Code:
#!/usr/bin/perl
#
# save this in a file called process_returns.pl, chmod 755 that file, and then
# invoke it like this:
#
# process_returns.pl mfile rfile1 rfile2 ... > new_mfile
#
# where:
# - mfile is the old master file
# - rfile1, rfile2 etc are returns files
# - new_mfile is the new master file (with updates in it)
#
# Author: M N Gates
# License: Public Domain
#
use strict;
use warnings;
# Save the master file name for later...
my $master_filename = shift || usage(1);
# just in case the user requested command line help...
if ($master_filename =~ /^(-h|--help)$/) { usage(0); }
# This is where we store the updated records
# the key will be the concatenation of field1 and field2
my %updated;
# Read records from remaining files specified on comman line
# These are the "returned" files - with updated records.
while(<>) {
my($key);
# This regular expression extracts the first two fields including
# the comma which separates them
if ( /^([^,]*,[^,]*),/ ) {
$key = $1;
}
else {
warn "WARNING: record looks weird, SKIPPING: $_";
next;
}
# this will happen if the key is used more than once in the return files.
if ( defined($updated{$key}) ) {
warn "WARNING: duplicate definition of key: $key\n";
}
$updated{$key} = $_;
}
# Now read the master file. For each input line see if the
# key is defined in the updated hash, and use the data from there
# if it is, else just spew the input line.
open(MASTER, "<$master_filename") || die "cannot open $master_filename for reading: $!\n";
while(<MASTER>) {
my($key);
if ( /^([^,]*,[^,]*),/ ) {
if ( defined($updated{$1}) ) {
print $updated{$1};
}
else {
print;
}
}
else {
print;
}
}
sub usage {
my $r = shift || 0;
print "Usage:\n $0 masterfile [rfile1 [rfile2] [...]]\n";
exit($r);
}
awk 'NR == FNR {
rfi[$1,$2] = FNR
rfd[FNR] = $0
next
}
{
print (($1,$2) in rfi) ? rfd[rfi[$1,$2]] : $0
mf[$1,$2]
} END {
for (e in rfi)
if (!(e in mf))
print rfd[rfi[e]]
}' FS="," updated master > new_master
I suppose you should update from one returned file
at a time, otherwise you have to decide how to handle
duplicate keys if they exist.
If every returned file contain a different subset
(duplicate keys are not possible):
Code:
awk 'FILENAME != "master" {
rfi[$1,$2] = NR
rfd[NR] = $0
next
}
{
print (($1,$2) in rfi) ? rfd[rfi[$1,$2]] : $0
mf[$1,$2]
} END {
for (e in rfi)
if (!(e in mf))
print rfd[rfi[e]]
}' FS="," updated1 updated2 ... updatedn master > new_master
Last edited by radoulov; 12-12-2007 at 04:21 PM.
Reason: modified ...
LinuxQuestions.org is looking for people interested in writing
Editorials, Articles, Reviews, and more. If you'd like to contribute
content, let us know.