[SOLVED] Need help with awk and two csv: field match / substitution

archduke83 · 04-26-2012, 04:37 AM

Hi everybody,
I'm a newbie in awk and i'm trying to understand how to achieve this:
I have two .csv files, with semi-colon separated values
file1 is the "map" file, structured like this:

CITY_NAME;NEW_REGION;OLD_REGION;NEW_ZIP_CODE;OLD_ZIP_CODE

and file2 is the data file:

NEW_REGION;CITY_NAME;STREET_NAME;NEW_ZIP_CODE

basically, i have to match field 1, 2, and 4 from file1 with, respectively, field 2,1 and 4 from file 2,
whenever a match is found, makem a substitution in file2 with the OLD_REGION and OLD_ZIP field from file1
and produce a third file with just the records modified.

I'm desperately trying to understand the awk 'NR==FNR{a[$1]=$3;next... logic, but I'mreally stuck at the moment.
Could you please give me any help, that would be much appreciated, thanks.
Cheers,
Archduke83

pan64 · 04-26-2012, 06:21 AM

so you have an awk script? Just post it and ask what is unclear.
start to read this: http://www.gnu.org/software/gawk/man...etting-Started, 1.3 Some Simple Examples

grail · 04-26-2012, 07:01 AM

pan64 has the good oil on where to start. As for the snippet:

Code:

NR==FNR{a[$1]=$3;next}

NR - number of records in all files passed to awk

FNR - number of records per file starting at 1 each time a new file is started

So the first part will only ever happen for the first file passed to awk. After this the NR number will keep increasing to finally be the total of all FNRs for all files

a[$1] = $3 - create array "a" with an index equal to the value in the first field ($1) and assign the value of the third field ($3). As per pan64's link you will find a section on arrays
saying how all indexes are strings, even if they look like numbers.

next - process the next line from the current file

Hope that helps.

archduke83 · 04-27-2012, 07:05 AM

Ok, i came up with this:

nawk -F";" 'NR==FNR{a[$1FS$2]=$1FS$2;print a[$1FS$2];next} $2FS$1 in a {printf $2FS$1 "\n"}' file1 file2

But it doesn't seem to act properly, I just set the correct pattern in memory from the 1st file, but that doesn't match with the pattern i create from the second file.

What am I missing?
Thank you for your help guys.

grail · 04-27-2012, 09:37 AM

How does what you have written reflect your original query?

Code:

nawk -F";" 'NR==FNR{a[$1FS$2]=$1FS$2;print a[$1FS$2];next} $2FS$1 in a {printf $2FS$1 "\n"}' file1 file2

a[$1FS$2]=$1FS$2 - i have to match field 1, 2, and 4 from file1, OLD_REGION and OLD_ZIP

So your example makes no use of $4 and is not saving OLD_REGION and OLD_ZIP, which are fields 3 and 5

print a[$1FS$2] - Why? In the output you asked for, "makem a substitution in file2 with the OLD_REGION and OLD_ZIP field from file1". So no request to have field 1 & 2 from file 1

$2FS$1 in a - This part makes sense but of course is still missing field 4

printf $2FS$1 "\n" - Just use print as it applies a new line automatically.

You may need to read the link provided by pan64 a little further.

Also, be cautious as nawk does not support all the features of gawk, which is what the link provides details on.

archduke83 · 04-27-2012, 10:57 AM

ok, sorry if I made a mistake before, I didn't mention the right fields. Here I'm trying to make the replacements (the fields are correct). Now it's:

nawk -F";" 'NR==FNR{a[$1]=$1;b[$1]=$4;c[$1]=$2;d[$1]=$3;next}
{for (cont in a) if(match(a[cont],$2) > 0 && match(b[cont],$1) > 0)
gsub($2,c[cont]);gsub($1,d[cont])}' map_prov_tc.csv data1.csv

When, instead of putting gsub, I print the correspondences (print $2FS$1 "->" c[cont] FS d[cont]}) It retrieves the correct matches.
But I can't replace the correct values.

archduke83 · 04-27-2012, 11:33 AM

This seems to do the job:

nawk -F";" 'NR==FNR{a[$1]=$1;b[$1]=$4;c[$1]=$2;d[$1]=$3;next}
{for (cont in a) if(match(a[cont],$2) > 0 && match(b[cont],$1) > 0)
sub($2,c[cont]) sub($1,d[cont])}1' map_prov_tc.csv data1.csv

grail · 04-27-2012, 11:34 AM

When you use gsub you are changing the value of the line, but if you never print the line it cannot be displayed or redirected into another file.

grail · 04-27-2012, 11:41 AM

Yes, as now you have printed the results ... good work

Here is an alternative:

Code:

awk 'BEGIN{OFS=FS=";"}NR==FNR{a[$1$2$4]=$3FS$5;next}split(a[$2$1$4],b){$1=b[1];$4=b[2]}1' file1 file2

archduke83 · 04-27-2012, 11:49 AM

Thank you both grail and pan64