[SOLVED] A challenging script - Replace field of CSV file based on another CSV file

arbex5 · 06-07-2013, 04:40 PM

Hi,

I have a request that I think can be solved with a script.

I have two files, one is a GeoIP data from Maxmind, lets call File1, and the other is a file with cities and zone names, lets call File2. I need to replace a field in Maxmind that is numeric for a field in the other file wich is a name. Below there is part of each file.

File1

locId,country,region,city,postalCode,latitude,longitude
19319,"BR","21","Rio De Janeiro","",-22.9000,-43.2333,,
19304,"BR","27","São Paulo","",-23.4733,-46.6658,,
156797,"BR","29","Goias","",-15.9333,-50.1333,,
187217,"BR","14","Mato Grosso","",-15.2500,-56.7333,,

File2

zone_id, country, region, region_code, status
448, BR, 'Goias', 'GO', 1
450, BR, 'Mato Grosso', 'MT', 1
458, BR, 'Rio de Janeiro', 'RJ', 1
464, BR, 'Sao Paulo', 'SP', 1

Basically I need to replace all fields in Bold of File1 to it's corresponding fields in Bold and Red of File2 using the the region as a reference.

The end result should be a file File3 like this:

File3

locId,country,region,city,postalCode,latitude,longitude
19319,"BR","RJ","Rio De Janeiro","",-22.9000,-43.2333,,
19304,"BR","SP","São Paulo","",-23.4733,-46.6658,,
156797,"BR","GO","Goias","",-15.9333,-50.1333,,
187217,"BR","MT","Mato Grosso","",-15.2500,-56.7333,,

This file is huge, so make it manually is unpractical. And I have no idea how to accomplish it. I hope you do.
Thank you.

Ser Olmy · 06-07-2013, 06:07 PM

If I understand you correctly, the pseudocode would look something like this:

Code:

while not EOF <file1> do
  read LINE from <file1>
  let CITY be field 4 of LINE where separator = ","
  let REGION be field 4 of (search <file2> for line containing CITY) where separator = ","
  let MODIFIED_LINE be (replace field 3 of LINE where separator = "," with REGION)
  write MODIFIED_LINE to <file3>
done

Does that look about right?

Edit: This should do the trick.

Code:

#!/bin/bash

geoip=file1
zonenames=file2
outfile=file3

exec 3< $geoip

# copy the header line verbatim from the source file
read line <&3
echo $line > $outfile

while read line <&3 ; do
  city=$(echo $line | cut --delimiter=, --fields 4 | sed "s/^\([\"']\)\(.*\)\1\$/\2/g")
  # Strip diacritical marks
  city="$(echo $city | iconv --to-code=ascii//TRANSLIT)"
  old_region=$(echo $line | cut --delimiter=, --fields 3 | sed "s/^\([\"']\)\(.*\)\1\$/\2/g")
  new_region=$(cat $zonenames | grep -ia "$city" | cut --delimiter=, --fields 4 | sed "s/[' ]//g")
  modified_line=$(echo $line | sed "s/$old_region/$new_region/" )
  echo $modified_line>> $outfile
done

exec 3>&-

This is probably not the best/optimal solution, but it works on the test data you provided.

If it weren't for the fact that the fields are quoted differently in the two files (and the use of diactitics in the first file), this would have been a perfect job for join.

grail · 06-07-2013, 11:10 PM

Did you search this site at all for this type of problem? There are a plethora of questions asked on this topic of multiple files being used to update / create a new file.

The simplest would be with a simple awk script. My only question I can see from looking at your data would be, is it important that the quotes are changed from single to double as per your example?

arbex5 · 06-08-2013, 09:55 AM

Hi Ser Olmy,
Thank you very much, I really appreciate your help

I followed your pseudocode and it's simple perfect to me.

First thing, I replaced the single quote from File2 to double quotes.
Then named the files on the script, and executed it.

I'm getting some issues but I think is related to some strange characters.
Right now, I dealing with them.

I'll keep trying then I post you back.
Thank you again.

grail · 06-08-2013, 12:10 PM

well, if we can assume the fields are containing the correct data, ie Rio de Janeiro != Rio De Janeiro,
then it could be as simple as:

Code:

awk -F" *, *" 'FNR==NR{gsub(/\047/,"\"");id[$3]=$4;next}$3 = id[$4]' OFS="," file2 file1 > file3

If the header line is still required simply change last part to:

Code:

FNR == 1 || $3 = id[$4]

danielbmartin · 06-08-2013, 01:30 PM

Quote:

Originally Posted by grail

Rio de Janeiro != Rio De Janeiro

Similarly, São Paulo != Sao Paulo.

Computers are picky about such seemingly minor discrepancies.

Daniel B. Martin

arbex5 · 06-08-2013, 10:30 PM

Hi Ser Olmy,
I triyed after adjust the data but I got an error, I don't know what is.
here is the error message:

Code:

sed: -e expression #1, char 7: unterminated `s' command

Hi grail,

I tested your script and it worked very nice and very fast.
Thank you very much.

Hi Daniel,
I was aware of that, so I used the command below to remove those accents:

Code:

$ unaccent ISO-8859-1 < myfile > myfile.unaccent

Thank you all guys.
I really appreciate your help.

I'll mark as SOLVED.

David the H. · 06-09-2013, 09:15 AM

Please use ***[code][/code]*** tags around your code and data, to preserve the original formatting and to improve readability. Do not use quote tags, bolding, colors, "start/end" lines, or other creative techniques. Thanks.

You know, I don't think this really has to be done in such a complex line-by-line way. Since each number apparently corresponds to a specific fixed region, then it may instead be more efficient to simply set up a one-time master array of num=name pairs (perhaps in a separate file), and then simply run over the input file replacing the one with the other.

arbex5 · 06-09-2013, 09:50 AM

Hi David,

Changed the code part.

I understand your idea of create this array, but then it would be necessary to create it first, right?
How could this being accomplished?

Thank you.

David the H. · 06-11-2013, 11:14 AM

It's quite easy, really. At least with the input you supplied.

Using these two files:

Code:

#codefile.txt
"14","MT"
"21","RJ"
"27","SP"
"29","GO"

#infile.txt
locId,country,region,city,postalCode,latitude,longitude
19319,"BR","21","Rio De Janeiro","",-22.9000,-43.2333,,
19304,"BR","27","São Paulo","",-23.4733,-46.6658,,
156797,"BR","29","Goias","",-15.9333,-50.1333,,
187217,"BR","14","Mato Grosso","",-15.2500,-56.7333,,

And this command:

Code:

awk 'BEGIN{ FS=OFS="," } NR==FNR { a[$1]=$2 } NR!=FNR && FNR>1{ $3=a[$3]; print } NR!=FNR && FNR==1' codefile.txt infile.txt

I get this output:

Code:

locId,country,region,city,postalCode,latitude,longitude
19319,"BR","RJ","Rio De Janeiro","",-22.9000,-43.2333,,
19304,"BR","SP","São Paulo","",-23.4733,-46.6658,,
156797,"BR","GO","Goias","",-15.9333,-50.1333,,
187217,"BR","MT","Mato Grosso","",-15.2500,-56.7333,,

The trickiest parts are in getting it to handle the quotes and header line correctly. But I was able to simplify that a lot by simply keeping the quote marks on the entries in the codefile. Otherwise it would involve a more complicated process of removing them, replacing the number, and re-adding them to the output.

arbex5 · 06-11-2013, 11:40 AM

Yeah I see.
Quite a beautiful solution too.
I'll study them.

Thank you very much David.

grail · 06-12-2013, 06:56 AM

If I may ... small adjustment using David's files, if you use the command next in the first test, the rest is reduced greatly:

Code:

awk 'BEGIN{ FS=OFS="," } NR==FNR { a[$1]=$2;next } FNR>1{ $3=a[$3] } 1' codefile.txt infile.txt

If you are new to awk, the above is still fairly clear.

It can be altered again to look like:

Code:

awk 'BEGIN{ FS=OFS="," } NR==FNR { a[$1]=$2;next } FNR == 1 || $3=a[$3] ' codefile.txt infile.txt

This can be a little confusing at first, but awk's default for a true action is to print