[SOLVED] A challenging script - Replace field of CSV file based on another CSV file
ProgrammingThis forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.
Notices
Welcome to LinuxQuestions.org, a friendly and active Linux Community.
You are currently viewing LQ as a guest. By joining our community you will have the ability to post topics, receive our newsletter, use the advanced search, subscribe to threads and access many other special features. Registration is quick, simple and absolutely free. Join our community today!
Note that registered members see fewer ads, and ContentLink is completely disabled once you log in.
If you have any problems with the registration process or your account login, please contact us. If you need to reset your password, click here.
Having a problem logging in? Please visit this page to clear all LQ-related cookies.
Get a virtual cloud desktop with the Linux distro that you want in less than five minutes with Shells! With over 10 pre-installed distros to choose from, the worry-free installation life is here! Whether you are a digital nomad or just looking for flexibility, Shells can put your Linux machine on the device that you want to use.
Exclusive for LQ members, get up to 45% off per month. Click here for more info.
A challenging script - Replace field of CSV file based on another CSV file
Hi,
I have a request that I think can be solved with a script.
I have two files, one is a GeoIP data from Maxmind, lets call File1, and the other is a file with cities and zone names, lets call File2. I need to replace a field in Maxmind that is numeric for a field in the other file wich is a name. Below there is part of each file.
File1
locId,country,region,city,postalCode,latitude,longitude
19319,"BR","21","Rio De Janeiro","",-22.9000,-43.2333,,
19304,"BR","27","São Paulo","",-23.4733,-46.6658,,
156797,"BR","29","Goias","",-15.9333,-50.1333,,
187217,"BR","14","Mato Grosso","",-15.2500,-56.7333,,
If I understand you correctly, the pseudocode would look something like this:
Code:
while not EOF <file1> do
read LINE from <file1>
let CITY be field 4 of LINE where separator = ","
let REGION be field 4 of (search <file2> for line containing CITY) where separator = ","
let MODIFIED_LINE be (replace field 3 of LINE where separator = "," with REGION)
write MODIFIED_LINE to <file3>
done
Does that look about right?
Edit: This should do the trick.
Code:
#!/bin/bash
geoip=file1
zonenames=file2
outfile=file3
exec 3< $geoip
# copy the header line verbatim from the source file
read line <&3
echo $line > $outfile
while read line <&3 ; do
city=$(echo $line | cut --delimiter=, --fields 4 | sed "s/^\([\"']\)\(.*\)\1\$/\2/g")
# Strip diacritical marks
city="$(echo $city | iconv --to-code=ascii//TRANSLIT)"
old_region=$(echo $line | cut --delimiter=, --fields 3 | sed "s/^\([\"']\)\(.*\)\1\$/\2/g")
new_region=$(cat $zonenames | grep -ia "$city" | cut --delimiter=, --fields 4 | sed "s/[' ]//g")
modified_line=$(echo $line | sed "s/$old_region/$new_region/" )
echo $modified_line>> $outfile
done
exec 3>&-
This is probably not the best/optimal solution, but it works on the test data you provided.
If it weren't for the fact that the fields are quoted differently in the two files (and the use of diactitics in the first file), this would have been a perfect job for join.
Did you search this site at all for this type of problem? There are a plethora of questions asked on this topic of multiple files being used to update / create a new file.
The simplest would be with a simple awk script. My only question I can see from looking at your data would be, is it important that the quotes are changed from single to double as per your example?
Please use ***[code][/code]*** tags around your code and data, to preserve the original formatting and to improve readability. Do not use quote tags, bolding, colors, "start/end" lines, or other creative techniques. Thanks.
You know, I don't think this really has to be done in such a complex line-by-line way. Since each number apparently corresponds to a specific fixed region, then it may instead be more efficient to simply set up a one-time master array of num=name pairs (perhaps in a separate file), and then simply run over the input file replacing the one with the other.
locId,country,region,city,postalCode,latitude,longitude
19319,"BR","RJ","Rio De Janeiro","",-22.9000,-43.2333,,
19304,"BR","SP","São Paulo","",-23.4733,-46.6658,,
156797,"BR","GO","Goias","",-15.9333,-50.1333,,
187217,"BR","MT","Mato Grosso","",-15.2500,-56.7333,,
The trickiest parts are in getting it to handle the quotes and header line correctly. But I was able to simplify that a lot by simply keeping the quote marks on the entries in the codefile. Otherwise it would involve a more complicated process of removing them, replacing the number, and re-adding them to the output.
LinuxQuestions.org is looking for people interested in writing
Editorials, Articles, Reviews, and more. If you'd like to contribute
content, let us know.