Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game. |
Notices |
Welcome to LinuxQuestions.org, a friendly and active Linux Community.
You are currently viewing LQ as a guest. By joining our community you will have the ability to post topics, receive our newsletter, use the advanced search, subscribe to threads and access many other special features. Registration is quick, simple and absolutely free. Join our community today!
Note that registered members see fewer ads, and ContentLink is completely disabled once you log in.
Are you new to LinuxQuestions.org? Visit the following links:
Site Howto |
Site FAQ |
Sitemap |
Register Now
If you have any problems with the registration process or your account login, please contact us. If you need to reset your password, click here.
Having a problem logging in? Please visit this page to clear all LQ-related cookies.
Get a virtual cloud desktop with the Linux distro that you want in less than five minutes with Shells! With over 10 pre-installed distros to choose from, the worry-free installation life is here! Whether you are a digital nomad or just looking for flexibility, Shells can put your Linux machine on the device that you want to use.
Exclusive for LQ members, get up to 45% off per month. Click here for more info.
|
 |
06-07-2013, 04:40 PM
|
#1
|
LQ Newbie
Registered: Jul 2012
Location: Brazil
Distribution: Ubuntu
Posts: 17
Rep: 
|
A challenging script - Replace field of CSV file based on another CSV file
Hi,
I have a request that I think can be solved with a script.
I have two files, one is a GeoIP data from Maxmind, lets call File1, and the other is a file with cities and zone names, lets call File2. I need to replace a field in Maxmind that is numeric for a field in the other file wich is a name. Below there is part of each file.
File1
locId,country,region,city,postalCode,latitude,longitude
19319,"BR","21","Rio De Janeiro","",-22.9000,-43.2333,,
19304,"BR","27","São Paulo","",-23.4733,-46.6658,,
156797,"BR","29","Goias","",-15.9333,-50.1333,,
187217,"BR","14","Mato Grosso","",-15.2500,-56.7333,,
File2
zone_id, country, region, region_code, status
448, BR, 'Goias', 'GO', 1
450, BR, 'Mato Grosso', 'MT', 1
458, BR, 'Rio de Janeiro', 'RJ', 1
464, BR, 'Sao Paulo', 'SP', 1
Basically I need to replace all fields in Bold of File1 to it's corresponding fields in Bold and Red of File2 using the the region as a reference.
The end result should be a file File3 like this:
File3
locId,country,region,city,postalCode,latitude,longitude
19319,"BR","RJ","Rio De Janeiro","",-22.9000,-43.2333,,
19304,"BR","SP","São Paulo","",-23.4733,-46.6658,,
156797,"BR","GO","Goias","",-15.9333,-50.1333,,
187217,"BR","MT","Mato Grosso","",-15.2500,-56.7333,,
This file is huge, so make it manually is unpractical. And I have no idea how to accomplish it. I hope you do.
Thank you.
|
|
|
06-07-2013, 06:07 PM
|
#2
|
Senior Member
Registered: Jan 2012
Distribution: Slackware
Posts: 3,349
Rep: 
|
If I understand you correctly, the pseudocode would look something like this:
Code:
while not EOF <file1> do
read LINE from <file1>
let CITY be field 4 of LINE where separator = ","
let REGION be field 4 of (search <file2> for line containing CITY) where separator = ","
let MODIFIED_LINE be (replace field 3 of LINE where separator = "," with REGION)
write MODIFIED_LINE to <file3>
done
Does that look about right?
Edit: This should do the trick.
Code:
#!/bin/bash
geoip=file1
zonenames=file2
outfile=file3
exec 3< $geoip
# copy the header line verbatim from the source file
read line <&3
echo $line > $outfile
while read line <&3 ; do
city=$(echo $line | cut --delimiter=, --fields 4 | sed "s/^\([\"']\)\(.*\)\1\$/\2/g")
# Strip diacritical marks
city="$(echo $city | iconv --to-code=ascii//TRANSLIT)"
old_region=$(echo $line | cut --delimiter=, --fields 3 | sed "s/^\([\"']\)\(.*\)\1\$/\2/g")
new_region=$(cat $zonenames | grep -ia "$city" | cut --delimiter=, --fields 4 | sed "s/[' ]//g")
modified_line=$(echo $line | sed "s/$old_region/$new_region/" )
echo $modified_line>> $outfile
done
exec 3>&-
This is probably not the best/optimal solution, but it works on the test data you provided.
If it weren't for the fact that the fields are quoted differently in the two files (and the use of diactitics in the first file), this would have been a perfect job for join.
Last edited by Ser Olmy; 06-07-2013 at 07:36 PM.
|
|
|
06-07-2013, 11:10 PM
|
#3
|
LQ Guru
Registered: Sep 2009
Location: Perth
Distribution: Arch
Posts: 10,038
|
Did you search this site at all for this type of problem? There are a plethora of questions asked on this topic of multiple files being used to update / create a new file.
The simplest would be with a simple awk script. My only question I can see from looking at your data would be, is it important that the quotes are changed from single to double as per your example?
|
|
|
06-08-2013, 09:55 AM
|
#4
|
LQ Newbie
Registered: Jul 2012
Location: Brazil
Distribution: Ubuntu
Posts: 17
Original Poster
Rep: 
|
Hi Ser Olmy,
Thank you very much, I really appreciate your help
I followed your pseudocode and it's simple perfect to me.
First thing, I replaced the single quote from File2 to double quotes.
Then named the files on the script, and executed it.
I'm getting some issues but I think is related to some strange characters.
Right now, I dealing with them.
I'll keep trying then I post you back.
Thank you again.
|
|
|
06-08-2013, 12:10 PM
|
#5
|
LQ Guru
Registered: Sep 2009
Location: Perth
Distribution: Arch
Posts: 10,038
|
well, if we can assume the fields are containing the correct data, ie Rio de Janeiro != Rio De Janeiro,
then it could be as simple as:
Code:
awk -F" *, *" 'FNR==NR{gsub(/\047/,"\"");id[$3]=$4;next}$3 = id[$4]' OFS="," file2 file1 > file3
If the header line is still required simply change last part to:
Code:
FNR == 1 || $3 = id[$4]
|
|
1 members found this post helpful.
|
06-08-2013, 01:30 PM
|
#6
|
Senior Member
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,896
|
Quote:
Originally Posted by grail
Rio de Janeiro != Rio De Janeiro
|
Similarly, São Paulo != Sao Paulo.
Computers are picky about such seemingly minor discrepancies.
Daniel B. Martin
|
|
|
06-08-2013, 10:30 PM
|
#7
|
LQ Newbie
Registered: Jul 2012
Location: Brazil
Distribution: Ubuntu
Posts: 17
Original Poster
Rep: 
|
Solved
Hi Ser Olmy,
I triyed after adjust the data but I got an error, I don't know what is.
here is the error message:
Code:
sed: -e expression #1, char 7: unterminated `s' command
Hi grail,
I tested your script and it worked very nice and very fast.
Thank you very much.
Hi Daniel,
I was aware of that, so I used the command below to remove those accents:
Code:
$ unaccent ISO-8859-1 < myfile > myfile.unaccent
Thank you all guys.
I really appreciate your help.
I'll mark as SOLVED.
Last edited by arbex5; 06-09-2013 at 09:43 AM.
Reason: Tagging the code to respect the formatting.
|
|
|
06-09-2013, 09:15 AM
|
#8
|
Bash Guru
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852
|
Please use *** [code][/code]*** tags around your code and data, to preserve the original formatting and to improve readability. Do not use quote tags, bolding, colors, "start/end" lines, or other creative techniques. Thanks.
You know, I don't think this really has to be done in such a complex line-by-line way. Since each number apparently corresponds to a specific fixed region, then it may instead be more efficient to simply set up a one-time master array of num=name pairs (perhaps in a separate file), and then simply run over the input file replacing the one with the other.
|
|
|
06-09-2013, 09:50 AM
|
#9
|
LQ Newbie
Registered: Jul 2012
Location: Brazil
Distribution: Ubuntu
Posts: 17
Original Poster
Rep: 
|
Not line by line.
Hi David,
Changed the code part.
I understand your idea of create this array, but then it would be necessary to create it first, right?
How could this being accomplished?
Thank you.
|
|
|
06-11-2013, 11:14 AM
|
#10
|
Bash Guru
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852
|
It's quite easy, really. At least with the input you supplied.
Using these two files:
Code:
#codefile.txt
"14","MT"
"21","RJ"
"27","SP"
"29","GO"
#infile.txt
locId,country,region,city,postalCode,latitude,longitude
19319,"BR","21","Rio De Janeiro","",-22.9000,-43.2333,,
19304,"BR","27","São Paulo","",-23.4733,-46.6658,,
156797,"BR","29","Goias","",-15.9333,-50.1333,,
187217,"BR","14","Mato Grosso","",-15.2500,-56.7333,,
And this command:
Code:
awk 'BEGIN{ FS=OFS="," } NR==FNR { a[$1]=$2 } NR!=FNR && FNR>1{ $3=a[$3]; print } NR!=FNR && FNR==1' codefile.txt infile.txt
I get this output:
Code:
locId,country,region,city,postalCode,latitude,longitude
19319,"BR","RJ","Rio De Janeiro","",-22.9000,-43.2333,,
19304,"BR","SP","São Paulo","",-23.4733,-46.6658,,
156797,"BR","GO","Goias","",-15.9333,-50.1333,,
187217,"BR","MT","Mato Grosso","",-15.2500,-56.7333,,
The trickiest parts are in getting it to handle the quotes and header line correctly. But I was able to simplify that a lot by simply keeping the quote marks on the entries in the codefile. Otherwise it would involve a more complicated process of removing them, replacing the number, and re-adding them to the output.
|
|
|
06-11-2013, 11:40 AM
|
#11
|
LQ Newbie
Registered: Jul 2012
Location: Brazil
Distribution: Ubuntu
Posts: 17
Original Poster
Rep: 
|
Yeah I see.
Quite a beautiful solution too.
I'll study them.
Thank you very much David.
|
|
|
06-12-2013, 06:56 AM
|
#12
|
LQ Guru
Registered: Sep 2009
Location: Perth
Distribution: Arch
Posts: 10,038
|
If I may ... small adjustment using David's files, if you use the command next in the first test, the rest is reduced greatly:
Code:
awk 'BEGIN{ FS=OFS="," } NR==FNR { a[$1]=$2;next } FNR>1{ $3=a[$3] } 1' codefile.txt infile.txt
If you are new to awk, the above is still fairly clear.
It can be altered again to look like:
Code:
awk 'BEGIN{ FS=OFS="," } NR==FNR { a[$1]=$2;next } FNR == 1 || $3=a[$3] ' codefile.txt infile.txt
This can be a little confusing at first, but awk's default for a true action is to print 
|
|
|
All times are GMT -5. The time now is 07:56 AM.
|
LinuxQuestions.org is looking for people interested in writing
Editorials, Articles, Reviews, and more. If you'd like to contribute
content, let us know.
|
Latest Threads
LQ News
|
|