LinuxQuestions.org
Download your favorite Linux distribution at LQ ISO.
Home Forums Tutorials Articles Register
Go Back   LinuxQuestions.org > Forums > Non-*NIX Forums > Programming
User Name
Password
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Notices


Reply
  Search this Thread
Old 06-07-2013, 04:40 PM   #1
arbex5
LQ Newbie
 
Registered: Jul 2012
Location: Brazil
Distribution: Ubuntu
Posts: 17

Rep: Reputation: Disabled
Post A challenging script - Replace field of CSV file based on another CSV file


Hi,

I have a request that I think can be solved with a script.

I have two files, one is a GeoIP data from Maxmind, lets call File1, and the other is a file with cities and zone names, lets call File2. I need to replace a field in Maxmind that is numeric for a field in the other file wich is a name. Below there is part of each file.

File1

locId,country,region,city,postalCode,latitude,longitude
19319,"BR","21","Rio De Janeiro","",-22.9000,-43.2333,,
19304,"BR","27","São Paulo","",-23.4733,-46.6658,,
156797,"BR","29","Goias","",-15.9333,-50.1333,,
187217,"BR","14","Mato Grosso","",-15.2500,-56.7333,,

File2

zone_id, country, region, region_code, status
448, BR, 'Goias', 'GO', 1
450, BR, 'Mato Grosso', 'MT', 1
458, BR, 'Rio de Janeiro', 'RJ', 1
464, BR, 'Sao Paulo', 'SP', 1

Basically I need to replace all fields in Bold of File1 to it's corresponding fields in Bold and Red of File2 using the the region as a reference.

The end result should be a file File3 like this:

File3

locId,country,region,city,postalCode,latitude,longitude
19319,"BR","RJ","Rio De Janeiro","",-22.9000,-43.2333,,
19304,"BR","SP","São Paulo","",-23.4733,-46.6658,,
156797,"BR","GO","Goias","",-15.9333,-50.1333,,
187217,"BR","MT","Mato Grosso","",-15.2500,-56.7333,,

This file is huge, so make it manually is unpractical. And I have no idea how to accomplish it. I hope you do.
Thank you.
 
Old 06-07-2013, 06:07 PM   #2
Ser Olmy
Senior Member
 
Registered: Jan 2012
Distribution: Slackware
Posts: 3,347

Rep: Reputation: Disabled
If I understand you correctly, the pseudocode would look something like this:
Code:
while not EOF <file1> do
  read LINE from <file1>
  let CITY be field 4 of LINE where separator = ","
  let REGION be field 4 of (search <file2> for line containing CITY) where separator = ","
  let MODIFIED_LINE be (replace field 3 of LINE where separator = "," with REGION)
  write MODIFIED_LINE to <file3>
done
Does that look about right?

Edit: This should do the trick.
Code:
#!/bin/bash

geoip=file1
zonenames=file2
outfile=file3

exec 3< $geoip

# copy the header line verbatim from the source file
read line <&3
echo $line > $outfile

while read line <&3 ; do
  city=$(echo $line | cut --delimiter=, --fields 4 | sed "s/^\([\"']\)\(.*\)\1\$/\2/g")
  # Strip diacritical marks
  city="$(echo $city | iconv --to-code=ascii//TRANSLIT)"
  old_region=$(echo $line | cut --delimiter=, --fields 3 | sed "s/^\([\"']\)\(.*\)\1\$/\2/g")
  new_region=$(cat $zonenames | grep -ia "$city" | cut --delimiter=, --fields 4 | sed "s/[' ]//g")
  modified_line=$(echo $line | sed "s/$old_region/$new_region/" )
  echo $modified_line>> $outfile
done

exec 3>&-
This is probably not the best/optimal solution, but it works on the test data you provided.

If it weren't for the fact that the fields are quoted differently in the two files (and the use of diactitics in the first file), this would have been a perfect job for join.

Last edited by Ser Olmy; 06-07-2013 at 07:36 PM.
 
Old 06-07-2013, 11:10 PM   #3
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,011

Rep: Reputation: 3194Reputation: 3194Reputation: 3194Reputation: 3194Reputation: 3194Reputation: 3194Reputation: 3194Reputation: 3194Reputation: 3194Reputation: 3194Reputation: 3194
Did you search this site at all for this type of problem? There are a plethora of questions asked on this topic of multiple files being used to update / create a new file.

The simplest would be with a simple awk script. My only question I can see from looking at your data would be, is it important that the quotes are changed from single to double as per your example?
 
Old 06-08-2013, 09:55 AM   #4
arbex5
LQ Newbie
 
Registered: Jul 2012
Location: Brazil
Distribution: Ubuntu
Posts: 17

Original Poster
Rep: Reputation: Disabled
Hi Ser Olmy,
Thank you very much, I really appreciate your help

I followed your pseudocode and it's simple perfect to me.

First thing, I replaced the single quote from File2 to double quotes.
Then named the files on the script, and executed it.

I'm getting some issues but I think is related to some strange characters.
Right now, I dealing with them.

I'll keep trying then I post you back.
Thank you again.
 
Old 06-08-2013, 12:10 PM   #5
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,011

Rep: Reputation: 3194Reputation: 3194Reputation: 3194Reputation: 3194Reputation: 3194Reputation: 3194Reputation: 3194Reputation: 3194Reputation: 3194Reputation: 3194Reputation: 3194
well, if we can assume the fields are containing the correct data, ie Rio de Janeiro != Rio De Janeiro,
then it could be as simple as:
Code:
awk -F" *, *" 'FNR==NR{gsub(/\047/,"\"");id[$3]=$4;next}$3 = id[$4]' OFS="," file2 file1 > file3
If the header line is still required simply change last part to:
Code:
FNR == 1 || $3 = id[$4]
 
1 members found this post helpful.
Old 06-08-2013, 01:30 PM   #6
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,881

Rep: Reputation: 660Reputation: 660Reputation: 660Reputation: 660Reputation: 660Reputation: 660
Quote:
Originally Posted by grail View Post
Rio de Janeiro != Rio De Janeiro
Similarly, São Paulo != Sao Paulo.

Computers are picky about such seemingly minor discrepancies.

Daniel B. Martin
 
Old 06-08-2013, 10:30 PM   #7
arbex5
LQ Newbie
 
Registered: Jul 2012
Location: Brazil
Distribution: Ubuntu
Posts: 17

Original Poster
Rep: Reputation: Disabled
Solved

Hi Ser Olmy,
I triyed after adjust the data but I got an error, I don't know what is.
here is the error message:

Code:
sed: -e expression #1, char 7: unterminated `s' command

Hi grail,

I tested your script and it worked very nice and very fast.
Thank you very much.


Hi Daniel,
I was aware of that, so I used the command below to remove those accents:

Code:
$ unaccent ISO-8859-1 < myfile > myfile.unaccent
Thank you all guys.
I really appreciate your help.


I'll mark as SOLVED.

Last edited by arbex5; 06-09-2013 at 09:43 AM. Reason: Tagging the code to respect the formatting.
 
Old 06-09-2013, 09:15 AM   #8
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852

Rep: Reputation: 2037Reputation: 2037Reputation: 2037Reputation: 2037Reputation: 2037Reputation: 2037Reputation: 2037Reputation: 2037Reputation: 2037Reputation: 2037Reputation: 2037
Please use ***[code][/code]*** tags around your code and data, to preserve the original formatting and to improve readability. Do not use quote tags, bolding, colors, "start/end" lines, or other creative techniques. Thanks.


You know, I don't think this really has to be done in such a complex line-by-line way. Since each number apparently corresponds to a specific fixed region, then it may instead be more efficient to simply set up a one-time master array of num=name pairs (perhaps in a separate file), and then simply run over the input file replacing the one with the other.
 
Old 06-09-2013, 09:50 AM   #9
arbex5
LQ Newbie
 
Registered: Jul 2012
Location: Brazil
Distribution: Ubuntu
Posts: 17

Original Poster
Rep: Reputation: Disabled
Not line by line.

Hi David,

Changed the code part.

I understand your idea of create this array, but then it would be necessary to create it first, right?
How could this being accomplished?

Thank you.
 
Old 06-11-2013, 11:14 AM   #10
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852

Rep: Reputation: 2037Reputation: 2037Reputation: 2037Reputation: 2037Reputation: 2037Reputation: 2037Reputation: 2037Reputation: 2037Reputation: 2037Reputation: 2037Reputation: 2037
It's quite easy, really. At least with the input you supplied.

Using these two files:
Code:
#codefile.txt
"14","MT"
"21","RJ"
"27","SP"
"29","GO"

#infile.txt
locId,country,region,city,postalCode,latitude,longitude
19319,"BR","21","Rio De Janeiro","",-22.9000,-43.2333,,
19304,"BR","27","São Paulo","",-23.4733,-46.6658,,
156797,"BR","29","Goias","",-15.9333,-50.1333,,
187217,"BR","14","Mato Grosso","",-15.2500,-56.7333,,
And this command:
Code:
awk 'BEGIN{ FS=OFS="," } NR==FNR { a[$1]=$2 } NR!=FNR && FNR>1{ $3=a[$3]; print } NR!=FNR && FNR==1' codefile.txt infile.txt
I get this output:
Code:
locId,country,region,city,postalCode,latitude,longitude
19319,"BR","RJ","Rio De Janeiro","",-22.9000,-43.2333,,
19304,"BR","SP","São Paulo","",-23.4733,-46.6658,,
156797,"BR","GO","Goias","",-15.9333,-50.1333,,
187217,"BR","MT","Mato Grosso","",-15.2500,-56.7333,,
The trickiest parts are in getting it to handle the quotes and header line correctly. But I was able to simplify that a lot by simply keeping the quote marks on the entries in the codefile. Otherwise it would involve a more complicated process of removing them, replacing the number, and re-adding them to the output.
 
Old 06-11-2013, 11:40 AM   #11
arbex5
LQ Newbie
 
Registered: Jul 2012
Location: Brazil
Distribution: Ubuntu
Posts: 17

Original Poster
Rep: Reputation: Disabled
Yeah I see.
Quite a beautiful solution too.
I'll study them.

Thank you very much David.
 
Old 06-12-2013, 06:56 AM   #12
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 10,011

Rep: Reputation: 3194Reputation: 3194Reputation: 3194Reputation: 3194Reputation: 3194Reputation: 3194Reputation: 3194Reputation: 3194Reputation: 3194Reputation: 3194Reputation: 3194
If I may ... small adjustment using David's files, if you use the command next in the first test, the rest is reduced greatly:
Code:
awk 'BEGIN{ FS=OFS="," } NR==FNR { a[$1]=$2;next } FNR>1{ $3=a[$3] } 1' codefile.txt infile.txt
If you are new to awk, the above is still fairly clear.

It can be altered again to look like:
Code:
awk 'BEGIN{ FS=OFS="," } NR==FNR { a[$1]=$2;next } FNR == 1 || $3=a[$3] ' codefile.txt infile.txt
This can be a little confusing at first, but awk's default for a true action is to print
 
  


Reply

Tags
csv, script



Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off



Similar Threads
Thread Thread Starter Forum Replies Last Post
[SOLVED] How to script csv editing? Remove rows from csv file that do not contain certain text ingram87 Linux - Software 9 08-03-2012 12:45 PM
[SOLVED] Script to multiply 29th field in CSV file by 1,56 knilux Programming 44 08-03-2012 11:57 AM
Replace a field in quotes within a csv file hattori.hanzo Programming 3 05-29-2012 01:10 AM
Bash or PHP: Split csv file based on field value? guest Programming 4 02-06-2009 12:57 AM
AWK: change a particular field in a csv file help help help!!!! haydar68 Programming 20 08-03-2008 01:10 AM

LinuxQuestions.org > Forums > Non-*NIX Forums > Programming

All times are GMT -5. The time now is 01:18 AM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Open Source Consulting | Domain Registration