[SOLVED] Moving from MySQL to Postgresql; Encoding issue

Ramurd · 02-26-2016, 05:01 AM

Hi all,

I have this MySQL database, and for unknown reasons it was created with LATIN1 encoding. However, the standards dictate UTF-8 encoding;
Now for many reasons, we want to move away from MySQL towards PostgreSQL (9.4 and later);

The thing is, one of the tables uses special characters (like country names with special glyphs in it (take for example how Curacao actually spells (with the c-cedille)).
I tried various ways to convert the LATIN1 encoding used to UTF-8, but so far, the data comes out garbled.

The databases run on the same host, which runs Linux (RHEL6).

What I tried:
- dumping the data, convert with iconv
- dumping the data with a PHP script, convert on the fly with iconv()
- dumping the data with the same PHP script and convert on the fly with mb_convert_encoding()

As a test, I created a PostgreSQL database with LATIN1 encoding and dump the data there; The data compares well, it looks the same... However, I don't know how to copy the data from one database to another; afaik PostgreSQL does not support that. So, I would need to dump and load...

However, I'm overlooking something, and I cannot figure out what... It's probably something silly and very small (thus easy to overlook); it may actually be so big that it doesn't fit on this world... :-)

Anyone who can enlighten me?

tronayne · 02-26-2016, 08:23 AM

You can define, system-wide, your system locale. You can see what it is by opening a terminal and

Code:

locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE=C
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

Mine, listed above, is set to en_US.UTF-8 (pretty much equivalent to UNICODE but not totally).

PostgreSQL is perfectly happy with UTF-8 (and your system probably will be too). If you're not defaulting to English, you can determine the UTF-8 locale setting with

Code:

locale -a

It's a list of every locale supported by Linux, so you'd want to pipe the output of local -a through grep to get the setting you want.

If you search Google for convert latin1 to utf-8 there are tools for converting; I have not tried any but the two at the top of the list might just work for you. Do a dump of your data base and run the dump though one of those tools and load the result into PostgreSQL would seem like the easy way to accomplish it.

My system locale is set as above because I am using DSpace, a large collection management system. I don't have any problems with using that locale with Slackware. Get into /etc/profile.d and look at lang.sh, that's where you set your system-wide locale. You can compare mine with your default lang.sh and set for your choice of language (after looking at locale -a of course).

Code:

cat lang.sh
#!/bin/sh
# Set the system locale.  (no, we don't have a menu for this ;-)
# For a list of locales which are supported by this machine, type:
#   locale -a

# en_US is the Slackware default locale:
#export LANG=en_US

# 'C' is the old Slackware (and UNIX) default, which is 127-bit
# ASCII with a charmap setting of ANSI_X3.4-1968.  These days,
# it's better to use en_US or another modern $LANG setting to
# support extended character sets.
#export LANG=C

# There is also support for UTF-8 locales, but be aware that
# some programs are not yet able to handle UTF-8 and will fail to
# run properly.  In those cases, you can set LANG=C before
# starting them.  Still, I'd avoid UTF unless you actually need it.
export LANG=en_US.UTF-8

# Another option for en_US:
#export LANG=en_US.ISO8859-1

# One side effect of the newer locales is that the sort order
# is no longer according to ASCII values, so the sort order will
# change in many places.  Since this isn't usually expected and
# can break scripts, we'll stick with traditional ASCII sorting.
# If you'd prefer the sort algorithm that goes with your $LANG
# setting, comment this out.
export LC_COLLATE=C

# End of /etc/profile.d/lang.sh

You will need to log off and back on for the change to take effect; you may need to reboot. Just make sure to open an terminal and execute locale to check.

Hope this helps some.

Ramurd · 02-26-2016, 08:51 AM

Thanks for the clear and extensive answer; not sure it solves, but might get me picking up that piece that I am obviously missing.

From the google results, this is the second:

Code:

mysqldump --add-drop-table --user=root my_database | replace CHARSET=latin1 CHARSET=utf8 | iconv -f latin1 -t utf8 | less

Still the data is garbled. In a way I actually had tried this one; but had written to file first... and the php script does the same iconv thing I guess.
The system locale is, like yours, UTF-8:

Code:

locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

My PostgreSQL is indeed happy with UTF-8; but is not happy with being fed LATIN1 data in a UTF-8 database. Where MySQL will silently does some mumbojumbo, so that things might look okay, but actually are not.

Another thing I tried: create a LATIN1 database in PostgreSQL, dump it and convert that data to UTF-8 ... same result; for some reason the diacritics garble up

The result of the query on the latin1 database:

België | BE

Same query on the utf-8 database:

BelgiÃ« | BE

Maybe this is due to collation; or something else... I cannot really grab why the data looks garbled after conversion... this should be cleaner, right?

tronayne · 02-26-2016, 12:17 PM

I did not realize this was so dang complicated (don't have any MySQL/MariaDB data bases that have "foreign" characters in them -- I usually just unload from MariaDB and load into PostgreSQL).

Looking down the list of Google search I find Convert mysql database from latin1 to utf8 the RIGHT way (https://blogs.harvard.edu/djcp/2010/01/convert-mysql-database-from-latin1-to-utf8-the-right-way/), which looks as though it would be a "good" way to accomplish the task. It also looks like some folk didn't have much luck if you read the whole thing. Sigh.

I'm going to do some digging and see if there's anything that might be of use that's a little newer than that article.

tronayne · 02-26-2016, 12:58 PM

Just wiggling down the line there are two things we need to be aware of: MySQL/MariaDB default to latin1, PostgreSQL defaults to UTF-8. Therein, I suppose, lies the rub.

One question: are you actually using MySQL or are you using MariaDB? Looking at Character Sets and Collations in MariaDB, the quantity is larger in the latter and it looks like it might be easier to do the conversion in MariaDB (there are suggested ways of doing so in MariaDB documentation).

I noted that you're using Slackware (although the production server is Red Hat)? If you're running Slackware 14.1 (fully patched) you might be able to copy the data base off and try loading it in a MariaDB, perhaps doing the conversion on the fly. Slackware loads MariaDB by default and I've not had any trouble using it with existing MySQL data bases, maybe even just fiddle with the specific table you're having trouble with.

This seems harder than it ought to be but, given the real differences between latin1 and UTF-8 (three bytes per character in UTF-8).

Hope this helps.

Ramurd · 02-26-2016, 01:49 PM

Thanks;

Will dig through all possible iconv codepages and see if any gives a nice result:

Code:

iconv --list | while read cpg
do
  echo "Trying ${cpg}"
  grep Cura input.sql | iconv -f ${cpg} -t utf-8
  sleep 2
done

This on a very small subset containing 'Curaçao' ; hopefully one day I'll get it translated to utf-8
Stopping for the weekend now; will continue monday :-)

Ramurd · 02-26-2016, 01:54 PM

The server I take the data from is actually a real MySQL (and rather outdated)
I do run Slackware 14.1 at the office; so will give your suggestion a try next monday... it's helpful and at least worth a try.

One thing that I noticed; since the mysql database says it's latin1; I did a mysqldump and file tells me the file is ISO-8859 (not which one; but using any of those with iconv did not give a proper result)

The slackware 'mysqldump' is actually MariaDB; and the data looks already a bit garbled there, unless I tell it character encoding is latin1 explicitly on the commandline.

You're being a great help; at the very least you're helping me out of the circle where I don't see a solution...

Fwiw:
The server is RHEL, my workstation is Slackware; The source database is MySQL (mysql Ver 14.14 Distrib 5.1.73, for redhat-linux-gnu (x86_64) using readline 5.1); The target database PostgreSQL 9.5.0 (with plans to upgrade to 9.5.1)

Ramurd · 02-29-2016, 06:26 AM

Good morning again :-) It's monday and that means... MONDAY...

Ah well; so I went along and started like this (after a few cups of coffee, to get started)
Installed MariaDB (yea, not that much a fan); imported data from the server from a mysqldump piped through mysql.
Checked the data, and it looked ok.

Then went along like the documentation: ALTER TABLE <table> CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;
Then did a new 'SELECT' and it looked OKAY. (But I guess my client probably thinks everything is LATIN1)... but I felt brave (dunno why): Dumped the data with mysqldump, removed the mysql garbage and did a few changes to make the data compatible (\' changed to '' and backticks removed entirely; the data is not that complex and I already had made the schema)

Truncated the target table and loaded the file...
But it's still garbled... also if I look at the file:

Code:

file dump.sql

shows UTF-8 with very long lines.

... aaaaaand... it's garbled when I look into the file; So maybe it's my local encoding, which is the Slackware default en_US; so, changed it a bit:

Code:

export LANG=utf8

still no luck... what am I missing?

tronayne · 02-29-2016, 07:59 AM

This might sound a little silly but it's the way I deal with all load/unload files in any data base (both MySQL/MariaDB and PostgreSQL).

I come from a long, long time working with Informix (like over 30 years), bear with me here. Informix defaults to a vertical bar as a field separator -- doesn't appear in any language, doesn't cause any problems, don't have to quote strings, no problem with any characters. I get data files with tabs, I mass change all the tabs to vertical bars, usually delete any double quotes (not necessary but I do it anyway), then load. I do that with any CSV file I receive and it saves a lot of grief. All DBMSs accept unquoted fields into columns defined as strings (and don't care if there happens to be an apostrophe in the string).

You simply use the --fields-enclosed-by=... option to with mysqldump (and the corresponding loader).

MySQL/MariaDB defaults to latin1, PostgreSQL defaults to UTF-8. One option in mysqldump is the --default-character-set during the dump, don't know if you've tried that but it may work for you. The trick is to do as little screwing around with the dump as possible -- you're looking for a clean dump that you can subsequently load into a different data base and you're going to want UTF-8 plain text characters.

What I do to clean up CSV and other data files is to use a little sed file:

Code:

cat sedit
s/",/|/g
s/"/|/g
s/,/|/g
s/  */ /g
s/  *$//
s/| /|/g
/^$/d
s/|$//g
s/$/|/g

It's used

Code:

sed -f sedit input_file > output_file

Cleans up messes in CSV files (and sed is really, really fast). The directives change quoted fields to |, change all double spaces to single, removes trailing blanks, removes blank lines, that sort of thing. You get a "clean" file; you can look at it in a text editor (or OpenOffice/LibreOffice), you don't want any embedded data base directives, just the data as text. Use as few directives as you can to get there and try to clean things up with pipes and filters externally from the data base engine (it's a lot easier). I kind of wonder if doing a latin1 to UTF conversion with sed might be a way, just have to find a list of latin1 and a list of UTF-8 and do it that way, substitute this with that sort of thing. Hm. Little project. Have to go away for a couple hours this morning (it's like 0900 here) and give that a look-see.

There has to be some way, somehow to do this. I'll keep digging an see what I can come up with.

tronayne · 02-29-2016, 08:46 AM

I just thought of something -- are you looking at your output with vi? It's 8-bit, you wont see the funky characters. You will see them with vim, though (and with OpenOffice/LibreOffic).

Head over to Dugan Chen's web page at http://duganchen.ca/writings/slackware/fonts/, maybe read some of the info on that page, but look down to the fourth paragraph for his download that includes the file quickbrown.txt. Download the file, untar it somewhere (it's bunch of good stuff) and look at quickbrown.txt with cat and vi and vim. You should see all the characters in various languages (not with vi). If you don't, read around a little and see what he suggests.

You might actually have the conversion and, well, don't know it, eh?

Sorry, I have occasional brain farts, forgot all about that.

Hope this helps.

Ramurd · 02-29-2016, 10:21 AM

Hehe; thanks! I'm bearing with you... And indeed I'm feeling very, very stupid.
While I alias vi to vim (for that and other issues) I only just realized something else...

It's the terminal and the encoding it is in... (I tend to work in KDE, as I have many graphical interfaces to work with, and personally I'm happy with this) ...
So, I went to the terminal settings, and guess what... it's in ISO-8859-15 encoding...

So, I changed this to UTF-8; grepped the string from the file... it's garbled. Yay!
So I grepped the same string; run through iconf -f latin1 -t utf8 and... it's what I hoped for!

Now the thing is; did I have a problem at first, or was I just perceiving a problem that wasn't there?

tronayne · 02-29-2016, 01:06 PM

You may not have had a problem, or, what the heck, you might have; best news is It Works and you no longer have the problem. All is well that ends.

I think the problem may have been the ISO-8859-15 encoding for viewing rather than UTF-8 encoding. I cannot say for absolute sure that those are incompatible (kinda sorta looks that way though). These multiple encoding things are a pain where you sit.

Don't feel stupid, it isn't you it's somebody else's idea of what should and should not be default settings -- you've got your locale set to UTF-8? Other stuff should honer that, not force you to know that you have to reset your window manager default. Just keep that in mind when you start using PostgreSQL and are looking at search results and the like.

Glad to know you've got it solved.

Best of luck with the migration.

Ramurd · 03-01-2016, 07:43 AM

Hehee; thanks for the support though. Marking this thread as solved.