MySQL encoding problems (HELP HELP HELP!)

tulane · 04-17-2010, 08:54 PM

Hi,

Sorry for the unprofessional subject, but I am literally at my wits' end. There's just something that I'm not getting, but I wish I knew what it was.

Here is the problem:

I recently migrated from one server with MySQL 3.23 (yes, it was a little outdated) to another with MySQL 5.0.77. I had a database under MySQL 3.23 that contained data in the cp1251 encoding. When I did a mysqldump of the data, it was converted to utf-8 for some reason (I specified --default-character-set=cp1251) and (AND THIS IS IMPORTANT) to the iso-8859-1 subset of utf-8 (i.e. instead of Cyrillic characters, all the data got exported as a bunch of vowels with various diacritics). I don't understand why, nor what I can do to prevent mysqldump from doing this.
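For anyone trying to picture what happened, the symptom can be reproduced in a few lines of Python. This is only a sketch, under the assumption that the old server held raw cp1251 bytes while the dump treated them as Latin-1 and re-encoded them as utf-8. The sample word and variable names are illustrative and not taken from the real database.

```python
# Assumption: the table held raw cp1251 bytes that were labelled as Latin-1.
original = "Привет"

stored_bytes = original.encode("cp1251")   # bytes as they sat on the old server
misread = stored_bytes.decode("latin-1")   # the same bytes interpreted as Latin-1
dump_bytes = misread.encode("utf-8")       # what would end up in the dump file

print(misread)  # Ïðèâåò
```

Each Cyrillic letter becomes one accented Latin letter, which matches the "vowels with various diacritics" in the dump.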

It then, of course, imported as utf8 to the new server. While I can get the contents of the database to display as Cyrillic by setting the character encoding on the website to "windows-1251", this is a backwards way of going about it. Furthermore, and importantly, sorting doesn't work properly.

So my question is:
1) Is there any way I can get mysqldump on the old server to recognize that the data is cp1251 and not iso-8859-1? That would solve my problem.
2) If that fails, is there any way I can convert the latin diacritic symbols currently stored as utf8 to iso-8859-1? (Converting from that to cp1251 should be fairly straightforward... ?... I guess? Maybe? Hopefully?)

I've already tried:
iconv -f utf8 -t iso-8859-1
and
iconv -f utf8 -t cp1251
of the dump files. It doesn't work. Tells me there is an illegal input sequence at position X. Googling that has given me no satisfactory answer.
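One possible reason for the "illegal input sequence" error, offered as an assumption rather than a diagnosis: MySQL's latin1 is documented as cp1252, not strict ISO-8859-1, so bytes such as 0x97 (a long dash in cp1251) come out of the dump as characters that ISO-8859-1 cannot represent. Below is a minimal Python sketch of a round trip that tolerates this. The function name `undo_misread` is hypothetical.

```python
def undo_misread(text: str) -> str:
    """Turn cp1251 text that was misread as latin1/cp1252 back into Cyrillic.

    Assumption: every character in `text` came from a single byte that was
    decoded as cp1252 (or as plain Latin-1 where cp1252 has no mapping).
    """
    raw = bytearray()
    for ch in text:
        try:
            raw += ch.encode("cp1252")
        except UnicodeEncodeError:
            # Falls back to Latin-1; raises if the character fits neither,
            # e.g. text that is already proper Cyrillic.
            raw += ch.encode("latin-1")
    return raw.decode("cp1251")


print(undo_misread("Ïðèâåò"))  # Привет
```

A file would first have to be read as utf-8 before being passed through a function like this, and written back out as utf-8 afterwards.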

I've already looked at every source I could in order to solve this dilemma. Please, if you have any ideas on this, HEEEEEELLLLLLP!

norbert74 · 04-18-2010, 03:04 AM

Can you post the complete statement you use for the dump?
I'm sure you have already checked this, but nevertheless I ask:
Do you use the parameter --tab for your mysqldump?
Are there tables which contain columns in several char sets?

tulane · 04-18-2010, 05:01 PM

I did try using the --tab option. I tried various mysqldump commands, but none would produce a file with anything other than utf-8 and none would produce a file that could be converted with iconv.

I DID finally solve the problem, though this has got to be the most convoluted and bass-ackwards way to do it. With any luck, some poor soul will find it useful.

I was using PuTTY to connect to the server, and in PuTTY I could set the character set PuTTY displayed for me. If you set the display character set to utf-8, PuTTY is effectively performing an on-the-fly conversion from utf-8 to ascii (i.e. the conversion that iconv can't seem to do). That got me thinking, so I turned on logging in PuTTY, then did:
cat mysqldump_file

Opened the log up in WordPad on my Windows computer, where the default non-Unicode character set is cp1251. Sure enough, all the Cyrillic characters were displaying correctly. From there, it was a simple step to save the file as a Unicode file, upload it to the server and load it into the database.

But what the heck? My faith has been shaken. Since when does Unix need to use Windows as a crutch?

tulane · 04-18-2010, 08:07 PM

No, I spoke too soon. That did not solve the problem, as the data that I saved in Windows is still Unicode.

I can't believe there's no way to resolve this. I mean, how hard can it be to do a simple unicode character to non-unicode character search-and-replace?

chrism01 · 04-18-2010, 09:13 PM

You could try posting in the MySQL forums (http://forums.mysql.com/), or the mysqldump manual (http://dev.mysql.com/doc/refman/5.0/en/mysqldump.html) may help.

tulane · 04-18-2010, 09:56 PM

Solved it.

Ran a simple
sed -r -i 's/\u{hex_code_of_latin_symbol}/\u{hex_code_of_cyrillic_symbol}/g' mysqldump_file
for each letter.

It was only a little painful. If you were smarter than me you might be able to do this in one regex function. The Latin letters with diacritics are Unicode code points U+00C0 through U+00FF, while the Cyrillic letters are U+0410 through U+044F.
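Since the two ranges line up one-to-one, the per-letter replacements can in principle be folded into a single pass. Here is a minimal Python sketch of that idea, not the command that was actually run. The names `TABLE` and `latin_to_cyrillic` are made up for illustration, and the two extra entries for Ё/ё are an addition based on their cp1251 positions (0xA8 and 0xB8).

```python
# Shift U+00C0..U+00FF onto U+0410..U+044F (a constant offset of 0x350).
OFFSET = 0x0410 - 0x00C0
TABLE = {cp: cp + OFFSET for cp in range(0x00C0, 0x0100)}

# Assumption: Ё and ё sat at cp1251 bytes 0xA8 and 0xB8, outside the main range.
TABLE[0x00A8] = 0x0401  # Ё
TABLE[0x00B8] = 0x0451  # ё


def latin_to_cyrillic(text: str) -> str:
    """Replace each misread Latin-1 character with its Cyrillic counterpart."""
    return text.translate(TABLE)


print(latin_to_cyrillic("Ïðèâåò"))  # Привет
```

Note that this would also rewrite any genuinely Latin accented text in the dump, so it only suits data known to be Cyrillic throughout.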

tulane · 04-18-2010, 09:57 PM

And chris:
I extensively searched through all available documentation before posting here. I'm really not the question-asking type, preferring to research stuff on my own. In this case, I just felt I had no choice...

chrism01 · 04-19-2010, 01:18 AM

Fair enough, just thought the mysql forums would know if anyone...