LinuxQuestions.org
Review your favorite Linux distribution.
Home Forums Tutorials Articles Register
Go Back   LinuxQuestions.org > Forums > Linux Forums > Linux - Software
User Name
Password
Linux - Software This forum is for Software issues.
Having a problem installing a new program? Want to know which application is best for the job? Post your question in this forum.

Notices


Reply
  Search this Thread
Old 01-28-2010, 06:48 PM   #1
Tinkster
Moderator
 
Registered: Apr 2002
Location: earth
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067
Blog Entries: 11

Rep: Reputation: 928Reputation: 928Reputation: 928Reputation: 928Reputation: 928Reputation: 928Reputation: 928Reputation: 928
MySQL, UTF-8 and accented characters


Hi,

How do I stop a MySQL database that is defined as
utf-8 from assuming that 'Sergio' and 'Sérgio' are
the same thing?

It's refusing to insert 'Sérgio' claiming that:
Code:
ERROR 1062 (23000) at line 4433: Duplicate entry 'Sérgio' for key 1


Cheers,
Tink
 
Old 01-29-2010, 08:10 PM   #2
bigrigdriver
LQ Addict
 
Registered: Jul 2002
Location: East Centra Illinois, USA
Distribution: Debian stable
Posts: 5,908

Rep: Reputation: 356Reputation: 356Reputation: 356Reputation: 356
I think the answer can be found in Chapter 9 of the MySql Reference Manual. At the moment on creating the database, you need to specify the character set to be used (which contains both the accented and unaccented leters), not just specifying utf-8. There are several character sets listed, such as German, Swedish, English, etc.

You may be able to redefine your database to treat accented letters as different from unaccented, or you may be able to redefine your entries/queries to treat them as different.

I hope that this helps solve the problem, or at least gets the cognitive juices flowing which leads to the solution.
 
Old 01-30-2010, 03:27 AM   #3
Tinkster
Moderator
 
Registered: Apr 2002
Location: earth
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067

Original Poster
Blog Entries: 11

Rep: Reputation: 928Reputation: 928Reputation: 928Reputation: 928Reputation: 928Reputation: 928Reputation: 928Reputation: 928
Quote:
Originally Posted by bigrigdriver View Post
I think the answer can be found in Chapter 9 of the MySql Reference Manual. At the moment on creating the database, you need to specify the character set to be used (which contains both the accented and unaccented leters), not just specifying utf-8. There are several character sets listed, such as German, Swedish, English, etc.
I guess I'll have a read of chapter 9, even though the concept of
having to specify TWO character sets (e.g. UTF-8 and Swedish)
appears rather moronic considering that UTF was conceived to get
rid of such idiosyncratic distinctions. But then, this is MySQL,
and I've never held it in very high esteem.



Quote:
Originally Posted by bigrigdriver View Post
You may be able to redefine your database to treat accented letters as different from unaccented, or you may be able to redefine your entries/queries to treat them as different.

I hope that this helps solve the problem, or at least gets the cognitive juices flowing which leads to the solution.
Heh. Kind of. Thanks for the hint - where ever it may lead.



Cheers,
Tink
 
Old 02-01-2010, 04:43 PM   #4
Tinkster
Moderator
 
Registered: Apr 2002
Location: earth
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067

Original Poster
Blog Entries: 11

Rep: Reputation: 928Reputation: 928Reputation: 928Reputation: 928Reputation: 928Reputation: 928Reputation: 928Reputation: 928
'lo patient readers,

A brief update ... after 4 days of ongoing pain I still haven't managed to
import my data ... whether I use "INSERT" statements, "LOAD DATA", or the
command line tool mysqlimport, I can't get accented characters into the database.

Short of setting my system locale to cp1250 which is MySQLs (odd) default
and reinstalling everything I've tried pretty much every option and combination
available, using latin1 (otherwise known as iso-8856-1), utf8 and ASCII with
en_NZ and en_US from the command-line (and in the database server and client
config files), and the best thing I can achieve is either names with accents
to be truncated in the database (skipping ~50000 people because of 'duplicate key
violations'), or giving a "funny" character that looks something like y with a
circumflex for an accented e and a y with a grave for an i with a grave - which
would help if one was willing to memorize all the alternatives and use them in
searches .... hahaha. And that's with collation being set to either utf8_generic_ci
or utf8_bin.


People in the #mysql channel on freenode didn't have any ideas, and quite frankly
I'm pretty fed up with the product and its community.

Funnily enough PostgreSQL on the same machine imported the same file w/o a
hick-up, albeit somewhat slower than MySQLs pathetic workings.



Anyway ... should anyone have any other ideas I'd be grateful to hear of them.



Cheers,
Tink
 
Old 02-01-2010, 05:08 PM   #5
bathory
LQ Guru
 
Registered: Jun 2004
Location: Piraeus
Distribution: Slackware
Posts: 13,163
Blog Entries: 1

Rep: Reputation: 2032Reputation: 2032Reputation: 2032Reputation: 2032Reputation: 2032Reputation: 2032Reputation: 2032Reputation: 2032Reputation: 2032Reputation: 2032Reputation: 2032
Have you tried to transform the .sql file from utf-8 to iso8859-1, or whatever encoding you have to use:
Code:
cat file.sql | iconv -f utf-8 -t iso8859-1 > file-iso8859-1.sql
 
Old 02-01-2010, 06:59 PM   #6
Tinkster
Moderator
 
Registered: Apr 2002
Location: earth
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067

Original Poster
Blog Entries: 11

Rep: Reputation: 928Reputation: 928Reputation: 928Reputation: 928Reputation: 928Reputation: 928Reputation: 928Reputation: 928
Most certainly did ... and variations thereof. As I said - this has been
going for over 4 days now ... :/
 
Old 02-02-2010, 03:54 AM   #7
carbonfiber
Member
 
Registered: Sep 2009
Location: Sparta
Posts: 237

Rep: Reputation: 46
Quickie #1: set the collation to utf8_bin (default is utf8_general_ci for a utf8 character set)
Quickie #2: set the character set to 'binary'

Look up the difference between utf8_general_ci and utf8_bin to see why.

Last edited by carbonfiber; 02-02-2010 at 04:04 AM.
 
Old 02-02-2010, 04:12 AM   #8
Tinkster
Moderator
 
Registered: Apr 2002
Location: earth
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067

Original Poster
Blog Entries: 11

Rep: Reputation: 928Reputation: 928Reputation: 928Reputation: 928Reputation: 928Reputation: 928Reputation: 928Reputation: 928
Quote:
Originally Posted by carbonfiber View Post
Quickie #1: set the collation to utf8_bin (default is utf8_general_ci for a utf8 character set)
Quickie #2: set the character set to 'binary'

Look up the difference between utf8_general_ci and utf8_bin to see why.
[edit]
Ooops .. that just changed.
Tried #2 at this stage (which was #1 a few minutes ago)
[/edit]

Interesting suggestion - tried it. Same result. Instead of
Code:
ì
I get
Code:
ý

Cheers,
Tink

Last edited by Tinkster; 02-02-2010 at 04:14 AM.
 
Old 02-02-2010, 04:19 AM   #9
carbonfiber
Member
 
Registered: Sep 2009
Location: Sparta
Posts: 237

Rep: Reputation: 46
Code:
mysql> create table tinkster (name varchar(12) character set utf8 collate utf8_bin unique);
Query OK, 0 rows affected (0.06 sec)
mysql> insert into tinkster values ('Sergio'),('Sérgio'),('ì'),('ý');
Query OK, 4 rows affected (0.00 sec)
Records: 4  Duplicates: 0  Warnings: 0
mysql> select * from tinkster;
+---------+
| name    |
+---------+
| Sergio  |
| Sérgio  |
| ì       |
| ý       |
+---------+
4 rows in set (0.00 sec)
 
Old 02-02-2010, 04:25 AM   #10
Tinkster
Moderator
 
Registered: Apr 2002
Location: earth
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067

Original Poster
Blog Entries: 11

Rep: Reputation: 928Reputation: 928Reputation: 928Reputation: 928Reputation: 928Reputation: 928Reputation: 928Reputation: 928
Quote:
Originally Posted by carbonfiber View Post
Code:
mysql> create table tinkster (name varchar(12) character set utf8 collate utf8_bin unique);
Query OK, 0 rows affected (0.06 sec)
mysql> insert into tinkster values ('Sergio'),('Sérgio'),('ì'),('ý');
Query OK, 4 rows affected (0.00 sec)
Records: 4  Duplicates: 0  Warnings: 0
mysql> select * from tinkster;
+---------+
| name    |
+---------+
| Sergio  |
| Sérgio  |
| ì       |
| ý       |
+---------+
4 rows in set (0.00 sec)

Yah, fine, good and well ... I can reproduce that. Now stick
those in a file, and import it.

[edit]
And I'm not being facetious - it's that I have ~ 1.8 Mio
rows that I'd like to import into the database, and no, I
won't be copy & pasting them into an interactive session.
[/edit]



Cheers,
Tink

Last edited by Tinkster; 02-02-2010 at 04:32 AM.
 
Old 02-02-2010, 04:26 AM   #11
Tinkster
Moderator
 
Registered: Apr 2002
Location: earth
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067

Original Poster
Blog Entries: 11

Rep: Reputation: 928Reputation: 928Reputation: 928Reputation: 928Reputation: 928Reputation: 928Reputation: 928Reputation: 928
Quote:
Originally Posted by carbonfiber View Post
Quickie #1: set the collation to utf8_bin (default is utf8_general_ci for a utf8 character set)
Quickie #2: set the character set to 'binary'

Look up the difference between utf8_general_ci and utf8_bin to see why.
And #1 leads to the same result as #2.

Thanks, anyway.


Cheers,
Tink
 
Old 02-02-2010, 04:30 AM   #12
Tinkster
Moderator
 
Registered: Apr 2002
Location: earth
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067

Original Poster
Blog Entries: 11

Rep: Reputation: 928Reputation: 928Reputation: 928Reputation: 928Reputation: 928Reputation: 928Reputation: 928Reputation: 928
Oh, and another update. MySQL 5.1.37 on WinXP has the same problem
that 5.0.84 on Slackware has. Importing accented characters from a
file appears (to me at least) impossible with MySQL, and it makes no
difference whether that's a delimited text file, or whether I wrap
the data into properly escaped insert statements.



Cheers,
Tink
 
Old 02-02-2010, 04:39 AM   #13
carbonfiber
Member
 
Registered: Sep 2009
Location: Sparta
Posts: 237

Rep: Reputation: 46
Code:
$ cat ~/DB 
Sergio
Sérgio
ì
ý

-------- -------- -------- --------

mysql> delete from tinkster;
Query OK, 4 rows affected (0.00 sec)

mysql> load data local infile '~/DB' into table tinkster;
Query OK, 4 rows affected (0.00 sec)
Records: 4  Deleted: 0  Skipped: 0  Warnings: 0

mysql> select * from tinkster;
+---------+
| name    |
+---------+
| Sergio  |
| Sérgio  |
| ì       |
| ý       |
+---------+
4 rows in set (0.00 sec)
mysql version: 5.1.42.

Last edited by carbonfiber; 02-02-2010 at 04:41 AM.
 
Old 02-02-2010, 04:44 AM   #14
Tinkster
Moderator
 
Registered: Apr 2002
Location: earth
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067

Original Poster
Blog Entries: 11

Rep: Reputation: 928Reputation: 928Reputation: 928Reputation: 928Reputation: 928Reputation: 928Reputation: 928Reputation: 928
Quote:
Originally Posted by carbonfiber View Post
Code:
$ cat ~/DB 
Sergio
Sérgio
ì
ý

-------- -------- -------- --------

mysql> delete from tinkster;
Query OK, 4 rows affected (0.00 sec)

mysql> load data local infile '~/DB' into table tinkster;
Query OK, 4 rows affected (0.00 sec)
Records: 4  Deleted: 0  Skipped: 0  Warnings: 0

mysql> select * from tinkster;
+---------+
| name    |
+---------+
| Sergio  |
| Sérgio  |
| ì       |
| ý       |
+---------+
4 rows in set (0.00 sec)
mysql version: 5.1.42.
Now that looks promising ... can you please post your
'locale' output and your my.cnf?

And what does 'file' say your DB file is?


Cheers,
Tink
 
Old 02-02-2010, 04:50 AM   #15
carbonfiber
Member
 
Registered: Sep 2009
Location: Sparta
Posts: 237

Rep: Reputation: 46
Sure.

Code:
$ locale
LANG=en_US.utf8
LC_CTYPE="en_US.utf8"
LC_NUMERIC="en_US.utf8"
LC_TIME="en_US.utf8"
LC_COLLATE=C
LC_MONETARY="en_US.utf8"
LC_MESSAGES="en_US.utf8"
LC_PAPER="en_US.utf8"
LC_NAME="en_US.utf8"
LC_ADDRESS="en_US.utf8"
LC_TELEPHONE="en_US.utf8"
LC_MEASUREMENT="en_US.utf8"
LC_IDENTIFICATION="en_US.utf8"
LC_ALL=
my.cnf is the 'default' as shipped by Arch Linux: http://repos.archlinux.org/wsvn/pack...ra-i686/my.cnf (I don't usually run MySQL on this machine, I installed it just to see if I could help you fix your problem - that's why I'm using the default configuration file)

Code:
$ file DB 
DB: UTF-8 Unicode text
 
  


Reply



Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off



Similar Threads
Thread Thread Starter Forum Replies Last Post
How to input non-utf characters from utf-8 linux enviroment? jadas Linux - General 6 02-07-2009 03:20 PM
Filenames with accented characters and XFS MQMan Linux - Software 2 07-31-2008 05:28 PM
Typing accented characters in Linux VicRic Linux - Newbie 3 05-21-2007 01:41 PM
Quanta accented characters spaniard Linux - Software 4 11-04-2005 12:08 AM
Accented characters in C Orkie Programming 2 05-17-2005 12:50 PM

LinuxQuestions.org > Forums > Linux Forums > Linux - Software

All times are GMT -5. The time now is 07:57 AM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Open Source Consulting | Domain Registration