Good is the enemy of Great
Latin-1 is the enemy of UTF-8
You write web apps. You understand the web is global, and want to support internationalization. You want UTF-8.
UTF-8 is extremely sane. Well, as sane as an encoding can be that features backwards-compatibility with ASCII.
Everything you care about supports UTF-8. Trust me: you want it everywhere.
Problem is, every last part of the web-application stack will fight you on your quest towards UTF-8 purity. What follows is a playbook to win your pervasive-UTF-8 battle.
First, you’re going to need diagnostic tools. There are two main weapons:
The programs you use to view text, be it dynamic from a tool’s output (Console.app) or a static file like a database dump (TextEdit, BBEdit, TextMate), have encoding logic. They will attempt to auto-detect encoding and paint you a pretty picture.
Avoid them. When debugging, you don’t want a pretty picture, you want The Truth. You need to be able to see raw byte-streams to debug this stuff.
A common problem is mixed encodings. That is, a file or stream that says it’s UTF-8 but has a chunk of Latin-1 in it. This is invisible corruption since most software won’t alert you when it hits mixed encodings (BBEdit is a notable exception).
Using a hex editor or viewing raw hex streams allows you to spot when a character that should be taking up two or three bytes (UTF-8) is only taking one (Latin-1).
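If you'd rather not eyeball every byte by hand, the same check can be roughly automated. Here's a minimal Python sketch that reports the offsets where a supposedly-UTF-8 byte stream stops decoding cleanly. The helper name find_invalid_utf8 is hypothetical, and the sample data is made up for illustration.

```python
def find_invalid_utf8(data: bytes):
    """Return (offset, offending_bytes) pairs where data is not valid UTF-8."""
    problems = []
    pos = 0
    while pos < len(data):
        try:
            # If the remainder decodes cleanly, there is nothing left to report.
            data[pos:].decode("utf-8")
            break
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice we decoded.
            start = pos + err.start
            end = pos + err.end
            problems.append((start, data[start:end]))
            pos = end  # skip the bad bytes and keep scanning
    return problems


# A "UTF-8" stream with one Latin-1 encoded character (0xEB) stuck in the middle.
mixed = "I\u00f1t".encode("utf-8") + "\u00eb".encode("latin-1") + "rn".encode("utf-8")
print(mixed.hex(" "))
print(find_invalid_utf8(mixed))
```

This only catches Latin-1 bytes that happen to form invalid UTF-8, which is the usual case. Data that was transcoded twice is still valid UTF-8 and will sail right through, so you still want the raw hex view.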
A Unicode Canary-in-a-Coal-Mine.
You need a chunk of data that exercises the Unicode system: a sentinel value that you can push through your stack and make sure it survives a round-trip intact.
Initially I went with something like “tésting”, but it turns out that’s not enough — it will losslessly survive undesired transcoding to Latin-1 and back again.
No, you need something hard-core: “Iñtërnâtiônàlizætiøn” (complete with curly quotes).
(If you can’t read that word in your browser, it looks like the word “Internationalization” that’s had an umlaut omelet thrown in its face, and you’ve discovered yet another encoding error somewhere between where I’m typing this and where you’re reading it.)
“Iñtërnâtiônàlizætiøn” is a great word to push through your systems because, curly quotes included, it can’t be represented in Latin-1 and will catch all sorts of hidden failure scenarios. Coupled with viewing raw hex, there’s no place for encoding bugs to hide.
(For the record, “Iñtërnâtiônàlizætiøn” looks like E2 80 9C 49 C3 B1 74 C3 AB 72 6E C3 A2 74 69 C3 B4 6E C3 A0 6C 69 7A C3 A6 74 69 C3 B8 6E E2 80 9D in UTF-8 in hex.)
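A quick way to double-check those bytes, and to see why a plain accented word makes a poor canary, is a few lines of Python. This is just a sketch: the helper names hex_bytes and survives_latin1 are made up for illustration, and the canary is spelled with escapes so the example itself can't be mangled by an editor.

```python
# The canary, curly quotes included (U+201C and U+201D).
CANARY = "\u201cI\u00f1t\u00ebrn\u00e2ti\u00f4n\u00e0liz\u00e6ti\u00f8n\u201d"


def hex_bytes(text: str, encoding: str = "utf-8") -> str:
    """Render the encoded form of text as space-separated uppercase hex."""
    return " ".join(f"{b:02X}" for b in text.encode(encoding))


def survives_latin1(text: str) -> bool:
    """True if every character in text can be represented in Latin-1."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


print(hex_bytes(CANARY))
print(survives_latin1("t\u00e9sting"))  # plain accented word
print(survives_latin1(CANARY))          # canary with curly quotes
```

The accented letters all fit in Latin-1. It's the curly quotes that don't, which is why they're part of the canary.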
※ ※ ※
OK, those are your weapons. Now for some concrete tips, starting from the bottom-up:
MySQL DDL: MySQL uses Latin1 by default. You need to set default charset to utf8:
drop table if exists t_my_table;
create table t_my_table (
    ...
) engine=innodb default charset=utf8 collate=utf8_unicode_ci;
The major gotcha here is that if you fail to specify default charset=utf8 in your DDL, the table will default to Latin1, yet simple storing and retrieval of UTF-8 will still appear to work.
This is because there are no invalid characters in Latin-1 (well, except for NUL (0x00)). You can jam anything in there and MySQL will dutifully store it for you and give it back when asked.
No errors, no warnings.
MySQL Importing/Restoration: Consider the following file, myutf8.sql:
drop table if exists myutf8_table;
create table myutf8_table (
    demo varchar(255)
) engine=innodb default charset=utf8 collate=utf8_unicode_ci;
insert into myutf8_table values ('“Iñtërnâtiônàlizætiøn”');
Now let’s load it up:
mysql -e 'drop database if exists myutf8_db;create database myutf8_db;'
mysql myutf8_db < myutf8.sql
You’ve already failed.
myutf8.sql’s file encoding was UTF-8, but no one told mysql that. So mysql assumed Latin-1 and corrupted the data.
When I view myutf8_table’s lonely single row in Querious, I see it has a value of â€œIÃ±tÃ«rnÃ¢tiÃ´nÃ lizÃ¦tiÃ¸nâ€, a far cry from the “Iñtërnâtiônàlizætiøn” value we intended.
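To see how that garbage comes about, here's a rough Python sketch that misreads UTF-8 bytes one at a time as single-byte characters. It assumes the "Latin-1" in play behaves like Windows-1252, with the handful of bytes that code page leaves undefined passed through unchanged. misread_as_cp1252 and undo_misread are hypothetical helpers, not anything mysql exposes.

```python
CANARY = "\u201cI\u00f1t\u00ebrn\u00e2ti\u00f4n\u00e0liz\u00e6ti\u00f8n\u201d"


def misread_as_cp1252(raw: bytes) -> str:
    """Interpret each byte as one character, the way a Latin-1 client would."""
    chars = []
    for b in raw:
        one = bytes([b])
        try:
            chars.append(one.decode("cp1252"))
        except UnicodeDecodeError:
            # Bytes undefined in cp1252 (e.g. 0x9D): fall back to Latin-1.
            chars.append(one.decode("latin-1"))
    return "".join(chars)


def undo_misread(garbled: str) -> bytes:
    """Reverse misread_as_cp1252, recovering the original bytes."""
    out = bytearray()
    for ch in garbled:
        try:
            out += ch.encode("cp1252")
        except UnicodeEncodeError:
            out += ch.encode("latin-1")
    return bytes(out)


garbled = misread_as_cp1252(CANARY.encode("utf-8"))
print(garbled)
print(undo_misread(garbled).decode("utf-8") == CANARY)
```

Every byte becomes its own character, so the 33-byte canary turns into 33 characters of noise. Because the mapping is one-to-one, the damage can be reversed if you catch it before anything else touches the data.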
Fortunately it’s easy to instruct mysql that an input file has a specific encoding:
mysql -e 'drop database if exists myutf8_db;create database myutf8_db;'
mysql --default-character-set=utf8 myutf8_db < myutf8.sql
--default-character-set=utf8 makes all the difference. I recommend using it all the time. I’ve gotten to the point where I’m nervous if I spot an invocation of mysql that lacks an explicit --default-character-set.
MySQL Exporting/Backup: Use --default-character-set=utf8 like you do when importing:
mysqldump --user=root --opt --default-character-set=utf8 myutf8_db
Relatedly, Mo McRoberts has a nice post on when MySQL encodings go bad.
JDBC Connection URL: It’s been a while since I’ve used Java, but it looks like you want to set two options,
HTTP Headers: Your web server should vend a
HTML Documents: In theory your web server should be configured to declare all your HTML content as UTF-8 with its Content-Type HTTP header, but unfortunately that’s not always something you can control. You can also declare your UTF-8 conformance in the HTML document itself with a <meta> tag:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
</head>
<body>
</body>
</html>
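Whether the declaration comes from the server's header or from the http-equiv value above, it's the same Content-Type string, and it's worth checking mechanically. Here's a small Python sketch that pulls the charset out of such a value using the standard library's header parser. The helper names declared_charset and declares_utf8 are invented for this example.

```python
from email.message import Message


def declared_charset(content_type: str):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def declares_utf8(content_type: str) -> bool:
    """True only when the value explicitly names UTF-8."""
    return declared_charset(content_type) == "utf-8"


for value in ("text/html; charset=utf-8",
              "text/html; charset=ISO-8859-1",
              "text/html"):
    print(value, "->", declared_charset(value), declares_utf8(value))
```

A missing charset is treated as a failure here on purpose: if nothing declares the encoding, something downstream will guess, and guessing is how you end up staring at hex.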
HTML Forms: Specify accept-charset in your <form> tag to tell the browser to submit user-entered data encoded in UTF-8:
<form action="foo" accept-charset="UTF-8">...</form>
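To make it concrete what "submitted as UTF-8" means on the wire, here's a hedged Python sketch using urllib.parse to mimic a URL-encoded form submission. The field name name and the sample value are placeholders, and a real browser does this encoding for you.

```python
from urllib.parse import parse_qs, urlencode

value = "I\u00f1t\u00ebrn"  # a short slice of the canary

# What a UTF-8 form submission looks like: two percent-escapes per accented letter.
utf8_body = urlencode({"name": value})
print(utf8_body)

# The same field submitted as Latin-1: one percent-escape per accented letter.
latin1_body = urlencode({"name": value}, encoding="latin-1")
print(latin1_body)

# A server that assumes UTF-8 recovers the first body intact...
print(parse_qs(utf8_body)["name"][0] == value)

# ...but turns the Latin-1 bytes into U+FFFD replacement characters.
print(parse_qs(latin1_body)["name"][0])
```

The one-escape-versus-two pattern is the percent-encoded version of what you'd look for in a hex dump.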
Ajax/XHR/XMLHttpRequest: Don’t sweat it: the W3C XMLHttpRequest standard specifies that POST data sent as a string will always be encoded as UTF-8.