Back in the day of MySQL 4, UTF-8 was a relatively straight-forward thing. You declared the charset to be utf-8 and behold! so it was. Fast-forward to today and ensuring your database is actually supporting UTF-8 is not so straight-forward.
I had noticed some issues with Kotoba that I thought resolved around collation (the linguistic strategy a database uses to sort strings) as evidenced by Japanese words not sorting in their natural-order (e.g. あいうえお等). Namely, I just chalked it up to a simple mis-configuration. However, in developing character (read Kanji!) functionality this past week (see the next post!), I noticed some more aberrant behavior. In particular, when searching for kanji I would get false positives. For example, if I searched for 会 I also got a few other kanji; a linguistically non-sensical result. I started to dig a bit deeper, and to my disappoint much of the underlying data in Kotoba was corrupted.
The issue, at it is core, is that while I had configured the database for UTF-8, I had missed a few spots that fundamentally enabled the database to store non-UTF-8 characters: a very subtle form of 文字化け.
The existing literature on the topic including possible fixes [here, here, here, here, here, here, horror story here, here] encourage the configuration of the MySQL daemon (mysqld) to use
skip-character-set-client-handshake. This effectively ignores the client handshake with the server, and forces communications to use utf-8. This can be configured as below in your MySQL
my.cnf file, or:
[client] default-character-set=utf8 [mysqld] default-character-set = utf8 skip-character-set-client-handshake character-set-server = utf8 collation-server = utf8_general_ci init-connect = SET NAMES utf8
However, you may not have access to configure your MySQL server as is currently the case with Kotoba’s shared hosting. So how does one get around this problem? In my case, ensure my client (Rails 2.2) strictly enforces utf-8.
database.yml to include
encoding: utf8, or:
development: adapter: mysql database: DATABASE username: USERNAME password: PASSWORD socket: /tmp/mysql.sock encoding: utf8 timeout: 5000
environment.rb the line:
$KCODE = "UTF8"
application.rb the following:
before_filter :set_charset def set_charset headers["Content-Type"] = "text/html; charset=UTF-8" end
And if you really are paranoid, and you should be, then I would also add (blatant copy of Artuaz’s set_names_utf8)
module MyAppUtf8 class SetNamesUtf8 def self.filter(controller) suppress(ActiveRecord::StatementInvalid) do ActiveRecord::Base.connection.execute 'SET NAMES UTF8' end true end end end
And also update
init.rb to include:
## ## Ensure that all SQL queries use UTF8 ## ActionController::Base.send :prepend_before_filter, MyAppUtf8::SetNamesUtf8 suppress(ActiveRecord::StatementInvalid) do ActiveRecord::Base.connection.execute 'SET NAMES UTF8' end
For awhile I tried to convert the data from its original charset to utf8; however, all the tools I used (including
charguess) did not help me recover the data. In the end, I dumped all the corrupted tables and re-imported them. All data is now sufficiently remedied. Lesson learned? Check!