Kotoba achieved another important milestone this evening. Characters! As alluded to in previous posts, my work has been focused on developing a language-agnostic model of language characters. The work is heavily inspired by Kanjidict, which is also the source of the 10,000+ Chinese characters that are now a part of Kotoba.
Back in the days of MySQL 4, UTF-8 was a relatively straightforward thing. You declared the charset to be utf-8 and behold! So it was. Fast-forward to today, and ensuring your database actually supports UTF-8 is not so straightforward.
I had noticed some issues with Kotoba that I thought revolved around collation (the linguistic strategy a database uses to sort strings), as evidenced by Japanese words not sorting in their natural order (e.g. あいうえお等). I just chalked it up to a simple misconfiguration. However, in developing character (read: Kanji!) functionality this past week (see the next post!), I noticed some more aberrant behavior. In particular, when searching for kanji I would get false positives. For example, if I searched for 会 I also got a few other kanji: a linguistically nonsensical result. I started to dig a bit deeper and, to my disappointment, found that much of the underlying data in Kotoba was corrupted.
The issue, at its core, is that while I had configured the database for UTF-8, I had missed a few spots, and those gaps allowed the database to store non-UTF-8 characters: a very subtle form of 文字化け (mojibake, garbled text).
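To make the failure mode concrete, here is a minimal sketch of one common way this kind of corruption arises: UTF-8 bytes are read as if they were Latin-1 and then stored again as UTF-8. This is an assumption for illustration; I have not established that Kotoba's data took exactly this path. The helper name `misread_as_latin1` is made up, and the code assumes Ruby 1.9 or later (newer than the Ruby that Rails 2.2 typically runs on).

```ruby
# Assumes Ruby 1.9+ (String#encode and String#force_encoding).
# Hypothetical helper: pretend the UTF-8 bytes of `text` were read by
# something that believed they were Latin-1 (ISO-8859-1), and were then
# written back out as UTF-8.
def misread_as_latin1(text)
  text.encode("UTF-8").force_encoding("ISO-8859-1").encode("UTF-8")
end

kanji   = "\u4F1A"                  # the kanji 会
garbled = misread_as_latin1(kanji)

puts kanji.bytes.map { |b| format("%02X", b) }.join(" ")  # E4 BC 9A
puts garbled.length                                       # 3 characters instead of 1
puts garbled.bytesize                                     # 6 bytes instead of 3
```

Plain ASCII passes through this round trip unchanged, which is part of why the problem can go unnoticed until non-ASCII text such as kanji is involved.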
The existing literature on the topic, including possible fixes [here, here, here, here, here, here, horror story here, here], encourages configuring the MySQL daemon (mysqld) to use skip-character-set-client-handshake. This effectively ignores the character set the client requests during the handshake and forces communication to use utf-8. This can be configured as below in your MySQL my.cnf file:
[client]
default-character-set = utf8

[mysqld]
default-character-set = utf8
skip-character-set-client-handshake
character-set-server = utf8
collation-server = utf8_general_ci
init-connect = SET NAMES utf8
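One way to find the spots that were missed is to look at the output of the MySQL statement `SHOW VARIABLES LIKE 'character_set%'` and flag anything that is not utf8. The sketch below is only an illustration: `non_utf8_settings` is a hypothetical helper, and the sample rows are made-up values, not taken from Kotoba's server.

```ruby
# Settings that are normally "binary" and need no attention.
EXPECTED_BINARY = %w[character_set_filesystem].freeze

# Hypothetical helper: given [name, value] rows shaped like the output of
#   SHOW VARIABLES LIKE 'character_set%'
# return the rows whose value is not some flavor of utf8.
def non_utf8_settings(rows)
  rows.reject do |name, value|
    EXPECTED_BINARY.include?(name) || value.to_s.start_with?("utf8")
  end
end

# Illustrative values only. In a Rails app the rows could plausibly come
# from ActiveRecord::Base.connection.select_rows (an assumption here).
sample = [
  ["character_set_client",     "latin1"],
  ["character_set_connection", "latin1"],
  ["character_set_database",   "utf8"],
  ["character_set_filesystem", "binary"],
  ["character_set_results",    "latin1"],
  ["character_set_server",     "utf8"]
]

non_utf8_settings(sample).each { |name, value| puts "#{name} = #{value}" }
```

With the sample above, the client, connection, and results settings are reported, which is the pattern to look for when the server is configured for utf8 but the connection is not.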
However, you may not have access to configure your MySQL server, as is currently the case with Kotoba’s shared hosting. So how does one get around this problem? In my case, by ensuring that my client (Rails 2.2) strictly enforces utf-8:
Update database.yml to include encoding: utf8, for example:
development:
  adapter: mysql
  database: DATABASE
  username: USERNAME
  password: PASSWORD
  socket: /tmp/mysql.sock
  encoding: utf8
  timeout: 5000
Add to environment.rb the line:
$KCODE = "UTF8"
Add to application.rb the following:
before_filter :set_charset

def set_charset
  headers["Content-Type"] = "text/html; charset=UTF-8"
end
And if you really are paranoid, and you should be, then I would also add (blatant copy of Artuaz’s set_names_utf8)
module MyAppUtf8
  class SetNamesUtf8
    def self.filter(controller)
      suppress(ActiveRecord::StatementInvalid) do
        ActiveRecord::Base.connection.execute 'SET NAMES UTF8'
      end
      true
    end
  end
end
And also update init.rb to include:
##
## Ensure that all SQL queries use UTF8
##
ActionController::Base.send :prepend_before_filter, MyAppUtf8::SetNamesUtf8

suppress(ActiveRecord::StatementInvalid) do
  ActiveRecord::Base.connection.execute 'SET NAMES UTF8'
end
For a while I tried to convert the data from its original charset to utf8; however, none of the tools I used (including charguess) helped me recover the data. In the end, I dumped all the corrupted tables and re-imported them. All data is now sufficiently remedied. Lesson learned? Check!
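For anyone facing the same cleanup, one common repair attempt is to reverse a suspected Latin-1 misreading and keep the result only if it is valid UTF-8. The sketch below is not the conversion I ran; it assumes Ruby 1.9 or later, assumes the damage was a single Latin-1 round trip, and uses a made-up helper name, `undo_latin1_misread`. MySQL's latin1 also differs slightly from ISO-8859-1, so real data may not come back this cleanly.

```ruby
# Assumes Ruby 1.9+ and damage caused by exactly one Latin-1 misreading.
# Hypothetical helper: returns the repaired string, or nil when the text
# cannot be explained by that kind of damage.
def undo_latin1_misread(text)
  candidate = text.encode("ISO-8859-1").force_encoding("UTF-8")
  candidate.valid_encoding? ? candidate : nil
rescue EncodingError
  nil # the text holds characters outside Latin-1, so this was not the cause
end

garbled = "\u00E4\u00BC\u009A"          # what 会 looks like after the misreading
repaired = undo_latin1_misread(garbled)
puts repaired == "\u4F1A"               # true

# A string that is already a healthy kanji is rejected rather than mangled.
puts undo_latin1_misread("\u4F1A").inspect  # nil
```

Note that pure ASCII passes through unchanged, so a check like this cannot tell whether an ASCII-only row was ever damaged. When too many rows come back as nil, re-importing from a clean source, as I ended up doing, is the safer route.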
Well, maybe sexy is too strong a word for most people. But if you are like me, then data modeling is sexy. And if that is the case then Kotoba can be particularly sexy.
One of the main objectives at this point in Kotoba’s development is ensuring the entity-relationships are modeled correctly. While no easy task, it is something that has been growing organically through trial and error and a fair bit of research into other projects.
Remember, while Kotoba is currently geared toward Japanese, one of the goals is to ensure, as best as possible, a model that can be more universal. That said, ironically, the most challenging task thus far has been to sufficiently normalize characters rather than words, due to the numerous language-specific character attributes that one might wish to track.
In an attempt to clarify where Kotoba stands, we have adorned Kotoba with a β (beta) moniker so that everyone knows what to expect. And by “what to expect” I mean Google’s definition of beta: usable but constantly growing in functionality.
Kotoba wants to be your friend!
In what will be an interesting experiment in social networking, Kotoba now befriends other Twitter users who make posts that match a predefined list of words or the word of the moment. The idea is for Kotoba to reach out to like-minded Twitter users in hopes they are interested in learning a new language. Let me know what you think: for, against, or indifferent.