Houston, we have 漢字 文字 Characters!

Kotoba achieved another important milestone this evening. Characters! As alluded to in previous posts, my work has been focused on developing a language-agnostic model of language characters. The work is heavily inspired by Kanjidict, which is also the source of the 10,000+ Chinese characters that are now a part of Kotoba.

MySQL + UTF-8 = Not So Obvious

Back in the days of MySQL 4, UTF-8 was a relatively straightforward thing. You declared the charset to be utf-8 and behold, so it was. Fast-forward to today, and ensuring your database actually supports UTF-8 is not so straightforward.

I had noticed some issues with Kotoba that I thought revolved around collation (the linguistic strategy a database uses to sort strings), as evidenced by Japanese words not sorting in their natural order (e.g. あいうえお等). I had just chalked it up to a simple misconfiguration. However, in developing character (read Kanji!) functionality this past week (see the next post!), I noticed some more aberrant behavior. In particular, when searching for kanji I would get false positives. For example, if I searched for 会 I also got a few other kanji: a linguistically nonsensical result. I started to dig a bit deeper and, to my disappointment, found that much of the underlying data in Kotoba was corrupted.

The issue, at its core, is that while I had configured the database for UTF-8, I had missed a few spots, and those gaps allowed the database to store non-UTF-8 characters: a very subtle form of 文字化け (mojibake, or garbled characters).
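To make the failure mode concrete, here is a minimal sketch of how this kind of garbling can arise when UTF-8 bytes are read as a single-byte charset. It assumes Ruby 1.9 or later (newer than the Ruby that Rails 2.2 typically ran on) and uses ISO-8859-1 as a stand-in for whatever charset the connection actually assumed; the helper name misread_as_latin1 is made up for illustration.

```ruby
# Hypothetical helper: reinterpret the raw UTF-8 bytes of a string as
# ISO-8859-1, then transcode the result back to UTF-8. This mimics a
# client or server that wrongly believes the incoming bytes are Latin-1.
def misread_as_latin1(text)
  text.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

original = "会"
garbled  = misread_as_latin1(original)

# The single kanji is three bytes in UTF-8 ...
puts original.bytes.map { |b| format("%02X", b) }.join(" ")
# ... and each of those bytes becomes its own "character" once misread.
puts garbled.length
puts garbled.bytesize
```

One kanji turns into three unrelated Latin-1 characters, and the stored value no longer compares or sorts the way the original text would.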

The existing literature on the topic, including a number of possible fixes and at least one horror story, encourages configuring the MySQL daemon (mysqld) to use skip-character-set-client-handshake. This makes the server ignore the character set the client requests during the handshake and forces communication to use UTF-8. It can be configured in your MySQL my.cnf file along these lines:


default-character-set = utf8
character-set-server = utf8
collation-server = utf8_general_ci
init-connect = SET NAMES utf8

However, you may not have access to configure your MySQL server, as is currently the case with Kotoba’s shared hosting. So how does one get around this problem? In my case, by ensuring my client (Rails 2.2) strictly enforces UTF-8.

Modify database.yml to include encoding: utf8, for example:

  adapter: mysql
  database: DATABASE
  username: USERNAME
  password: PASSWORD
  socket: /tmp/mysql.sock
  encoding: utf8
  timeout: 5000

Add to environment.rb the line:


Add to application.rb the following:

  before_filter :set_charset

  # Declare UTF-8 in the Content-Type header of every response
  def set_charset
    headers["Content-Type"] = "text/html; charset=UTF-8"
  end

And if you really are paranoid, and you should be, then I would also add lib/my_app_utf8.rb (a blatant copy of Artuaz’s set_names_utf8):

module MyAppUtf8
  class SetNamesUtf8
    # Runs as a before filter so each request's connection speaks UTF-8
    def self.filter(controller)
      suppress(ActiveRecord::StatementInvalid) do
        ActiveRecord::Base.connection.execute 'SET NAMES UTF8'
      end
    end
  end
end

And also update init.rb to include:

## Ensure that all SQL queries use UTF8
ActionController::Base.send :prepend_before_filter, MyAppUtf8::SetNamesUtf8
suppress(ActiveRecord::StatementInvalid) do
  ActiveRecord::Base.connection.execute 'SET NAMES UTF8'
end
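
With all of that in place, it is worth confirming what the connection actually negotiated. The sketch below is not part of Kotoba; it assumes Ruby 1.9 or later, and the helper name charset_mismatches and the sample values are invented for illustration. It takes the name/value pairs that MySQL reports for SHOW VARIABLES LIKE 'character_set%' and lists any that are not utf8.

```ruby
# Charset variables that MySQL reports for a session and server.
CHARSET_VARIABLES = %w[
  character_set_client
  character_set_connection
  character_set_results
  character_set_database
  character_set_server
].freeze

# Hypothetical helper: returns a hash of the variables that are missing
# (nil) or set to something other than the expected charset.
def charset_mismatches(variables, expected = "utf8")
  CHARSET_VARIABLES.each_with_object({}) do |name, mismatches|
    value = variables[name]
    mismatches[name] = value unless value == expected
  end
end

# Invented sample: everything is utf8 except the client charset.
sample = {
  "character_set_client"     => "latin1",
  "character_set_connection" => "utf8",
  "character_set_results"    => "utf8",
  "character_set_database"   => "utf8",
  "character_set_server"     => "utf8"
}

p charset_mismatches(sample)
```

In a Rails application the input hash could presumably be built from the rows returned by running that SHOW VARIABLES statement through ActiveRecord::Base.connection; an empty result from the helper would suggest the settings took effect.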

For a while I tried to convert the data from its original charset to utf8; however, none of the tools I used (including iconv and charguess) helped me recover the data. In the end, I dumped all the corrupted tables and re-imported them. All data is now sufficiently remedied. Lesson learned? Check!
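
For the curious, the kind of repair one hopes for in this situation can be sketched in a few lines of Ruby. This is only an illustration under assumptions: Ruby 1.9 or later, ISO-8859-1 as the charset the bytes were mistaken for, and a made-up helper name try_repair. It is not what iconv or charguess do internally, and it only works when no bytes were lost along the way.

```ruby
# Hypothetical helper: undo one round of "UTF-8 bytes misread as Latin-1".
# Returns the repaired string, or nil when the text cannot be mapped back
# to bytes that form valid UTF-8.
def try_repair(garbled)
  # Map each character back to the single byte it was misread from.
  bytes = garbled.encode(Encoding::ISO_8859_1)
  # Reinterpret those bytes as the UTF-8 they originally were.
  candidate = bytes.force_encoding(Encoding::UTF_8)
  candidate.valid_encoding? ? candidate : nil
rescue Encoding::UndefinedConversionError
  # The text contains characters that never came from a Latin-1 byte.
  nil
end

# Simulate the corruption, then attempt the repair.
garbled = "会".dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
p try_repair(garbled) == "会"
p try_repair("日本")
```

When the stored bytes have been truncated or replaced during an earlier conversion, no mapping like this can bring them back, which is consistent with re-importing from the source being the practical way out.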

Sexy Models

Well, maybe sexy is too strong a word for most people. But if you are like me, then data modeling is sexy. And if that is the case then Kotoba can be particularly sexy.

One of the main objectives at this point in Kotoba’s development is ensuring the entity-relationships are modeled correctly. This is no easy task; the model has been growing organically through trial and error and a fair bit of research into other projects.

Remember, while Kotoba is currently geared toward Japanese, one of the goals is to build something that is, as far as possible, more universal. Ironically, the most challenging task thus far has been normalizing characters rather than words, because of the numerous language-specific character attributes that one might wish to track.

Overview of Kotoba's entity-relationships circa March 2009

How Do You Spell β?

In an attempt to clarify where Kotoba is at, we have adorned Kotoba with a β (beta) moniker so that everyone knows what to expect. And by “what to expect” I mean Google’s definition of beta: usable but constantly growing in functionality.

How Kotoba says β in Japanese
How Kotoba says β in English

For the curious, I used Gimp and this great tutorial to help create the images.

Kotoba Making Friends

Kotoba wants to be your friend!

In what will be an interesting experiment in social networking, Kotoba now befriends other Twitter users whose posts match a predefined list of words or the word of the moment. The idea is for Kotoba to reach out to like-minded Twitter users in the hope that they are interested in learning a new language. Let me know what you think: for, against, or indifferent.