[Part 2: Translated Text as a Data Type in PostgreSQL | Prototype Postgres Library]

The most obvious task when making any application multi-lingual is to make the UI translatable; there are absolutely tons of options for this, libraries in all sorts of languages, recipes, discussions, etc. But for many applications there’s another task, just as important, which is taking a database, and allowing someone to translate the data inside that.

There are various approaches to “database i18n” too – generally involving re-designing your schema in such a way that the translations are to some extent part of the “normalised” data. But these tend to make the schema somewhat unwieldy, and writing ad hoc queries and reports becomes tricky. So it occurred to me that if you treat i18n not as a schema problem, but as a data type problem, you could save some of that pain.

The Problem

Just to be clear, let’s have an example we can work through: if I have a table of countries in my database, and one of the columns is the name of the country, that name isn’t going to be the same in all the languages of my new super-duper internationalised system. So let’s say we start with this:

Table “country”
id name currency
1 United Kingdom GBP
2 Germany EUR
3 France EUR

Existing Solutions

I’ve come upon various approaches to the database i18n problem, but they all group down to two basic ideas: using one big table, and treating i18n as a problem of normalization.

One Big Table

The basic idea of the “one big table” solution is to extract all the text that needs translating, and put it in one big list, ready for translation. This is, after all, how UI i18n usually works, so why not just treat your data as “some more translation entries”? As an example, this appears to be (I haven’t actually experimented with it) what the built-in i18n system in CakePHP will do for you.

So, for our example, we create a new “text” table which contains all the text we want to translate. The primary key on this table will be a combination of some kind of unique ID, plus a language code; something like this:

Table “text”
id language_code content
42 en United Kingdom
42 fr Royaume Uni
42 de Vereinigte Königreich

(Here, item 42 is “the name of the UK”, and we have translations for that item in English, French, and German.)

Now probably the simplest way to use this is to replace the “name” column in our original “country” table with a reference to this new table – giving us something like this:

Table “country”
id name_text_id currency
1 42 GBP
2 43 EUR
3 44 EUR

This looks like nicely normalised data – you can set up a foreign key from the “name_text_id” to the “text” table, and so forth. But what kind of a table is “text”? It has nothing to do with your schema – it’s not really representing anything; if you think about it, you could just as well have a table called “numbers” as well, and reference it whenever you want a numeric value…

A slightly neater variant of this is to replace the “id” in the “text” column with some reference back to the table and column being translated. Maybe we already have a table holding all our UI messages, appropriately categorised. So, with a bit of extra normalisation, we might end up with this:

Table “country”
id currency
1 GBP
2 EUR
3 EUR
Table “text_item”
id type category section item
42 database country name 1
43 database country name 2
Table “text_translation”
text_item_id language_code content
42 en United Kingdom
42 fr Royaume Uni
42 de Vereinigte Königreich

This is quite attractive – all your translations are in one place, so writing an interface to keep them up to date should be quite easy – but as well as the meaninglessness of the “text” table, we now have another oddity, which I call the “anorexic table problem“.

The problem is that the “country” table has given away most of its data to the “text_translation” table – without joining through, you can only learn the currency. And this is a mild case: you could start off with a table with 5 columns, decide all of them except the primary key need translating, and be left with a glorified sequence – a table with nothing but a primary key.

This isn’t unique to i18n – indeed many extreme forms of normalisation pretty much require you to create such tables – but in practice, it’s a nuisance, and a downside of a lot of i18n schemas.

I18n as N11n

The other set of i18n schemas I’ve come across can be broadly called “normalisation-based solutions” – they take the existing schema, add multiple languages to it, and apply the rules of normalisation to the result.

There are various ways this can end up looking, but the idea is to treat the translations on each table as you would any 1-to-many relationship: with a new table. So for our example, we might create this:

Table “country”
id currency
1 GBP
2 EUR
3 EUR
Table “country_names”
country_id language_code content
1 en United Kingdom
1 fr Royaume Uni
1 de Vereinigte Königreich

Now, this example only has one column to translate, but I’ve called my new table “country_name”, implying that if there were multiple columns to translate, I would have created multiple new tables. So now not only do I have an “anemic” country table, I have a huge number of new tables cluttering up my schema.

You don’t have to go this far – you could instead create a “country_i18n” table, with columns for each of the bits of country information you need to store, which would at least keep this explosion down a bit. This seems to be (again, I’m afraid I haven’t experimented) what the Symfony framework offers you.

There are a few other variations on this, as well – I came upon this article with various examples (demonstrated, oddly, in terms of XML, not a DBMS). One worth mentioning is the idea of base + translations, which avoids the “anorexic table problem” by leaving the base language in the original table, and only the translations in a new structure. This is certainly an interesting compromise – but it requires you to formally pick a single “base” language, and I suspect the inconsistency of where data is stored would make writing maintenance code trickier.

Translations as Sub-Atomic Particles

Both of the approaches I’ve mentioned, and all the variations within them, frame the underlying problem in the same terms: translations are additional pieces of data, and your schema needs to accommodate this data. But there is actually a different way of looking at the problem, and that is that the translations are all representations of the same value – and the act of translating them is about selecting a representation, not filtering and combining additional data.

In other words, translated text is a data type.

Now, one of the principles often cited in relation to databases is that of atomicity – values in the database should be atomic, with structure defined at a relational level, not within each value. But as many people have pointed out – see this blog post for example – it is quite hard to pin down just what atomic means.

If you’re not allowed any structure, you shouldn’t really be storing strings – they are, after all, an ordered list of characters. But, crucially, that has no bearing on the data you are trying to represent – you might want to model a “country”, but you don’t want to model a “name”, you just want to store and use it. And – in my view – having multiple translations of that name needn’t change that fact. As far as your schema is concerned, all the translations of a particular name form a single atom, and the individual strings are just a part of that value.

So just as you don’t use the schema to link together the letters ‘G’, ‘e’, ‘r’, ‘m’, ‘a’, ‘n’, and ‘y’, you shouldn’t use it to link together the translations ‘Germany’, ‘Allemagne’, and ‘Deutschland’. Instead, you select a data type that represents the kind of data you are storing.

The Atomic I18n Solution

So if we go back to our original example, and look at what it would mean to treat i18n as a problem of data types, not schema design, what would it look like? Obviously, the table still has to change, but not by much – we just change the type of the “name” column to our new “translated text” type. Once we’ve stored our extra translations, we end up with something a bit like this:

Table “country”
id name currency
1 {“en”:”United Kingdom”, “fr”:”Royaume Uni”, “de”: “Vereinigte Königreich”} GBP
2 {“en”:”Germany”, “fr”:”Allemagne”, “de”:”Deutschland”} EUR
3 {“en”:”France”, “fr”:”France”, “de”:”Frankreich”} EUR

Note that we don’t need to create any new tables – our schema is effectively unchanged – and we’ve avoided the “anorexic table problem”. We still have to update any existing queries to select the name in a particular language, rather than all of them at once, but we don’t have to pull more tables into the query just to see what we’re doing.

Of course, this relies on a “translated text” type being available in our DBMS, along with all the functions we’ll need to interact with it.

This post has grown somewhat longer than I expected, so I have split it into two parts; in the next part, I will go into the details of my prototype implementation, which aims to create a user-friendly translated_text type from standard features of PostgreSQL.