{"id":50,"date":"2009-11-23T00:10:18","date_gmt":"2009-11-23T00:10:18","guid":{"rendered":"https:\/\/rwec.co.uk\/blog\/?p=50"},"modified":"2019-08-12T16:10:09","modified_gmt":"2019-08-12T15:10:09","slug":"atomic-translations-part-1","status":"publish","type":"post","link":"https:\/\/rwec.co.uk\/blog\/2009\/11\/atomic-translations-part-1\/","title":{"rendered":"Atomic Translations &#8211; Part 1 &#8211; Database i18n as a Data Type Problem"},"content":{"rendered":"<p>[<a href=\"https:\/\/rwec.co.uk\/blog\/2009\/12\/atomic-translations-part-2\/\">Part 2: Translated Text as a Data Type in PostgreSQL<\/a> | <a href=\"https:\/\/rwec.co.uk\/pg-atom\/i18n.html\">Prototype Postgres Library<\/a>]<\/p>\n<p>The most obvious task when making any application multi-lingual is to make the <abbr title=\"user interface\">UI<\/abbr> translatable; there are absolutely <strong>tons<\/strong> of options for this, libraries in all sorts of languages, recipes, discussions, etc. But for many applications there&#8217;s another task, just as important, which is taking a database, and allowing someone to translate the data inside that.<\/p>\n<p>There are various approaches to &#8220;database <abbr title=\"internationalization\">i18n<\/abbr>&#8221; too &#8211; generally involving re-designing your schema in such a way that the translations are to some extent part of the &#8220;normalised&#8221; data. But these tend to make the schema somewhat unwieldy, and writing ad hoc queries and reports becomes tricky. So it occurred to me that if you treat i18n not as a <strong>schema<\/strong> problem, but as a <strong>data type<\/strong> problem, you could save some of that pain.<\/p>\n<p><!--more--><\/p>\n<h2>The Problem<\/h2>\n<p>Just to be clear, let&#8217;s have an example we can work through: if I have a table of countries in my database, and one of the columns is the name of the country, that name isn&#8217;t going to be the same in all the languages of my new super-duper internationalised system. So let&#8217;s say we start with this:<\/p>\n<table border=\"1\">\n<thead>\n<tr>\n<th style=\"align: center;\" colspan=\"3\">Table &#8220;country&#8221;<\/th>\n<\/tr>\n<tr>\n<th>id<\/th>\n<th>name<\/th>\n<th>currency<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>1<\/td>\n<td>United Kingdom<\/td>\n<td>GBP<\/td>\n<\/tr>\n<tr>\n<td>2<\/td>\n<td>Germany<\/td>\n<td>EUR<\/td>\n<\/tr>\n<tr>\n<td>3<\/td>\n<td>France<\/td>\n<td>EUR<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h2>Existing Solutions<\/h2>\n<p>I&#8217;ve come upon various approaches to the database <abbr title=\"internationalization\">i18n<\/abbr> problem, but they all group down to two basic ideas: using <strong>one big table<\/strong>, and treating <strong><abbr title=\"internationalization\">i18n<\/abbr> as a problem of normalization<\/strong>.<\/p>\n<h3>One Big Table<\/h3>\n<p>The basic idea of the &#8220;one big table&#8221; solution is to extract all the text that needs translating, and put it in one big list, ready for translation. This is, after all, how <abbr title=\"user interface\">UI<\/abbr> <abbr title=\"internationalization\">i18n<\/abbr> usually works, so why not just treat your data as &#8220;some more translation entries&#8221;? As an example, this appears to be (I haven&#8217;t actually experimented with it) what the <a href=\"http:\/\/book.cakephp.org\/view\/92\/Translate\">built-in i18n system in CakePHP<\/a> will do for you.<\/p>\n<p>So, for our example, we create a new &#8220;text&#8221; table which contains all the text we want to translate. The primary key on this table will be a combination of some kind of unique ID, plus a language code; something like this:<\/p>\n<table border=\"1\">\n<thead>\n<tr>\n<th style=\"align: center;\" colspan=\"3\">Table &#8220;text&#8221;<\/th>\n<\/tr>\n<tr>\n<th>id<\/th>\n<th>language_code<\/th>\n<th>content<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>&#8230;<\/td>\n<td>&#8230;<\/td>\n<td>&#8230;<\/td>\n<\/tr>\n<tr>\n<td>42<\/td>\n<td>en<\/td>\n<td>United Kingdom<\/td>\n<\/tr>\n<tr>\n<td>42<\/td>\n<td>fr<\/td>\n<td>Royaume Uni<\/td>\n<\/tr>\n<tr>\n<td>42<\/td>\n<td>de<\/td>\n<td>Vereinigte K\u00f6nigreich<\/td>\n<\/tr>\n<tr>\n<td>&#8230;<\/td>\n<td>&#8230;<\/td>\n<td>&#8230;<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>(Here, item 42 is &#8220;the name of the UK&#8221;, and we have translations for that item in English, French, and German.)<\/p>\n<p>Now probably the simplest way to use this is to replace the &#8220;name&#8221; column in our original &#8220;country&#8221; table with a reference to this new table &#8211; giving us something like this:<\/p>\n<table border=\"1\">\n<thead>\n<tr>\n<th style=\"align: center;\" colspan=\"3\">Table &#8220;country&#8221;<\/th>\n<\/tr>\n<tr>\n<th>id<\/th>\n<th>name_text_id<\/th>\n<th>currency<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>1<\/td>\n<td>42<\/td>\n<td>GBP<\/td>\n<\/tr>\n<tr>\n<td>2<\/td>\n<td>43<\/td>\n<td>EUR<\/td>\n<\/tr>\n<tr>\n<td>3<\/td>\n<td>44<\/td>\n<td>EUR<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>This looks like nicely normalised data &#8211; you can set up a foreign key from the &#8220;name_text_id&#8221; to the &#8220;text&#8221; table, and so forth. But what kind of a table is &#8220;text&#8221;? It has nothing to do with your schema &#8211; it&#8217;s not really <em>representing anything<\/em>; if you think about it, you could just as well have a table called &#8220;numbers&#8221; as well, and reference it whenever you want a numeric value&#8230;<\/p>\n<p>A slightly neater variant of this is to replace the &#8220;id&#8221; in the &#8220;text&#8221; column with some reference <em>back<\/em> to the table and column being translated. Maybe we already have a table holding all our UI messages, appropriately categorised. So, with a bit of extra normalisation, we might end up with this:<\/p>\n<table border=\"1\">\n<thead>\n<tr>\n<th style=\"align: center;\" colspan=\"2\">Table &#8220;country&#8221;<\/th>\n<\/tr>\n<tr>\n<th>id<\/th>\n<th>currency<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>1<\/td>\n<td>GBP<\/td>\n<\/tr>\n<tr>\n<td>2<\/td>\n<td>EUR<\/td>\n<\/tr>\n<tr>\n<td>3<\/td>\n<td>EUR<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<table border=\"1\">\n<thead>\n<tr>\n<th style=\"align: center;\" colspan=\"5\">Table &#8220;text_item&#8221;<\/th>\n<\/tr>\n<tr>\n<th>id<\/th>\n<th>type<\/th>\n<th>category<\/th>\n<th>section<\/th>\n<th>item<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>&#8230;<\/td>\n<td>&#8230;<\/td>\n<td>&#8230;<\/td>\n<td>&#8230;<\/td>\n<td>&#8230;<\/td>\n<\/tr>\n<tr>\n<td>42<\/td>\n<td>database<\/td>\n<td>country<\/td>\n<td>name<\/td>\n<td>1<\/td>\n<\/tr>\n<tr>\n<td>43<\/td>\n<td>database<\/td>\n<td>country<\/td>\n<td>name<\/td>\n<td>2<\/td>\n<\/tr>\n<tr>\n<td>&#8230;<\/td>\n<td>&#8230;<\/td>\n<td>&#8230;<\/td>\n<td>&#8230;<\/td>\n<td>&#8230;<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<table border=\"1\">\n<thead>\n<tr>\n<th style=\"align: center;\" colspan=\"3\">Table &#8220;text_translation&#8221;<\/th>\n<\/tr>\n<tr>\n<th>text_item_id<\/th>\n<th>language_code<\/th>\n<th>content<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>&#8230;<\/td>\n<td>&#8230;<\/td>\n<td>&#8230;<\/td>\n<\/tr>\n<tr>\n<td>42<\/td>\n<td>en<\/td>\n<td>United Kingdom<\/td>\n<\/tr>\n<tr>\n<td>42<\/td>\n<td>fr<\/td>\n<td>Royaume Uni<\/td>\n<\/tr>\n<tr>\n<td>42<\/td>\n<td>de<\/td>\n<td>Vereinigte K\u00f6nigreich<\/td>\n<\/tr>\n<tr>\n<td>&#8230;<\/td>\n<td>&#8230;<\/td>\n<td>&#8230;<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>This is quite attractive &#8211; all your translations are in one place, so writing an interface to keep them up to date should be quite easy &#8211; but as well as the meaninglessness of the &#8220;text&#8221; table, we now have another oddity, which I call the &#8220;<strong>anorexic table problem<\/strong>&#8220;.<\/p>\n<p>The problem is that the &#8220;country&#8221; table has given away most of its data to the &#8220;text_translation&#8221; table &#8211; without joining through, you can only learn the currency. And this is a mild case: you could start off with a table with 5 columns, decide all of them except the primary key need translating, and be left with a glorified sequence &#8211; a table with nothing but a primary key.<\/p>\n<p>This isn&#8217;t unique to i18n &#8211; indeed many extreme forms of normalisation pretty much require you to create such tables &#8211; but in practice, it&#8217;s a nuisance, and a downside of a lot of i18n schemas.<\/p>\n<h3>I18n as N11n<\/h3>\n<p>The other set of i18n schemas I&#8217;ve come across can be broadly called &#8220;normalisation-based solutions&#8221; &#8211; they take the existing schema, add multiple languages to it, and apply the rules of normalisation to the result.<\/p>\n<p>There are various ways this can end up looking, but the idea is to treat the translations on each table as you would any 1-to-many relationship: with a new table. So for our example, we might create this:<\/p>\n<table border=\"1\">\n<thead>\n<tr>\n<th style=\"align: center;\" colspan=\"2\">Table &#8220;country&#8221;<\/th>\n<\/tr>\n<tr>\n<th>id<\/th>\n<th>currency<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>1<\/td>\n<td>GBP<\/td>\n<\/tr>\n<tr>\n<td>2<\/td>\n<td>EUR<\/td>\n<\/tr>\n<tr>\n<td>3<\/td>\n<td>EUR<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<table border=\"1\">\n<thead>\n<tr>\n<th style=\"align: center;\" colspan=\"3\">Table &#8220;country_names&#8221;<\/th>\n<\/tr>\n<tr>\n<th>country_id<\/th>\n<th>language_code<\/th>\n<th>content<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>&#8230;<\/td>\n<td>&#8230;<\/td>\n<td>&#8230;<\/td>\n<\/tr>\n<tr>\n<td>1<\/td>\n<td>en<\/td>\n<td>United Kingdom<\/td>\n<\/tr>\n<tr>\n<td>1<\/td>\n<td>fr<\/td>\n<td>Royaume Uni<\/td>\n<\/tr>\n<tr>\n<td>1<\/td>\n<td>de<\/td>\n<td>Vereinigte K\u00f6nigreich<\/td>\n<\/tr>\n<tr>\n<td>&#8230;<\/td>\n<td>&#8230;<\/td>\n<td>&#8230;<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Now, this example only has one column to translate, but I&#8217;ve called my new table &#8220;country_name&#8221;, implying that if there were multiple columns to translate, I would have created multiple new tables. So now not only do I have an &#8220;anemic&#8221; <em>country<\/em> table, I have a huge number of new tables cluttering up my schema.<\/p>\n<p>You don&#8217;t have to go this far &#8211; you could instead create a &#8220;country_i18n&#8221; table, with columns for each of the bits of country information you need to store, which would at least keep this explosion down a bit. This seems to be (again, I&#8217;m afraid I haven&#8217;t experimented) <a href=\"http:\/\/www.symfony-project.org\/book\/1_0\/13-I18n-and-L10n\">what the Symfony framework offers you<\/a>.<\/p>\n<p>There are a few other variations on this, as well &#8211; I came upon <a href=\"http:\/\/scratchpad.wikia.com\/wiki\/Multilingual_Data_Structure\">this article with various examples<\/a> (demonstrated, oddly, in terms of XML, not a DBMS). One worth mentioning is the idea of <strong>base + translations<\/strong>, which avoids the &#8220;anorexic table problem&#8221; by leaving the base language in the original table, and only the translations in a new structure. This is certainly an interesting compromise &#8211; but it requires you to formally pick a single &#8220;base&#8221; language, and I suspect the inconsistency of where data is stored would make writing maintenance code trickier.<\/p>\n<h2>Translations as Sub-Atomic Particles<\/h2>\n<p>Both of the approaches I&#8217;ve mentioned, and all the variations within them, frame the underlying <em>problem<\/em> in the same terms: translations are additional pieces of data, and your schema needs to accommodate this data. But there is actually a different way of looking at the problem, and that is that the translations are all <em>representations<\/em> of the same <em>value<\/em> &#8211; and the act of translating them is about selecting a <em>representation<\/em>, not filtering and combining <em>additional data<\/em>.<\/p>\n<p>In other words, <strong>translated text is a data type<\/strong>.<\/p>\n<p>Now, one of the principles often cited in relation to databases is that of <em>atomicity<\/em> &#8211; values in the database should be <em>atomic<\/em>, with structure defined at a relational level, not within each value. But as many people have pointed out &#8211; see <a href=\"http:\/\/thoughts.j-davis.com\/2009\/09\/30\/choosing-data-types\/\">this blog post<\/a> for example &#8211; it is quite hard to pin down just what <em>atomic<\/em> means.<\/p>\n<p>If you&#8217;re not allowed <em>any<\/em> structure, you shouldn&#8217;t really be storing strings &#8211; they are, after all, an ordered list of characters. But, crucially, that has no bearing on the data you are trying to <em>represent <\/em>&#8211; you might want to model a &#8220;country&#8221;, but you don&#8217;t want to <em>model <\/em>a &#8220;name&#8221;, you just want to <em>store <\/em>and <em>use <\/em>it. And &#8211; in my view &#8211; having multiple translations of that name needn&#8217;t change that fact. As far as your schema is concerned, all the translations of a particular name form a single atom, and the individual strings are just a part of that value.<\/p>\n<p>So just as you don&#8217;t use the <strong>schema<\/strong> to link together the letters &#8216;G&#8217;, &#8216;e&#8217;, &#8216;r&#8217;, &#8216;m&#8217;, &#8216;a&#8217;, &#8216;n&#8217;, and &#8216;y&#8217;, you shouldn&#8217;t use it to link together the translations &#8216;Germany&#8217;, &#8216;Allemagne&#8217;, and &#8216;Deutschland&#8217;. Instead, you select a <strong>data type<\/strong> that represents the kind of data you are storing.<\/p>\n<h2>The Atomic I18n Solution<\/h2>\n<p>So if we go back to our original example, and look at what it would mean to treat i18n as a problem of data types, not schema design, what would it look like? Obviously, the table still has to change, but not by much &#8211; we just change the type of the &#8220;name&#8221; column to our new &#8220;translated text&#8221; type. Once we&#8217;ve stored our extra translations, we end up with something a bit like this:<\/p>\n<table border=\"1\">\n<thead>\n<tr>\n<th colspan=\"3\">Table &#8220;country&#8221;<\/th>\n<\/tr>\n<tr>\n<th>id<\/th>\n<th>name<\/th>\n<th>currency<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>1<\/td>\n<td>{&#8220;en&#8221;:&#8221;United Kingdom&#8221;, &#8220;fr&#8221;:&#8221;Royaume Uni&#8221;, &#8220;de&#8221;: &#8220;Vereinigte K\u00f6nigreich&#8221;}<\/td>\n<td>GBP<\/td>\n<\/tr>\n<tr>\n<td>2<\/td>\n<td>{&#8220;en&#8221;:&#8221;Germany&#8221;, &#8220;fr&#8221;:&#8221;Allemagne&#8221;, &#8220;de&#8221;:&#8221;Deutschland&#8221;}<\/td>\n<td>EUR<\/td>\n<\/tr>\n<tr>\n<td>3<\/td>\n<td>{&#8220;en&#8221;:&#8221;France&#8221;, &#8220;fr&#8221;:&#8221;France&#8221;, &#8220;de&#8221;:&#8221;Frankreich&#8221;}<\/td>\n<td>EUR<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Note that we don&#8217;t need to create any new tables &#8211; our schema is effectively unchanged &#8211; and we&#8217;ve avoided the &#8220;anorexic table problem&#8221;. We still have to update any existing queries to select the name in a particular language, rather than all of them at once, but we don&#8217;t have to pull more tables into the query just to see what we&#8217;re doing.<\/p>\n<p>Of course, this relies on a &#8220;translated text&#8221; type being available in our DBMS, along with all the functions we&#8217;ll need to interact with it.<\/p>\n<p><em>This post has grown somewhat longer than I expected, so I have split it into two parts; in <a href=\"https:\/\/rwec.co.uk\/blog\/2009\/12\/atomic-translations-part-2\/\">the next part<\/a>, I will go into the details of my prototype implementation, which aims to create a user-friendly translated_text type from standard features of PostgreSQL.<\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>[Part 2: Translated Text as a Data Type in PostgreSQL | Prototype Postgres Library] The most obvious task when making any application multi-lingual is to make the UI translatable; there are absolutely tons of options for this, libraries in all sorts of languages, recipes, discussions, etc. But for many applications there&#8217;s another task, just as [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[51,46,49,50,47,48],"class_list":["post-50","post","type-post","status-publish","format-standard","hentry","category-uncategorized","tag-atomicity","tag-database","tag-i18n","tag-internationalization","tag-postgres","tag-postresql","post-preview"],"_links":{"self":[{"href":"https:\/\/rwec.co.uk\/blog\/wp-json\/wp\/v2\/posts\/50","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/rwec.co.uk\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/rwec.co.uk\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/rwec.co.uk\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/rwec.co.uk\/blog\/wp-json\/wp\/v2\/comments?post=50"}],"version-history":[{"count":23,"href":"https:\/\/rwec.co.uk\/blog\/wp-json\/wp\/v2\/posts\/50\/revisions"}],"predecessor-version":[{"id":373,"href":"https:\/\/rwec.co.uk\/blog\/wp-json\/wp\/v2\/posts\/50\/revisions\/373"}],"wp:attachment":[{"href":"https:\/\/rwec.co.uk\/blog\/wp-json\/wp\/v2\/media?parent=50"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/rwec.co.uk\/blog\/wp-json\/wp\/v2\/categories?post=50"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/rwec.co.uk\/blog\/wp-json\/wp\/v2\/tags?post=50"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}