Rowan's World, et Cetera

Atomic Translations – Part 2 – Translated Text as a Data Type in PostgreSQL

by Rowan on 12 December, 2009

[Part 1: Database i18n as a Data Type Problem | Prototype Postgres Library]

In my last post (longer ago than I intended), I discussed the approaches I’ve come across to adding i18n to a database schema, and outlined my theory that the whole problem could be recast as one of data types. In order to explore how this would work in practice, I’ve put together a prototype “library” using only a standard install of PostgreSQL that shows how simple it could be, and why it makes sense.

Views and Functions

One of the things I haven’t really touched on with my examples so far is that you will not normally interact directly with all the tables or data created by any i18n solution. You don’t want to write double the code to perform the same query, just because it contains translated items, so you will generally try to wrap it up in a set of functions and/or views that provide a high level of abstraction.

What these functions look like will be closely related to the structure you’ve given your data, and some solutions lend themselves to simpler wrappers than others:

  • If you have a separate table for every translated column, you may end up with dozens of functions like “get_country_name( language_code, country_id )”
  • If you have a big table of text but reference items by category you might instead call “get_translated_item( ‘database’, ‘country’, ‘name’, country_id, language_code )
  • If you instead store the text_id as a foreign key reference, you might be able to simply say “Select translate( name_text_id, ‘de’ ) From country”

Another thing that will commonly be handled by functions or views is the fact that you won’t have a full set of translations of every item in every language, so your code needs to do something other than break when a value is missing. The general approach is to have a “fall-back” language – code somewhere that says “if there’s no French available, English will have to do”. For some languages, there might be multiple levels of fall-back, e.g. if BrazilianPortuguese is unavailable, try European Portuguese, then English as a last resort.

If you store your fall-back rules in the database itself, any of the above functions could be handling all this logic behind the scenes. You will rarely need the ability to prioritise languages differently for different users.

Modification

Most solutions lend themselves to some variation of the above for simplifying the retrieval of translated content, but it is not always easy to simplify the modification of that content, with so many tables involved.

In the “atomic i18n” model, all your data is stored inside the original tables, so the abstraction functions don’t need to know anything about the underlying data structure – they can just manipulate the values directly. So the function translate(item, language_code) just takes an item of type translated_text and looks for the appropriate language – the item might well be a particular table column, but it could just as easily be something calculated on the fly.

More significantly, the function set_translation(item, language_code, translation) can take a translated_text value, and return a new translated_text value, without updating any underlying tables. So to modify your data, you can just use INSERT and UPDATE queries, just as you would if it was a normal string:

  • To create a new value, use an INSERT statement, with an appropriate representation of the set of translations to be added:
    Insert Into country ( name, currency ) Values ( i18n.translation_pair(‘en’, ‘Austria’), ‘EUR’
    );
  • To update a value, use an UPDATE statement:
    Update country Set name = translation_set(name,  ‘de’, ‘Österreich’) Where id = 4
  • To delete a value, use a DELETE statement – no need to look for dependencies, or set up cascade triggers.

Environment Variables

One of the things that the above retrieval functions all need – for obvious reasons – is which language you are translating into. But when you come to using that in an application, that becomes a bit of a pain – every time you mention a translation function, you have to insert a placeholder or a variable from the calling application to say what the current user’s language is.

But what would be really nice is if you didn’t have to pass any languages at all – I just want to say “the current user speaks German; now go away and do this query, but in German please”. Or, more technically, to set an environment variable at the top of the query, and let the rest of the query silently refer to it.

Using Environment Variables in PL/pgSQL

When I started looking into this, I was working with Postgresql, and discovered three things:

  1. some of the procedural languages available for Postgresql, such as PL/Perl, have support for session-wide global variables
  2. the most common, PL/pgSQL, doesn’t
  3. my shared host only had PL/pgSQL installed.

So I’ve written a simple pair of functions – “env.set(item, value)” and “env.get(item)” – implemented using only PL/pgSQL. By wrapping them like this, I figure they could easily be replaced by completely equivalent – and much simpler – functions in PL/Perl, or whatever other language you may have to hand.

The implementation itself takes advantage of the way PostgreSQL handles temporary tables – namely, that they last for the entire “session” of queries, wherever they were created. So you can run one function that creates a temporary table, and another that accesses it, and they’ll “see” the same table – but if you do that twice in separate “sessions”, each session will have its own copy. Voilá, somewhere to put your environment variables.

Operators

The other thing you’ll notice about the examples so far, is that they are quite wordy – there are various functions involved, and in the middle of a complex query they could get quite confusing. Luckily, Postgres also supports custom operators, so you can replace some of that wordiness with punctuation.

Whether using operators rather than functions makes things more or less readable will depend on the context, but the simplest operation, requesting a translation, now becomes as simple as using the -> (translate) operator:

Select id, name->’en’ as name, currency From country;

Or using the environment variable approach, using the ->! operator, which means “translate to currently set language”:

Select i18n.set_lang(‘en’);

Select id, name->! as name, currency From country;

And, of course, operators can also make our updates simpler; this example uses || (combine) and ==> (create translation pair):

Update country Set name = name || (‘fr’ ==> ‘Autriche’) Where id = 4

Full Implementation

The best way to see how it all works is to play around with a working implementation, so I’ve put a fully documented “library” online as pg-atom: i18n. The implementation consists of the following parts:

  • an env schema, consisting of env.set and env.get functions, currently implemented in Pl/Pgsql, using temporary tables
  • an i18n schema, containing:
    • a language table, which defines the mapping from codes to numeric IDs and fall-back logic
    • a translated_text type, which is implemented as a standard Postgres text[] array, using the numeric language IDs as keys and the translation strings as values
    • a set of functions which implement the functionality for this new type
  • a set of operators for the more i18n functions; I haven’t declared these as inside the i18n schema, because that would make them usable only if you set your search_path variable first.

Limitations and Future Possibilities

It’s important to note that I haven’t as yet had a chance to test this in anything like a production environment, so it may still be a bit ropy. In particular, there may be performance issues, as I’m not sure how the functions and operators will interact with query plans, indexes, etc.

The other thing to point out is that while the abstraction is quite elegant, the underlying implementation is not ideal: Postgres arrays have strictly numeric, ordered, indexes, and cannot contain nulls, whereas I am effectively using them to emulate an associative array (or perhaps some kind of tuple). In fact, the only way to add something with a particular key to an array (in this case, the language ID is used as the key) is to repeatedly “pad” the array with dummy values until it is long enough; for this reason, the current implementation treats empty strings as though they were nulls.

There are other ways to simulate an associative array, but more tantalisingly, there is a module distributed with Postgres called hstore which implements true associative arrays. A better translated_text type could probably be built “on top of” that, and typecasts between translated_text and hstore values would be pretty handy.

<a href=”http://rwec.co.uk/blog/2009/11/atomic-translations-part-1/”>Part 1: Database i18n as a Data Type Problem</a>

6 thoughts on “Atomic Translations – Part 2 – Translated Text as a Data Type in PostgreSQL

  1. Pingback: rwec.co.uk » Atomic Translations – Part 1 – Database i18n as a Data Type Problem

  2. Werner says:

    Very interesting approach! Have you ever tried it with the “hstore”?

    Anyone tried it in production?

    Werner

  3. Rowan says:

    Hi!

    I haven’t had a chance to try hstore yet – I’ve only recently got myself a decent dev environment where I could play around with installing that kind of thing. It’s certainly something I should look into, as it’s likely to have plenty of advantages over my cobbled together array system.

    As for production use, I did persuade my colleagues to use it for some data like lists of resorts to use on a new Spanish website, and apart from a couple of inconsistencies with the way different Postgres versions handled the custom operators, which I need to look into, it seems to have worked out quite well. (I think the problem was with the wrong version of || being called, worked around by defining ||| instead…)

    Rowan

  4. Taai says:

    Very interesting and useful. But how to do a search, for example, SELECT * FROM “countries” WHERE “name” LIKE ‘Ger%’ ?

  5. Rowan says:

    Easy! :) A WHERE clause can have all the same syntax as a SELECT clause, so you just query against whichever translation you want:

    SELECT * FROM “countries” WHERE “name”->’en’ LIKE ‘Ger%’

    It does mean the function is being run on every column in the table, so there are performance considerations, but that’s true of most of the schemas I’ve seen for translating this kind of data (One Big Table, tables where you previously had columns, etc, all lead to much more complex query plans).

    I haven’t looked into it in detail, but I think you could build an index on (“name”->’en’) if you knew that you were going to query it in a way that would hit the index. Although if you were doing pattern matching, that might not help anyway…

    Meanwhile, I must get round to updating the version online with the fixes that have been made to get it working in production here.

  6. Krisztian says:

    This is a very interesting solution I’ve never seen before.

    There is a disadvantage that comes to my mind: In our company we use NHibernate to access database objects more easily. Like this (this is a 2 table solution):

    Product p = Product.TryFind(12345); // 12345 is the primary key
    ProductTranslation pt = p.GetTranslation(Languages.German); // Languages = enum type and German = ID of language German
    Sytem.Console.WriteLine(pt.Name);

    Your approach is using self-made operators and functions that are not supported by ORMs (like NHibernate). Meaning that we would have to use plain SQL for all queries, since the ORM does not know of your extension and can’t use it.

    This would be very nice to have:

    ORM.SetLanguage(Languages.German); // calling your i18n.set_lang function (for web pages calling it only once at BeginRequest)
    Product p = Product.TryFind(12345);
    System.Console.WriteLine(p.Name); // prints product name in German automagically

    One would need to extend the NHibernate library. Unfortunately this functionality would be only usable with pgSQL as a database system after installing your extension.

Leave a Reply

Your email address will not be published. Required fields are marked *