Atomic Translations – Part 2 – Translated Text as a Data Type in PostgreSQL

by Rowan on 12 December, 2009

[Part 1: Database i18n as a Data Type Problem | Prototype Postgres Library]

In my last post (longer ago than I intended), I discussed the approaches I’ve come across to adding i18n to a database schema, and outlined my theory that the whole problem could be recast as one of data types. In order to explore how this would work in practice, I’ve put together a prototype “library” using only a standard install of PostgreSQL that shows how simple it could be, and why it makes sense.

Note: this article was written long before Postgres had native support for JSON data. An implementation for a modern version of Postgres would almost certainly make use of jsonb as a base type instead of text[].

Views and Functions

One of the things I haven’t really touched on with my examples so far is that you will not normally interact directly with all the tables or data created by any i18n solution. You don’t want to write double the code to perform the same query, just because it contains translated items, so you will generally try to wrap it up in a set of functions and/or views that provide a high level of abstraction.

What these functions look like will be closely related to the structure you’ve given your data, and some solutions lend themselves to simpler wrappers than others:

  • If you have a separate table for every translated column, you may end up with dozens of functions like “get_country_name( language_code, country_id )”
  • If you have a big table of text but reference items by category, you might instead call “get_translated_item( 'database', 'country', 'name', country_id, language_code )”
  • If you instead store the text_id as a foreign key reference, you might be able to simply say “Select translate( name_text_id, 'de' ) From country” – a wrapper of this shape is sketched below
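
To make the last of those concrete, a wrapper of that shape might look something like this – a minimal sketch, assuming a hypothetical translation table with text_id, language_code and content columns (the names here are illustrative, not from any real library):

-- Illustrative only: the table and column names are assumptions
Create Function translate( p_text_id integer, p_language text )
Returns text As $$
    Select content
    From translation
    Where text_id = p_text_id
      And language_code = p_language;
$$ Language sql Stable;

-- Usage:
-- Select translate( name_text_id, 'de' ) From country;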

Another thing that will commonly be handled by functions or views is the fact that you won’t have a full set of translations of every item in every language, so your code needs to do something other than break when a value is missing. The general approach is to have a “fall-back” language – code somewhere that says “if there’s no French available, English will have to do”. For some languages, there might be multiple levels of fall-back, e.g. if Brazilian Portuguese is unavailable, try European Portuguese, then English as a last resort.

If you store your fall-back rules in the database itself, any of the above functions could be handling all this logic behind the scenes. You will rarely need the ability to prioritise languages differently for different users.
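
As a rough sketch of what storing those rules might look like – with illustrative names, not the actual library schema – a self-referencing language table can encode an arbitrary chain of fall-backs:

Create Table language (
    id          integer Primary Key,
    code        text Unique Not Null,
    fallback_id integer References language ( id )  -- Null = last resort
);

Insert Into language ( id, code, fallback_id ) Values
    ( 1, 'en',    Null ),  -- English is the last resort
    ( 2, 'pt',    1 ),     -- European Portuguese falls back to English
    ( 3, 'pt-BR', 2 );     -- Brazilian Portuguese tries pt, then en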

Modification

Most solutions lend themselves to some variation of the above for simplifying the retrieval of translated content, but with so many tables involved, it is not always as easy to simplify the modification of that content.

In the “atomic i18n” model, all your data is stored inside the original tables, so the abstraction functions don’t need to know anything about the underlying data structure – they can just manipulate the values directly. So the function translate(item, language_code) just takes an item of type translated_text and looks for the appropriate language – the item might well be a particular table column, but it could just as easily be something calculated on the fly.

More significantly, the function set_translation(item, language_code, translation) can take a translated_text value and return a new translated_text value, without updating any underlying tables. So to modify your data, you can use ordinary INSERT and UPDATE queries, just as you would if it were a normal string:

  • To create a new value, use an INSERT statement, with an appropriate representation of the set of translations to be added:
    Insert Into country ( name, currency ) Values ( i18n.translation_pair('en', 'Austria'), 'EUR' );
  • To update a value, use an UPDATE statement:
    Update country Set name = translation_set(name, 'de', 'Österreich') Where id = 4
  • To delete a value, use a DELETE statement – no need to look for dependencies, or set up cascade triggers.

Environment Variables

One of the things that the above retrieval functions all need – for obvious reasons – is to know which language you are translating into. But when you come to use that in an application, it becomes a bit of a pain – every time you mention a translation function, you have to insert a placeholder or a variable from the calling application to say what the current user’s language is.

What would be really nice is if you didn’t have to pass any languages at all – I just want to say “the current user speaks German; now go away and do this query, but in German please”. Or, more technically: set an environment variable at the top of the query, and let the rest of the query silently refer to it.

Using Environment Variables in PL/pgSQL

When I started looking into this, I was working with PostgreSQL, and discovered three things:

  1. some of the procedural languages available for PostgreSQL, such as PL/Perl, have support for session-wide global variables
  2. the most common, PL/pgSQL, doesn’t
  3. my shared host only had PL/pgSQL installed.

So I’ve written a simple pair of functions – “env.set(item, value)” and “env.get(item)” – implemented using only PL/pgSQL. By wrapping them like this, I figure they could easily be replaced by completely equivalent – and much simpler – functions in PL/Perl, or whatever other language you may have to hand.

The implementation itself takes advantage of the way PostgreSQL handles temporary tables – namely, that they last for the entire “session” of queries, wherever they were created. So you can run one function that creates a temporary table, and another that accesses it, and they’ll “see” the same table – but if you do that twice in separate “sessions”, each session will have its own copy. Voilà, somewhere to put your environment variables.
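
To give a flavour of the technique, here is a rough sketch along the same lines as the real functions (the actual library code differs in its details):

Create Schema env;

Create Function env.set( p_item text, p_value text ) Returns void As $$
Begin
    -- Create the per-session table on first use
    Begin
        Create Temporary Table session_env (
            item  text Primary Key,
            value text
        );
    Exception When duplicate_table Then
        Null;  -- already created earlier in this session
    End;
    Delete From session_env Where item = p_item;
    Insert Into session_env ( item, value ) Values ( p_item, p_value );
End;
$$ Language plpgsql;

Create Function env.get( p_item text ) Returns text As $$
Declare
    v_value text;
Begin
    Begin
        Select value Into v_value From session_env Where item = p_item;
    Exception When undefined_table Then
        v_value := Null;  -- nothing has been set this session
    End;
    Return v_value;
End;
$$ Language plpgsql;

Setting and later reading a variable is then just “Select env.set('lang', 'de');” followed by “Select env.get('lang');”.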

Operators

The other thing you’ll notice about the examples so far is that they are quite wordy – there are various functions involved, and in the middle of a complex query they could get quite confusing. Luckily, Postgres also supports custom operators, so you can replace some of that wordiness with punctuation.

Whether using operators rather than functions makes things more or less readable will depend on the context, but the simplest operation, requesting a translation, now becomes as simple as using the -> (translate) operator:

Select id, name->'en' as name, currency From country;

Or using the environment variable approach, using the ->! operator, which means “translate to currently set language”:

Select i18n.set_lang('en');

Select id, name->! as name, currency From country;

And, of course, operators can also make our updates simpler; this example uses || (combine) and ==> (create translation pair):

Update country Set name = name || ('fr' ==> 'Autriche') Where id = 4
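
Declaring such operators is fairly simple. The following is a sketch only – in particular, i18n.translate_current is an assumed name for a wrapper that reads the environment variable, not necessarily what the library calls it:

-- Binary operator: translated_text -> language code
Create Operator -> (
    Leftarg   = i18n.translated_text,
    Rightarg  = text,
    Procedure = i18n.translate
);

-- Postfix operator (supported in older Postgres versions): translate to
-- whatever language is currently set in the environment variable
Create Operator ->! (
    Leftarg   = i18n.translated_text,
    Procedure = i18n.translate_current  -- assumed wrapper around env.get
);

-- Binary operator: language code ==> translation string
Create Operator ==> (
    Leftarg   = text,
    Rightarg  = text,
    Procedure = i18n.translation_pair
);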

Full Implementation

The best way to see how it all works is to play around with a working implementation, so I’ve put a fully documented “library” online as pg-atom: i18n. The implementation consists of the following parts:

  • an env schema, consisting of env.set and env.get functions, currently implemented in PL/pgSQL, using temporary tables
  • an i18n schema, containing:
    • a language table, which defines the mapping from codes to numeric IDs and fall-back logic
    • a translated_text type, which is implemented as a standard Postgres text[] array, using the numeric language IDs as keys and the translation strings as values (sketched below)
    • a set of functions which implement the functionality for this new type
  • a set of operators for the most commonly used i18n functions; I haven’t declared these inside the i18n schema, because that would make them usable only if you set your search_path variable first.
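
To make that type concrete, here is a minimal sketch of how it and its basic lookup might be declared, assuming a language table like the one sketched earlier (the real library’s definitions may well differ):

Create Schema i18n;

-- The type itself: nothing more than a domain over a standard text array
Create Domain i18n.translated_text As text[];

-- Basic lookup: map the language code to its numeric ID, subscript the
-- array, and treat empty strings as missing (see the limitations below)
Create Function i18n.translate( item text[], lang_code text )
Returns text As $$
    Select Nullif( item[( Select id From i18n.language Where code = lang_code )], '' );
$$ Language sql Stable;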

Limitations and Future Possibilities

It’s important to note that I haven’t as yet had a chance to test this in anything like a production environment, so it may still be a bit ropy. In particular, there may be performance issues, as I’m not sure how the functions and operators will interact with query plans, indexes, etc.

The other thing to point out is that while the abstraction is quite elegant, the underlying implementation is not ideal: Postgres arrays have strictly numeric, ordered indexes, and cannot contain nulls, whereas I am effectively using them to emulate an associative array (or perhaps some kind of tuple). In fact, the only way to add something with a particular key to an array (in this case, the language ID is used as the key) is to repeatedly “pad” the array with dummy values until it is long enough; for this reason, the current implementation treats empty strings as though they were nulls.
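
The padding trick looks something like this – a sketch of the general technique rather than the library’s exact code:

-- Set element idx of arr, padding with dummy empty strings as needed
Create Function set_element( arr text[], idx integer, val text )
Returns text[] As $$
Begin
    While Coalesce( array_upper( arr, 1 ), 0 ) < idx Loop
        arr := arr || '';  -- dummy value standing in for null
    End Loop;
    arr[idx] := val;
    Return arr;
End;
$$ Language plpgsql Immutable;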

There are other ways to simulate an associative array, but more tantalisingly, there is a module distributed with Postgres called hstore which implements true associative arrays. A better translated_text type could probably be built “on top of” that, and typecasts between translated_text and hstore values would be pretty handy.
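
For comparison, once the hstore module is installed, real key/value lookups come for free, and language codes could be used directly as keys:

Select ( 'en=>Austria, de=>Österreich'::hstore ) -> 'de';  -- returns 'Österreich'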

[Part 1: Database i18n as a Data Type Problem] https://rwec.co.uk/blog/2009/11/atomic-translations-part-1/
