Friday, March 30, 2007

Polyglot content on the OLPC

Spent a good chunk of yesterday at OLPC re-teaching myself how to program in Python while making a library translator. I'm beginning to learn about internationalization, and spent far too long reading about gettext - but now I know why there are so many underscores wrapped around strings in C code (the `_(...)` you see inside printf calls - printf being C's formatted-print function).

The short answer is that they allow substitution of strings within programs - for instance, _("Hello!") resolves to "Hello!" when your program runs under an English locale, and to "Ba'ax ka wa'alik!" when it runs under a Mayan one. When it's called, gettext performs a lookup-substitution in the translation catalogs you give it (.po files, compiled into binary .mo files), which are basically lists of the phrases you've got inside your program, written in a bunch of different languages.
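A minimal sketch of what this looks like from Python (the domain name "library" and the locale directory here are made-up examples; with fallback=True, gettext quietly returns the original English string whenever no catalog is installed, so this runs even without any .mo files):

```python
import gettext

# Look up a compiled catalog for the "library" domain in Yucatec Maya
# ("yua"); both the domain and the "locale" directory are hypothetical.
# fallback=True means we get a NullTranslations object (which returns
# each msgid unchanged) instead of an error when no catalog is found.
translation = gettext.translation(
    "library", localedir="locale", languages=["yua"], fallback=True
)
_ = translation.gettext

# With a real .mo file installed this would print "Ba'ax ka wa'alik!";
# without one, the fallback just echoes the English msgid.
print(_("Hello!"))
```

The nice part is that the source code never changes - you swap translations by installing a different catalog, not by editing the program.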

.po files are apparently very widely used because they're so simple (they're literally just lists of translations of the phrases in your program), but they break down when you try to translate larger bodies of information, because there's no way to build structure into them. Every message has to have a unique ID. If you're building something on the scale of, say, the OLPC library, that's thousands and thousands of reference numbers floating around with no way to categorize them. Ouch. Okay, you could specify the format of the msgid and parse it to create a structural hierarchy, but... really, there must be a better way. Any ideas?
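For reference, a .po entry is just a msgid/msgstr pair, and the only place to smuggle in structure is the msgid string itself - the dotted IDs below are an invented convention, not anything the format enforces or understands:

```po
msgid "biology.page.title"
msgstr "Biología"

msgid "biology.cells.intro"
msgstr "Todos los seres vivos están hechos de células."
```

As far as the tooling is concerned, "biology.page.title" is just an opaque string like any other; the hierarchy exists only in the eye of whoever parses it back out.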

As part of translating the Biology page (read: while putzing around before writing the actual code that would translate the Biology page), I tried experimenting with XML as a replacement for .po files, and it seems to work. Kent Quirk pointed out that Python dictionaries would do the same thing (and they're what I'm using now), but XML has two benefits: you can enforce template compliance with a DTD, which makes translations less random and spotty (you can't put in three different translations for the title in Spanish; you have to pick one), and it fails more gracefully when someone messes up while hand-editing... and translators are going to be hand-editing these files.
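Here's a rough sketch of the dictionary approach, with the hierarchy baked in by nesting (the page structure and translations below are invented for illustration, not the actual OLPC library data):

```python
# Nested dictionaries give the hierarchy that flat .po msgids lack:
# page -> field -> language -> one (and only one) translation.
catalog = {
    "biology": {
        "title": {"en": "Biology", "es": "Biología"},
        "intro": {"en": "The study of life.", "es": "El estudio de la vida."},
    },
}

def translate(page, field, lang, fallback="en"):
    """Look up a phrase, falling back to English if it's untranslated."""
    entry = catalog[page][field]
    return entry.get(lang, entry[fallback])

print(translate("biology", "title", "es"))  # Biología
print(translate("biology", "intro", "zh"))  # no Chinese yet: falls back to English
```

Because each field maps a language to exactly one string, you get the pick-one-translation property for free - though unlike a DTD, nothing stops a hand-editor from mangling the nesting itself.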

There's also Wikipedia's approach to multilingual coordination, which (as best I can tell) is "put lots of links to variants in different languages and let people figure it out and hand-link it themselves." Weirdly enough, each language's wiki is completely separate from the rest, meaning I'd have to create separate accounts to edit the Chinese and English Wikipedias. I see the rationale behind the split for content-organizing reasons, but the inability to merge different accounts must be annoying for frequent translators there.

But yes. I learned a ton, I had a lot of fun, I started flaking the rust off my high-level programming fingers again (resulting in painfully slow coding yesterday, but it's getting better), and I'm going back. Hopefully over time my helpfulness output will start outweighing my asks-stupid-questions-that-take-time-to-answer input. Hurrah, laptops!


Katie Rivard said...

Okay, you could specify the format of the msgid and parse it to create a structural hierarchy, but... really, there must be a better way. Any ideas?

If you *were* going to structure the msgid, you might want to use quad trees. Which are really cool.

Mel said...

Cool! I've seen quadtrees (or quadtree-like objects) used in collision detection algorithms before, but had no idea they had a name. It would go a long way towards keeping the hierarchy from getting unbalanced.

I wonder whether any taxonomy of human knowledge can remain naturally balanced, though. Seems like information growth is a weirdly unpredictable organic thing that defies attempts to put it in boxes. Wikipedia is messy, but it works (and I think the messiness is a big part of why it works, but that's a different topic).