2004.06_Bilingual Programming-Make It Easy for Multi-Language Applications.pdf

PROGRAMMING

Bilingual Programming

Second language

Linux is international. It was started by a programmer from Finland who speaks Swedish, aided by a Welsh-speaking lieutenant, and supplemented with a kernel maintainer from Brazil. So why is all our software written in English? This month: multi-lingual development and the gettext package.

BY STEVEN GOODWIN

The English language holds the same power in today’s society that Latin did many hundreds of years ago. It’s not the most expressive language, nor is it the most popular. It certainly isn’t the easiest to learn. It is, however, the most widespread. With the remnants of the old British Empire still present, and the continued growth of America, people are required to use English in order to compete on the world stage.

Computers and the Internet have increased this linguistic stranglehold. More web pages exist in English than in any other language. Most programming languages use English words like if and while, regardless of the designer’s nationality. Most software uses prompts and error messages that are written in English.

However, with Linux taking control of many different systems across the globe, it would appear xenophobic of us to continue developing ‘English-only’ software. Adding the ability to change the language (or locale) of your software is not a difficult task, but it shows a wider commitment to your project, and to the open source community in general. Even if you cannot translate the text yourself, you can make it easier for someone else to do so by following the guidelines in this article.

Turning Japanese

GNU/Linux uses a technique known as locales to determine many things: the appropriate translations for text, the character set required to represent the alphabet, and cultural specifics like the expression of numbers, or the date. Each area is considered in the box: Locale Categories, although the focus of this article will be on text translation.

So let us start with the simplest program we know, Hello World. We shall be coding in C, although the same techniques can be applied regardless of language. You’ll be able to test the PHP equivalent by using the code in box: In PHP.

#include <stdio.h>

int main(int argc, char *argv[])
{
    printf("Hello World!\n");
    return 0;
}

It’s fairly obvious to us where the translation string will need to go. At compile time, however, we do not know what the replacement string will be, or what languages it will need to be in. This prevents us from including any translation data directly in the program. Instead, we must build up catalogs of each word and phrase used by our program, and employ the gettext package to act like a dictionary. This will replace our (English) words with the correct foreign version at run time. What is ‘correct’ will be determined by the user’s specific locale.

We are required to do two things:

1. Mark the source code to say ‘get me the correct words for the phrase XYZ’.
2. Build a translation dictionary for each language we need to support.

Marking the source is a simple process. We, as programmers, must work through each line of the code and indicate which lines of text will need translating. We can do this by calling a special function (called, not surprisingly, gettext) that will consult the dictionary and convert our string to something suitably foreign.

In PHP

Writing multi-lingual software in PHP is no different from using C. The functions even have the same names! However, when run as part of a web page, it might be more suitable to specify the locale explicitly, perhaps taking it from a session variable, or a cookie on the user’s machine.

<?php
setlocale(LC_ALL, "fr_FR");
textdomain("lm");
echo gettext("Hello World!\n");

The effect of setlocale can also be achieved by using the putenv function.

putenv("LANG=fr");

June 2004
www.linux-magazine.com

Applied to Hello World, the gettext call gives us,

printf(gettext("Hello World!\n"));

This function can be found in the libintl header file, so we must,

#include <libintl.h>

Compiling under GNU/Linux requires no extra link libraries for the code to work. The word GNU is essential here, because the internationalization features are included directly in glibc. Users of other Unix-like systems may not be so lucky. However, without a language catalog, no translations will be made. That doesn’t matter at the moment, since the English text will be output in all cases where a translation cannot be found.

C programmers will also note that this method is not all-encompassing, because there is more than one way to declare a string. However, we’ve only learnt one way to mark strings for translation. So we will need to use another method, to cope with those cases where a function call to gettext would be rejected by the compiler, such as the initializer of a global variable. For example,

char *pHello = "Hello, World!\n";

To circumvent this problem, we need to create a macro that includes a marker, but has no adverse effect on the syntax.

#define gettext_noop(String) String
...
char *pHello = gettext_noop("Hello, World!\n");

We then need to invoke the translation module in the usual way, before we output the string. Like so,

printf (gettext (pHello));

These markers not only perform the translation when the program is running, but indicate to us what text needs to be translated. We shall shortly see a tool that makes use of these markers itself to help build the dictionary of translations. If we were to build the dictionary manually (but why would we?!), the gettext_noop marker would be unnecessary.

Some programmers prefer to replace this nine character marker with a single character macro, such as the underscore. This is because the word gettext (and both brackets) can cause many lines to break the 80 character limit. This is simply,

#define _(str) gettext (str)
#define N_(str) gettext_noop (str)

The GNU standard prefers a space between function name and bracket, but this is often omitted.

We can now move on and build our foreign language dictionary.

Locale Categories

A category defines a set of data, and every supported language has its own set of data. The category might define the way to impart particular information: numbers over 1000 might be separated by commas or dots, for example, or the date might be written day-month-year or month-day-year. This information is not related to the language as such, which is why the term ‘locale’ is used, constituting both language and cultural specifics. A directory is created for each category.

There are standard functions to format these locale strings. For example, strfmon and strftime format the text for money and time data, respectively.

Category       Meaning
LC_COLLATE     Order of string collation
LC_CTYPE       How to define characters. Echoes of ctype.h, as this also performs upper/lower case conversion
LC_MESSAGES    The translated text. The focus of this article
LC_MONETARY    Format and symbols for money
LC_NUMERIC     Format and symbols for numbers
LC_TIME        Format and symbols for time and date

Vienna Calling

Building a file that contains all the strings in a program is not as time-consuming as you might think. Naturally, it is a very common task, and can be achieved by using a tool named xgettext. This is one of the few instances where the ‘x’ does not stand for an X Window program. Instead, it is short for ‘extract’. This program will search the source file for any string used in conjunction with the function call gettext (or gettext_noop) and place the text into a catalog file (ending with the suffix .po) ready to be translated. The program understands enough about C, and about other languages (see box: xgettext: Supported Languages), to understand the syntax of a function call, and differentiate it from variables and comments.

$ xgettext -d lm helloworld.c
$ tail -n 3 lm.po
#: helloworld.c:5
msgid "Hello World!\n"
msgstr ""

As you can see, each piece of text has a marker ID and an equivalent string, ready for translating. This string can only hold a translation for one specific language, so this file becomes a template. Each translator takes a copy of it, and translates the text within it to his or her native tongue. Sometimes, this .po file is renamed to .pot to differentiate between the template, and the language-specific catalog files.

xgettext: Supported Languages

C, C++, ObjectiveC
Python
Lisp, EmacsLisp
librep
Java
awk
YCP
Tcl
RST
Glade

Note that xgettext will search for the function name gettext. It does not understand enough of the C syntax (or that of any language) to understand techniques like #define _(str), given above. This doesn’t preclude the use of such tricks however. There are two popular solutions. One is to specify the underscore as an additional keyword that will act in the same manner as if it were gettext.

$ xgettext -d lm -k_ helloworld.c

Alternatively, you could pre-process your C file (causing the macro to be expanded) before running xgettext.

$ xgettext -C -d lm <(gcc -E helloworld.c)

In the pre-processing example we specify the -C flag (shorthand for --language=C++) to tell xgettext what kind of source is arriving through the pipe, since there is no file name extension to guess from. Users of automake will have an easier life, since the Makefile will generate these files automatically.

You will also note that the file contains comments using the familiar hash symbol. These comments come in four flavors, and are determined by the character immediately following the hash, as seen in Table 1.

Table 1: Hash symbols

Character       Comment type    Notes
. (period)      Automatic       Should not be touched
: (colon)       Reference       The file & line number of the string
, (comma)       Flag            To indicate the translation is ‘fuzzy’, for example
(whitespace)    Translator      As entered by a human

The xgettext program can also add comments into the .po file when, for example, it believes the strings may be used for special formatting. The .po file also contains a header to indicate the revision date of the file, and the translator that last edited it.

Having now gotten this template file, we need to create a catalog for a foreign language. Like French.

Tour De France

We start by making a simple copy of the template file, and adding the appropriate French words to each msgstr.

msgid "Hello World!\n"
msgstr "Bonjour, le monde"

We can add the string(s) by either modifying the file directly, or using one of the many tools available. Translators using the Emacs editor have an advantage here, since they may use PO mode. For those who favor a GUI, the program poedit can also be used.

To be used by our Hello World program, this text file needs to be converted into a machine-friendly, binary, format. The program that does this is called msgfmt and creates a file (ending in .mo instead of .po) that is more optimal for accessing arbitrary strings. It is not only trivial to use, but includes error checking which highlights the deliberate mistake above. Did you spot it? See Listing 1.

Listing 1: Finding an error

$ msgfmt lm.po
msgfmt: lm.po: warning: Charset "CHARSET" is not a portable encoding name.
                        Message conversion to user's charset might not work.
lm.po:19: `msgid' and `msgstr' entries do not both end with '\n'
msgfmt: found 1 fatal error

The first warning simply reminds us that we haven’t changed the header information yet. We can fix that by amending the line to use the appropriate characterization.

"Content-Type: text/plain; charset=ISO-8859-1\n"

To determine an appropriate code you can refer to the box: ISO 8859, or [1] for a more detailed analysis. This information is of more use to translators than programmers. As is the extensive functionality provided by [2].

The error itself is easily fixed: the French msgstr is missing the \n that ends its msgid. In larger programs, slips like this are more difficult for humans to spot. msgfmt can also check the strings for the correct number (and type) of arguments using the -c option. Note that, unless told otherwise, msgfmt writes its output to messages.mo; running msgfmt -o lm.mo lm.po gives us the lm.mo file used below. We’re now ready to test it!

Norwegian Wood

In order to convince our program to use an appropriate language dictionary, we need to add a couple of further lines of code to indicate that we’re happy about using a locale. These are straightforward, and common to all such programs.

#include <locale.h>
...
char *pPackage = "lm";
char *pDirectory = "locale";
...
setlocale (LC_ALL, "");
bindtextdomain (pPackage, pDirectory);
textdomain (pPackage);

The bindtextdomain function indicates the local root directory of our translated catalog files, while textdomain requires us to specify the name of our package, or program. Ours is called ‘lm’, since we’ve created an lm.mo catalog. Note that if you specify a relative path for the locale directory, be careful not to change directory, as this path would then become unreachable.

While in our local root directory, we must create a locale directory, and copy our lm.mo to the appropriate place in the tree. That place being,

$ mkdir -p locale/fr/LC_MESSAGES
$ cp lm.mo locale/fr/LC_MESSAGES

Since the package is called ‘lm.mo’ in every language, we use the directory name to distinguish between a French lm.mo and a German lm.mo. This name is determined by the conventional language codes, as detailed at [3]. The directory named LC_MESSAGES is needed because of the wide variety of different locale information that might be present. There can also be directories to indicate the format of the date, and how to represent numbers. See box: Locale Categories for a full list.

Now you can run your program (without having to recompile), using a French locale, and witness the result.

$ LANG=fr_FR ./hello
Bonjour, le monde

For a more permanent change of locale, you must export the LANG environment variable in the usual way. For example,

$ export LANG=fr_FR
$ ./hello
Bonjour, le monde

If you’re on an exclusively English system this may not work, due to the fact there is no French locale on your system (other potential problems are covered in [4]). The /etc/locale.gen file will indicate which locales have been generated for your machine, whereas the file /usr/share/i18n/SUPPORTED will indicate which ones can be installed (along with their appropriate ISO-8859 sets). Generating a French locale can be done easily with,

$ su
# you must be root to do this
Password:
# echo "fr_FR ISO-8859-1" >> /etc/locale.gen
# locale-gen
Generating locales...
fr_FR.ISO-8859-1... done
Generation complete.

Debian users can also use dpkg-reconfigure locales.

ISO 8859

ISO            Characterization
ISO 8859-1     Western, or west European
ISO 8859-2     Central European, or east European
ISO 8859-3     South European, or Maltese (and Esperanto)
ISO 8859-4     North European
ISO 8859-5     Eastern European, Cyrillic alphabets like Russian
ISO 8859-6     Arabic
ISO 8859-7     Greek
ISO 8859-8     Hebrew
ISO 8859-9     Turkish
ISO 8859-10    Nordic (Sámi, Inuit, Icelandic)
ISO 8859-11    Thai
ISO 8859-12    (was Celtic, but withdrawn)
ISO 8859-13    The Baltic Rim
ISO 8859-14    Celtic
ISO 8859-15    Euro
ISO 8859-16    South eastern European (incorporates euro symbol)

You can test this using your own program, or (if you think the bug belongs to Hello World!) one of the multi-lingual GNU tools, such as rm.

$ LANG=fr_FR rm this_wont_exist
rm: Ne peut enlever `this_wont_exist': Aucun fichier ou répertoire de ce type

To make your dictionary available to others, you should install it into the global repository of .mo files at /usr/share/locale/ (or the location specified by the environment variable, TEXTDOMAINDIR). This directory uses the same hierarchy given above. Installing your text here (which also requires superuser privileges) means your code no longer needs to specify a directory to the bindtextdomain function, and you can replace the directory name with NULL.

Having now understood the technical process behind multi-lingual software, let us review some of the finer details we need to consider when programming.

Spanish Eyes

Most developers have a method for dealing with strings, like their favorite string library, for example. They also have their own methods for building strings dynamically, either to add plurals, or build large sentences from component parts (like the verbal Lego of automated train announcements). We shall now cover a number of these methods, highlighting the problems (and solutions) involved.

printf("Deleting %d file%s", iNum, iNum==1?"":"s");

Above is a common example to create a plural. The case of ‘one file’ requires a singular noun, whereas everything else uses the plural, files. That’s in English! Not all languages follow this pattern. The case of ‘zero files’ might not be plural (as in French), or there could be separate words for zero, one and two (such as those in the Baltic family). To compensate for this, a separate function, ngettext, is available which takes two string IDs (one for singular, and one for plural) and a number. The number is then used to determine which version of that string should be used in translation.

printf (ngettext ("Deleting %d file", "Deleting %d files", iNum), iNum);

Upon seeing the ngettext marker, the xgettext program will generate two string IDs in the .po file, ready for the translator, along with a special c-format comment, which we’ll come to shortly.

#: helloworld.c:32
#, c-format
msgid "Deleting %d file"
msgid_plural "Deleting %d files"
msgstr[0] ""
msgstr[1] ""

Not all the problems are solved by ngettext though. At some point you will come across problems that occur when we use two or more arguments in a printf, because the word order is imperative. Even in a simple (English) program, a mismatched %d and %s can cause printf to core dump. After translating a simple phrase, such as “There are %d files named %s”, it is not unreasonable for the resultant text to appear as “With the name %s, there are %d files”. What’s more, since we (as programmers) do not know about every other possible translation, it is not something we can prevent. More subtle problems can occur with phrases like “Copying file from %s to %s”.

There are two methods of resolving the word order problem. The first requires that the translator modify the wording so that the arguments always appear in the right order. The msgfmt command can then be called using the -c option, so that it will perform checks on the .po file. This option actually performs three separate checks. They are: format (the one we need in this instance), header (the presence and contents of the header) and domain (checking for problems with the domain directives).

The second solution places the onus on the programmer, and is preferred. In this case, the format string must be amended to describe the order of the parameters. So, using our copy file example above, this would give us,

printf (gettext ("Copying file from %1$s to %2$s"), pSrc, pDest);

The special format specifiers, %1$s and %2$s, are handled by the printf code in glibc. Non-GNU variants may not be so feature-full.

Having highlighted the word order problem, you should now be aware that constructing strings at run-time is a bad idea. The solutions we have available to us can only work when the entire string is given to the translator. Splitting text up into sections and using strcat (or similar) should be avoided at all costs, since the translator has no understanding of the ordering (or the ability to change it), or the meaning of the sentence. Each string contained in the catalog must make sense when presented on its own.


This, for instance, is exactly the kind of code to avoid,

/* Don't code like this!! */
/* szMessage is an illustrative buffer, assumed to be large enough */
strcpy(szMessage, "Copying file from ");
strcat(szMessage, pSrc);
strcat(szMessage, " to ");
strcat(szMessage, pDest);

In some applications, the most difficult word to translate is ‘the’! English has only one word for the definite article, ‘the’. French, German, Spanish, and many others don’t. Depending on the language, they may have special versions for masculine, feminine, neuter and plural. The same is true of the indefinite article, ‘a’. Normally, these words will be included as part of the standard translation. By now you should have learnt that building strings dynamically is not a good idea. In some cases it can be very tempting to cut down on the quantity of translations required, as in Listing 2.

Listing 2: Smaller translations

if (mygetfiletype (szFilename) == DIRECTORY)
    pFiletype = gettext ("directory");
if (mygetfiletype (szFilename) == FILE)
    pFiletype = gettext ("file");
printf (gettext ("%1$s is a %2$s"), szFilename, pFiletype);

We should modify this so that the strings read ‘a directory’ and ‘a file’, so the translated versions will work regardless of gender. However, you might argue, if we also had a portion of the program that produced a short version of the file listing, we would be doubling the work for the translator! For instance, in Listing 3.

Listing 3: Doubling the work

if (mygetfiletype (szFilename) == DIRECTORY)
    pFiletype = gettext ("directory"); /* same strings as before - does this mean less work? */
if (mygetfiletype (szFilename) == FILE)
    pFiletype = gettext ("file");
printf ("%s : %s", szFilename, pFiletype); /* no translation required here */

That’s true. We are doubling the work! However, this extra work is minimal, especially compared to the programmer hassle that might otherwise be involved, or the cringe-inducing gender misuse when the wrong version of ‘the’ is prepended to the words.

China Girl

The last implementation problem we shall mention involves aesthetics. This refers to the screen layout, the menus of a GUI, and the use of tab stops. Although your program may look nicely formatted in English, as soon as any of the words change, your pre-determined layout will break. German words, for example, are on average 50% longer than their English equivalents. You have two choices. Either ignore word length, or code around it.

Most (if not all) command line utilities are unconcerned with special formatting. The information is functional and uniform, making it suitable for parsing by scripts. GUI software may explicitly place text in two columns, at X1 and X2, in order to appeal to the end user. There’s nothing wrong with wanting to appeal to the end user! Unfortunately, when running under a different locale, the text in the left column may overrun the text in the right.

To avoid this problem you will need to write some more code. This might involve adjusting the position of the right hand column, perhaps by calculating the longest piece of text in the left, or you might need to word-wrap everything. It might involve scrolling the text within the visible window (like XMMS). It might simply chop all characters that overrun, and ask the translator for shorter versions. The solution you employ will vary according to the amount of work you, and your translators, are willing to do. Only applications that sell on their presentation abilities (like games) should consider this a necessity.

Vienna

As software develops, more and more strings will be added to the program. Re-translating the whole program every time is obviously wasted effort. So instead, we should use the msgmerge tool. This takes the current language-specific catalog, followed by the newest language template (the .po file, that’s often renamed to .pot) without any translations, to build a new .po. This new file contains all the original translations, combined with the new, as yet untranslated, strings.

$ msgmerge current_language_po.po new_template.pot > new_language_po.po

Metropolis

With the gettext package, we can create truly multi-lingual software, even if we can’t speak any of the languages in question. Using separate language catalogs allows the translation work to be distributed amongst those who can speak different tongues, without having to recompile the code. This makes it a fully data-driven, distributed, piece of development work.

So with that thought I bid you all a fond farewell. Au revoir. Auf Wiedersehen. Adiós and Arrivederci! ■

INFO

[1] ISO8859 Alphabet Soup: http://wwwwbs.cs.tu-belin.de/user/czyborra/charsets/
[2] Data on languages: http://www.eki.ee/letter/
[3] Language codes: http://www.loc.gov/standards/iso639-2/langcodes.html
[4] FAQ for GNU gettext: http://www.haible.de/bruno/gettext-FAQ.html#integrating_noop

Unicode

All the examples in this article use single-byte character sets (ASCII and ISO 8859-1). These cover most western languages, but neglect those languages whose characters need two or more bytes, such as Chinese. In order to support them fully, we need to work in Unicode. This involves a much larger quantity of work, as the basic char type cannot hold a whole character, and is instead replaced by wchar_t. Also, many of the well-known functions (like sprintf) need to be adapted to use their equivalent wide versions, like swprintf.
