Scripts and web pages to help on localisation and QA
I've made a bunch of Perl/PHP scripts to help with localisation and its
QA. I'll present them to you and ask for your testing and feedback.
The base of these tools is a Perl script which takes the en-US and
localised DTD and .properties files from CVS and transforms them into a
translation memory (TMX) file.
This script runs as a cron job every night.
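A minimal sketch of that conversion in Python (the actual script is Perl; this ignores .properties escape sequences and DTD files, and the function names are mine, not the script's):

```python
import xml.etree.ElementTree as ET

def parse_properties(text):
    """Parse simple key=value lines, skipping comments and blanks."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, _, value = line.partition("=")
        entries[key.strip()] = value.strip()
    return entries

def build_tmx(source, target, srclang="en-US", tgtlang="fr-FR"):
    """Pair up entities found in both files and emit a TMX document."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "props2tmx", "creationtoolversion": "0.1",
        "segtype": "sentence", "o-tmf": "properties",
        "adminlang": "en", "srclang": srclang, "datatype": "plaintext"})
    body = ET.SubElement(tmx, "body")
    for key, src in source.items():
        if key not in target:
            continue
        tu = ET.SubElement(body, "tu", tuid=key)  # entity name as tuid
        for lang, seg_text in ((srclang, src), (tgtlang, target[key])):
            tuv = ET.SubElement(tu, "tuv")
            tuv.set("xml:lang", lang)
            ET.SubElement(tuv, "seg").text = seg_text
    return ET.tostring(tmx, encoding="unicode")
```

In the real script the tuid would be the directory/file/entity concatenation described above.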
The script returns two tables of matches (en-US and localised, here
French). The first column is a concatenation of the main directory, the
file name and the entity name; the second column is the en-US string and
the third the localised one.
* Second script:
The second script is more complicated and needs pre-processing by a
Perl script (run by cron every night).
For the entity updatesItem_downloadingFallback in the files
browser.properties (in browser) and calendar.properties (in calendar),
you have the same English string, "Downloading Update…",
but two different translated strings: "Téléchargement d'une mise à
jour…" and "Téléchargement de la mise à jour…".
If you want consistency in your localisation, you'll want to eliminate
these discrepancies.
I plan (perhaps someone is interested in helping me?) to semi-automate
the correction directly from the web page.
If the check box is checked, the script limits the search to entries
with the same entity name; if unchecked, the list can be longer and
sometimes contains less relevant matches.
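The consistency check itself can be sketched in a few lines of Python (illustrative data and names, not the actual Perl script): group the rows by en-US string — or by (entity, en-US string) when the check box is checked — and report groups with more than one translation.

```python
from collections import defaultdict

def find_conflicts(rows, same_entity_only=False):
    """rows: (location, entity, source, target) tuples.
    Returns {key: set of differing targets} for conflicting sources."""
    groups = defaultdict(set)
    for location, entity, source, target in rows:
        key = (entity, source) if same_entity_only else source
        groups[key].add(target)
    return {k: t for k, t in groups.items() if len(t) > 1}

rows = [
    ("browser/browser.properties", "updatesItem_downloadingFallback",
     "Downloading Update…", "Téléchargement d'une mise à jour…"),
    ("calendar/calendar.properties", "updatesItem_downloadingFallback",
     "Downloading Update…", "Téléchargement de la mise à jour…"),
    ("browser/browser.dtd", "cancel.label", "Cancel", "Annuler"),
]
```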
The last PHP script is a more advanced glossary. It is more of an aid to
the localiser. It tries (very simply for now) to align a new string with
the ones already localised, i.e. it searches for similarities between
the new entry and the old ones.
You put an en-US string (for example a new entity that just landed on the
trunk) in the search box and it will give you the best matches it can find.
For now it is a very basic search:
- the first table gives you the perfect matches;
- the second table gives the almost perfect matches: the entries in
which the whole search string can be found;
- the last table gives you the entities in which the words of your
search can be found (all the words, for now).
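Those three search levels could be sketched like this (a Python sketch with hypothetical translation-memory data, not the PHP script itself):

```python
def search(query, memory):
    """memory: {en_string: fr_string}.
    Returns (perfect, contains, all_words) match tables."""
    q = query.lower()
    words = q.split()
    perfect, contains, all_words = {}, {}, {}
    for en, fr in memory.items():
        text = en.lower()
        if text == q:                                 # level 1: exact
            perfect[en] = fr
        elif q in text:                               # level 2: whole substring
            contains[en] = fr
        elif all(w in text.split() for w in words):   # level 3: all words, scattered
            all_words[en] = fr
    return perfect, contains, all_words

memory = {
    "Bookmarks": "Marque-pages",
    "Show All Bookmarks": "Afficher tous les marque-pages",
    "Sort Bookmarks by Name": "Trier les marque-pages par nom",
}
```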
In the future I'll try to implement better alignment searches.
Some ideas are: searching on only some of the words, or giving the
translation of each word or group of words (based on a search of the
corpus from the translation memory). For example, in French "bookmarks"
is always translated as "Marque-pages".
Or applying alignment techniques as described in scientific papers.
Perhaps it would be a good idea to include this sort of alignment in a
future localisation tool which would propose translations for newly
landed entities on the trunk?
To find conflicting translations, we use poconflicts:
http://translate.sourceforge.net/wiki/toolkit/poconflicts
It can find both conflicting translations of the same source text and
identical target translations that correspond to different source texts.
So it can help both with inconsistencies and with newly introduced
ambiguities in the translation (neither of which is necessarily a
problem). Since we run this on our PO files created with the
accelerators merged into the translations, it also (optionally) allows
for checking the consistency of accelerator use.
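The second kind of check — one target translation shared by several different source strings — can be sketched the same way (toy data; this is an illustration, not poconflicts' actual implementation):

```python
from collections import defaultdict

def inverted_conflicts(pairs):
    """pairs: (source, target) tuples. Returns {target: set of sources}
    where one translation covers several distinct source strings."""
    by_target = defaultdict(set)
    for source, target in pairs:
        by_target[target].add(source)
    return {t: s for t, s in by_target.items() if len(s) > 1}

pairs = [("Folder", "Dossier"), ("Directory", "Dossier"), ("File", "Fichier")]
```

As the text says, such a collision is not necessarily a problem — "Folder" and "Directory" may legitimately share one French word — which is why the tool only flags it for review.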
Other classes of tests are performed with pofilter:
http://translate.sourceforge.net/wiki/toolkit/pofilter
This tests for several things, some important, some more cosmetic that
might be useful to have your attention pointed to. For example, I found
this in the file Philippe linked to:
<tuv xml:lang="en-US"><seg>Bookmark This Link… </seg></tuv>
<tuv xml:lang="fr-FR"><seg>Marquer ce lien </seg></tuv>
<note>endpunc: checks whether punctuation at the end of the strings match</note>
This just points out that the punctuation at the end of the two strings
doesn't correspond, but this might be intended.
For reuse of a translation memory (whether a PO compendium or a TMX file),
we usually use it as part of pot2po (when we migrate current
translations to new translation templates), but we also have a simple
XML-RPC server that can query a translation memory. We use Levenshtein
distance as the metric for finding similar translations, and in our
experience it gives very nice results. One of our Summer of Code
projects will look at some improvements in this area as well. For
reusing terminology, we use another method that tries to do some basic
stemming on the English and then recommend some possibly relevant words.
This is currently only implemented in a user facing way in Pootle.
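For reference, a minimal Levenshtein distance and a similarity ratio of the kind used to rank translation-memory candidates (a self-contained sketch, not the toolkit's code):

```python
def levenshtein(a, b):
    """Dynamic-programming edit distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def similarity(a, b):
    """1.0 for identical strings, 0.0 for completely different ones."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))
```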
For those interested, I downloaded an example TMX file that Philippe
gave a link to in IRC, and the toolkit was able to read it fine. The
newest version of the toolkit can also use TMX files directly in
pofilter, even if we can't merge changes back in with pomerge yet
(we should probably support this quite soon).
F Wolff wrote:
> On Ma, 2008-05-12 at 23:58 +0200, Philippe wrote:
>> I've made a bunch of perl/php scripts to help on localisation and its
>> QA. I'll present them to you and ask for your test and feedback.
> Hello Philippe
They give nice results (for short texts Levenshtein is better, and for
longer texts similar_text gives better results).
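PHP's similar_text() counts matching characters by locating the longest common substring and then recursing on the remainders to its left and right. A rough Python equivalent (tie-breaking may differ slightly from PHP's) makes the comparison with Levenshtein concrete:

```python
def similar_text(a, b):
    """Count matched characters, PHP similar_text-style:
    longest common substring, then recurse on both sides."""
    if not a or not b:
        return 0
    best = (0, 0, 0)  # (pos_a, pos_b, length) of longest common substring
    for i in range(len(a)):
        for j in range(len(b)):
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k > best[2]:
                best = (i, j, k)
    pa, pb, n = best
    if n == 0:
        return 0
    return (n + similar_text(a[:pa], b[:pb])
              + similar_text(a[pa + n:], b[pb + n:]))
```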
> One of our Summer of Code
> projects will look at some improvements in this area as well. For
> reusing terminology, we use another method that tries to do some basic
> stemming on the English and then recommend some possibly relevant words.
> This is currently only implemented in a user facing way in Pootle.
I'd be interested in looking at this too.
I've downloaded some scientific articles on this topic, but I haven't
had time to read them.
> For those interested, I downloaded an example TMX file that Philippe
> gave a link to in IRC, and the toolkit was able to read it fine. The
> newest version of the toolkit can also use TMX files directly in
> pofilter, even if we can't merge changes back in yet with pomerge yet
> (should probably support this quite soon).
That's good information :)
I've also fixed the extra space at the end of each <seg> in the TMX.