Scripts and web pages to help on localisation and QA

Scripts and web pages to help on localisation and QA

Philippe-3
Hello.
I've written a set of Perl/PHP scripts to help with localisation and its
QA. I'll present them here and ask you to test them and give feedback.


The basis of these tools is a Perl script that takes the en-US and
localised .dtd and .properties files from CVS and transforms them into
a translation memory (TMX) file.
This script runs as a cron job every night.
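
A minimal Python sketch of this conversion step, for illustration only
(it is not the actual Perl script). The function names, the tuid layout
and the header values are assumptions made for the example, and DTD
parsing is left out:

```python
import xml.etree.ElementTree as ET

# ElementTree serialises this namespaced attribute as xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def parse_properties(text):
    """Parse simple key=value lines, skipping comments and blank lines.

    Simplification: line continuations and escapes are not handled.
    """
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            entries[key.strip()] = value.strip()
    return entries


def build_tmx(source, target, prefix, target_lang):
    """Build a TMX tree with one <tu> per key present in both dicts."""
    tmx = ET.Element("tmx", version="1.4")
    # Placeholder header values (assumptions, not taken from the real file).
    ET.SubElement(tmx, "header", {
        "creationtool": "sketch", "creationtoolversion": "0",
        "segtype": "sentence", "o-tmf": "properties",
        "adminlang": "en-US", "srclang": "en-US", "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for key in sorted(source.keys() & target.keys()):
        # Hypothetical id layout: "directory:file:entity".
        tu = ET.SubElement(body, "tu", tuid=f"{prefix}:{key}")
        for lang, text in (("en-US", source[key]), (target_lang, target[key])):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return tmx


en = parse_properties("updatesItem_downloadingFallback=Downloading Update…\n")
fr = parse_properties(
    "updatesItem_downloadingFallback=Téléchargement d'une mise à jour…\n")
tmx = build_tmx(en, fr, "browser:browser.properties", "fr-FR")
print(ET.tostring(tmx, encoding="unicode"))
```

Keys that exist in only one of the two files are skipped here; a real
script might want to report them as untranslated or obsolete instead.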

The translation memory is stored in TMX, the standard XML format for
translation memories:
http://www.lisa.org/Translation-Memory-e.34.0.html


This TMX file is used by several PHP scripts:


* First script :

Using this TMX file, the PHP script at
http://www.frenchmozilla.fr/glossaire/sample.php acts as a glossary.
You can search for a word (or multiple words) in the search box.

The script returns two tables: matches in the en-US strings and matches
in the localised strings (here French). In each table the first column
is the concatenation of the main directory, the file name and the
entity, the second column is the en-US string and the third is the
localised one.
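
The lookup described above can be sketched in a few lines of Python.
This is only an illustration: the matching rule (every word of the query
must occur, case-insensitively) is an assumption, and the key of the
last sample row is hypothetical:

```python
# Rows are (directory:file:entity, en-US string, localised string).
ENTRIES = [
    ("browser:browser.properties:updatesItem_downloadingFallback",
     "Downloading Update…", "Téléchargement d'une mise à jour…"),
    ("calendar:calendar.properties:updatesItem_downloadingFallback",
     "Downloading Update…", "Téléchargement de la mise à jour…"),
    # Hypothetical key, used only for this example.
    ("browser:browser.dtd:bookmarkLink",
     "Bookmark This Link…", "Marquer ce lien"),
]


def search_glossary(entries, query):
    """Return (en_table, localised_table) for the given query."""
    words = query.lower().split()
    if not words:
        return [], []

    def matches(text):
        lowered = text.lower()
        return all(word in lowered for word in words)

    en_table = [row for row in entries if matches(row[1])]
    localised_table = [row for row in entries if matches(row[2])]
    return en_table, localised_table


en_table, localised_table = search_glossary(ENTRIES, "update")
for row in en_table:
    print(" | ".join(row))
```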

* Second script :

The second script is more complicated and needs preprocessing by a
Perl script (run by cron every night).

http://www.frenchmozilla.fr/glossaire/doublons.php

For the main directories you choose ("tout" means "all" in French),
this script lists the entities whose strings are identical in en-US but
differ in your localisation.

/For example, if you compare Browser with Calendar, the first row of the
table is:

browser:browser.properties:updatesItem_downloadingFallback

Downloading Update…

Téléchargement d'une mise à jour…

calendar:calendar.properties:updatesItem_downloadingFallback

Téléchargement de la mise à jour…

For the entity updatesItem_downloadingFallback in the files
browser.properties (in browser) and calendar.properties (in calendar),
the English string is the same, "Downloading Update…", but there are
two different translations: "Téléchargement d'une mise à jour…" and
"Téléchargement de la mise à jour…"
/

If you want consistency in your localisation you'll want to eliminate these.

I plan (perhaps with help, if someone is interested) to semi-automate
the correction directly from the web page.

If the check box is checked, the script limits the search to entities
with the same name; if it is unchecked, the list can be longer and
sometimes contains false positives.
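
A rough Python sketch of this comparison, including the check box
behaviour. It is not the actual script: the grouping logic is an
assumption, keys are assumed to have the form "directory:file:entity",
and the third sample row is hypothetical:

```python
from collections import defaultdict

# Rows are (directory:file:entity, en-US string, localised string).
ENTRIES = [
    ("browser:browser.properties:updatesItem_downloadingFallback",
     "Downloading Update…", "Téléchargement d'une mise à jour…"),
    ("calendar:calendar.properties:updatesItem_downloadingFallback",
     "Downloading Update…", "Téléchargement de la mise à jour…"),
    # Hypothetical row with a different entity name.
    ("mail:messenger.properties:downloadingUpdate",
     "Downloading Update…", "Téléchargement de la mise à jour…"),
]


def find_inconsistencies(entries, same_entity_only=True):
    """Group rows by en-US string and keep groups whose translations differ.

    With same_entity_only=True (check box checked) rows are only compared
    when they also share the entity name.
    """
    groups = defaultdict(list)
    for key, en, localised in entries:
        entity = key.split(":", 2)[2]
        group_key = (en, entity) if same_entity_only else (en,)
        groups[group_key].append((key, localised))
    return {
        group_key: rows
        for group_key, rows in groups.items()
        if len({localised for _, localised in rows}) > 1
    }


for group_key, rows in find_inconsistencies(ENTRIES).items():
    print(group_key[0])
    for key, localised in rows:
        print("   ", key, "->", localised)
```

Comparing on the en-US string alone (check box unchecked) pulls in
every entity that shares the string, which is why that mode returns
longer lists.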


* Third script

http://www.frenchmozilla.fr/glossaire/alignement.php

The last PHP script is a more advanced glossary, meant as an aid for the
localiser. It tries (very simply for now) to align a new string with the
ones already localised, i.e. it searches for similarities between the
new entry and the old ones.

You enter an en-US string (for example a new entity that has just landed
on the trunk) in the search box and it gives you the best matches it can
find.

For now it is a very basic search:
- the first table gives you the perfect matches.
- the second table gives the almost perfect matches: entities in which
the whole search string can be found.
- the last table gives you the entities in which the words of your
search can be found (all of the words, for now).
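
The three tables above could be produced along these lines. This Python
sketch is only an illustration: case-insensitive comparison, the word
tokenisation, and putting each entity in at most one table are all
assumptions, and the sample keys are hypothetical:

```python
import re


def align(entries, query):
    """Return (perfect, contained, all_words) lists of matching rows.

    Rows are (key, en-US string, localised string). Each row lands in the
    first table it qualifies for.
    """
    wanted = query.strip().lower()
    if not wanted:
        return [], [], []
    wanted_words = re.findall(r"\w+", wanted)

    perfect, contained, all_words = [], [], []
    for row in entries:
        en = row[1].strip().lower()
        if en == wanted:
            perfect.append(row)
        elif wanted in en:
            contained.append(row)
        elif wanted_words and set(wanted_words) <= set(re.findall(r"\w+", en)):
            all_words.append(row)
    return perfect, contained, all_words


# Hypothetical keys, used only for this example.
SAMPLE = [
    ("browser:example.dtd:entityA", "Bookmark This Link…", "Marquer ce lien"),
    ("browser:example.dtd:entityB", "Downloading Update…",
     "Téléchargement de la mise à jour…"),
]

perfect, contained, all_words = align(SAMPLE, "link bookmark")
print(len(perfect), len(contained), len(all_words))
```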

In the future I'll try to implement a better alignment search.
Some ideas: search on only some of the words, or give the translation
of each word or group of words (based on a search of the corpus from
the translation memory). For example, in French "bookmarks" is always
translated as "Marque-pages".
Or apply alignment techniques described in scientific papers.

Perhaps it would be a good idea to include this sort of alignment in a
future localisation tool that proposes translations for entities newly
landed on the trunk?


*************************Feedback, questions?****************

I encourage you to test these scripts and give us feedback.

Do you think they can be useful for your localisation work?

Do you have any ideas to improve them?

I know the pages are really ugly, but I'm not a web designer :)

If someone is interested in using them for their locale, I can provide
the sources and instructions for using them.

We can perhaps also include your localisation from cvs on our server (or
on the l10n server ?).

I will put the source of the Perl and PHP scripts on our server really
soon (I have to add comments and clean them up to remove all the
unnecessary stuff).

Hope it will interest you :)


Philippe from the Frenchmozilla team.



_______________________________________________
dev-l10n mailing list
[hidden email]
https://lists.mozilla.org/listinfo/dev-l10n

Re: Scripts and web pages to help on localisation and QA

F Wolff-2
On Mon, 2008-05-12 at 23:58 +0200, Philippe wrote:
> Hello.
> I've made a bunch of perl/php scripts to help on localisation and its
> QA. I'll present them to you  and ask for your test and feedback.
>

Hallo Philippe

There was some talk in IRC about how this compares to the Translate
Toolkit, so I'll just make a few small notes here, rather than limiting
it to the channel.

We can build a PO compendium with a bash script pocompendium, or a TMX
file with po2tmx:
http://translate.sourceforge.net/wiki/toolkit/pocompendium
http://translate.sourceforge.net/wiki/toolkit/po2tmx
Pocompendium has some advantages in being able to strip accelerators,
for example, and it is very fast as it mostly uses tools from the
gettext package.
po2tmx is limited in some regards (like the handling of duplicates) -
this might actually be the behaviour that some people want.


To find conflicting translations, we use poconflicts:
http://translate.sourceforge.net/wiki/toolkit/poconflicts
It can find either conflicting translations of the same source text or
identical target translations that correspond to different source texts.
So it can help both with inconsistencies and with newly introduced
ambiguities in the translation (neither of which is necessarily a
problem). Since we run this on our PO files created with the
accelerators merged into the translations, it also (optionally) allows
for checking the consistency of accelerator use.

Other classes of tests are performed with pofilter:
http://translate.sourceforge.net/wiki/toolkit/pofilter
This tests for several things, some important, some more cosmetic, that
it might be useful to have your attention drawn to. For example, I found
this in the file Philippe linked to:
<tuv xml:lang="en-US"><seg>Bookmark This Link… </seg></tuv>
<tuv xml:lang="fr-FR"><seg>Marquer ce lien </seg></tuv>
<note>endpunc: checks whether punctuation at the end of the strings
match</note></tu>
This just draws attention to the fact that the punctuation at the end
of the two strings doesn't correspond, but this might be intended.
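
To make the idea concrete, an end-punctuation comparison can be
sketched in Python as follows. This is not pofilter's implementation;
the function names and the rule (compare the trailing run of Unicode
punctuation after stripping whitespace) are assumptions for the example:

```python
import unicodedata


def end_punctuation(text):
    """Return the run of punctuation characters at the end of text."""
    stripped = text.rstrip()
    end = len(stripped)
    # Unicode general categories starting with "P" are punctuation.
    while end > 0 and unicodedata.category(stripped[end - 1]).startswith("P"):
        end -= 1
    return stripped[end:]


def endpunc_mismatch(source, target):
    """True when source and target end with different punctuation."""
    return end_punctuation(source) != end_punctuation(target)


print(endpunc_mismatch("Bookmark This Link… ", "Marquer ce lien "))
```

A mismatch is only a hint for the translator, since some languages
legitimately end a string differently from the English source.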


For reuse of a translation memory (whether a PO compendium or a TMX
file) we usually use it as part of pot2po (when we migrate current
translations to new translation templates), but we also have a simple
XML-RPC server that can query a translation memory. We use Levenshtein
distance as the metric for finding similar translations, and in our
experience it gives very nice results. One of our Summer of Code
projects will look at some improvements in this area as well. For
reusing terminology, we use another method that tries to do some basic
stemming on the English and then recommend some possibly relevant words.
This is currently only implemented in a user-facing way in Pootle.
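
For readers unfamiliar with the metric, here is a small self-contained
Python sketch of Levenshtein-based lookup in a translation memory. It
is not the toolkit's code; the normalisation to a 0..1 score, the
threshold and the function names are assumptions for the example:

```python
def levenshtein(a, b):
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Scale the distance to a score between 0.0 and 1.0 (1.0 = identical)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def best_matches(memory, query, threshold=0.7, limit=3):
    """Return up to `limit` (score, source, target) tuples, best first."""
    scored = [(similarity(query, source), source, target)
              for source, target in memory]
    scored = [item for item in scored if item[0] >= threshold]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:limit]


MEMORY = [
    ("Downloading Update…", "Téléchargement de la mise à jour…"),
    ("Bookmark This Link…", "Marquer ce lien…"),
]

for score, source, target in best_matches(MEMORY, "Downloading Updates…"):
    print(f"{score:.2f}  {source}  ->  {target}")
```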

For those interested, I downloaded an example TMX file that Philippe
gave a link to in IRC, and the toolkit was able to read it fine. The
newest version of the toolkit can also use TMX files directly in
pofilter, even though we can't merge changes back in with pomerge yet
(we should probably support this quite soon).

Keep well
Friedel


Re: Scripts and web pages to help on localisation and QA

Philippe-3
F Wolff wrote:
> On Mon, 2008-05-12 at 23:58 +0200, Philippe wrote:
>> Hello.
>> I've made a bunch of perl/php scripts to help on localisation and its
>> QA. I'll present them to you  and ask for your test and feedback.
>>
>
> Hallo Philippe
>

Hallo Friedel.?
[...]

>
> We can build a PO compendium with a bash script pocompendium, or a TMX
> file with po2tmx:
> http://translate.sourceforge.net/wiki/toolkit/pocompendium
> http://translate.sourceforge.net/wiki/toolkit/po2tmx
> Pocompendium has some advantages in being able to strip accelerators,
> for example, and it is very fast as it mostly uses tools from the
> gettext package.
> po2tmx is limited in some regards (like the handling of duplicates) -
> this might actually be the behaviour that some people want.

So you've already built something like mine :)

I think including accelerators directly in the TMX is something I can
do too.

[...]
> This just points attention to the fact that the punctuation at the end
> of the two translations don't correspond, but this might be intended.

That's a great idea for a test.
I'll try to implement it too.

>
> [...]

>We use Levenshtein
> distance as the metric for finding similar translations, and in our
> experience it gives very nice results.

I've implemented this metric and the "similar_text" one in the
http://www.frenchmozilla.fr/glossaire/alignement.php script.

They give nice results (for short strings Levenshtein is better, and
for longer strings similar_text gives better results).
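
For comparison with Levenshtein, the idea behind PHP's similar_text can
be sketched in Python as below. This is a reimplementation from my
understanding of the algorithm (take the first longest common substring,
then recurse on the parts to its left and right), not the PHP source,
so results may differ from PHP in edge cases:

```python
def similar_text(a, b):
    """Count matching characters via recursive longest common substrings."""
    if not a or not b:
        return 0
    best_len, best_i, best_j = 0, 0, 0
    for i in range(len(a)):
        for j in range(len(b)):
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k > best_len:  # keep the first longest match found
                best_len, best_i, best_j = k, i, j
    if best_len == 0:
        return 0
    return (best_len
            + similar_text(a[:best_i], b[:best_j])
            + similar_text(a[best_i + best_len:], b[best_j + best_len:]))


def similar_text_percent(a, b):
    """Similarity as a percentage of the combined length of both strings."""
    total = len(a) + len(b)
    if total == 0:
        return 0.0
    return similar_text(a, b) * 2 * 100 / total


print(similar_text("World", "Word"))
print(round(similar_text_percent("World", "Word"), 2))
```

Note that this measure is not symmetric: swapping the two arguments can
change the result, which is worth keeping in mind when ranking matches.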


> One of our Summer of Code
> projects will look at some improvements in this area as well. For
> reusing terminology, we use another method that tries to do some basic
> stemming on the English and then recommend some possibly relevant words.
> This is currently only implemented in a user facing way in Pootle.

I'd be interested in looking at this too.
I've downloaded some scientific articles on this topic, but I haven't
had the time to read them yet.


>
> For those interested, I downloaded an example TMX file that Philippe
> gave a link to in IRC, and the toolkit was able to read it fine. The
> newest version of the toolkit can also use TMX files directly in
> pofilter, even if we can't merge changes back in yet with pomerge yet
> (should probably support this quite soon).
>

That's good to know :)

I've also fixed the extra space at the end of each seg in the tmx.

Philippe

Re: Scripts and web pages to help on localisation and QA

Arjuna Rao Chavala
Thanks to everyone for your helpful suggestions. We will try them
after the Fx 2 version for our language, Telugu, has shipped.

Best regards
Arjun,
Firefox Telugu localization Team
