Project number: 2010A-906    Date: 2013/05/11
Research topic: An example-based automatic translation engine and an experimental application infrastructure
Researcher: affiliation (at the time), position, name
(Principal investigator) Faculty of Science and Engineering, Professor (fixed-term), Yves Lepage
Summary of research results
The final goal of this study is to produce an example-based machine translation engine that can be distributed to the research community on a site dedicated to example-based approaches to machine translation. The engine should use chunks to translate by analogy, and should be made fast by using C implementations of basic computations (resolution of analogical equations). The approach should be tested on various data, like the Europarl data.

1. Work on chunking has been done by implementing two methods: marker-based chunking (Gough and Way, 2004) (255 lines of Python code for chunking) and secability (Chenon, 2005) (170 lines of Python code).
Tests on the Europarl corpus and an informal assessment of the relevance of the chunks produced by the two methods have led us to prefer the marker-based chunking technique.
In contrast to the standard method proposed by Gough and Way (2004), we automatically determine the markers as the most frequent, least informative words in a corpus (207 lines of Python code).
The number of markers can be freely chosen by the user.
In contrast to the standard method proposed by Gough and Way (2004), we automatically determine whether to cut on the left or on the right of each marker, so as to obtain a truly language-independent method.
There are still problems with this part of the computation, which is currently done by estimating the difference in entropy on the left and on the right of each marker.
Improvements are under study.
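The marker-based chunking described above can be sketched as follows. This is a minimal illustration, not the report's actual code: raw word frequency stands in for the frequency/informativeness criterion, and all names are hypothetical.

```python
from collections import Counter

def find_markers(sentences, n_markers):
    """Pick the n most frequent words of the corpus as markers
    (a simple proxy for 'most frequent, least informative')."""
    counts = Counter(w for s in sentences for w in s.split())
    return {w for w, _ in counts.most_common(n_markers)}

def chunk(sentence, markers, cut_left=True):
    """Cut the sentence immediately before (cut_left=True) or
    after (cut_left=False) each marker word."""
    chunks, current = [], []
    for word in sentence.split():
        if word in markers and cut_left:
            if current:
                chunks.append(current)
            current = [word]
        else:
            current.append(word)
            if word in markers and not cut_left:
                chunks.append(current)
                current = []
    if current:
        chunks.append(current)
    return [' '.join(c) for c in chunks]
```

With English function words as markers, a sentence like "the cat sat on the mat" is cut before each occurrence of "the". The `cut_left` flag corresponds to the left/right cutting decision that the entropy estimation is meant to automate.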

1.1. We conducted experiments to compute the number of analogies between the chunks obtained (100,000 lines in the 11 languages of the Europarl corpus; average sentence length in English: 30 words).
This led to a paper at the Japanese Natural Language Processing Annual Conference (gengosyorigakkai) this year.
My participation in this conference was charged to this budget.
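Counting analogies between chunks becomes tractable if quadruples are first filtered with a standard necessary condition on symbol counts: in a proportional analogy a : b :: c : d, every symbol must occur as many times in a and d together as in b and c together. The sketch below shows only this filtering step, not a full verification of the analogy:

```python
from collections import Counter

def may_be_analogy(a, b, c, d):
    """Necessary (not sufficient) condition for the string analogy
    a : b :: c : d: symbol counts in a + d must equal those in b + c."""
    return Counter(a) + Counter(d) == Counter(b) + Counter(c)
```

For instance, "walk : walked :: jump : jumped" passes the check, while "walk : walked :: jump : jumps" is rejected because of the extra "s".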

1.2. The production of all chunks for each of the 11 languages of the Europarl corpus (300,000 lines in each language) has been completed.
The alignment of chunks by computation of lexical weights is currently being done.
The corresponding programs have been written and tested (136 lines of code in Python).
We determine the most reliable chunk segmentation between two languages by keeping the same average number of chunks for each sentence over the entire corpus.
We are currently producing the data.
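The alignment of chunks by lexical weights can be sketched with an IBM-model-1 style score. This is one common way of computing lexical weights, not necessarily the exact scheme used here, and the probability table `t_prob` is a hypothetical input (e.g. estimated from word alignments):

```python
def lexical_weight(src_chunk, tgt_chunk, t_prob):
    """Score a candidate chunk pair: for each source word, average its
    translation probabilities over the target words, then multiply.
    t_prob maps (source_word, target_word) to a probability."""
    src, tgt = src_chunk.split(), tgt_chunk.split()
    score = 1.0
    for s in src:
        score *= sum(t_prob.get((s, t), 0.0) for t in tgt) / len(tgt)
    return score

def best_alignment(src_chunks, tgt_chunks, t_prob):
    """Greedily pair each source chunk with its best-scoring target chunk."""
    return [max(tgt_chunks, key=lambda t: lexical_weight(s, t, t_prob))
            for s in src_chunks]
```

For example, with a table giving t("le", "the") = 0.8 and t("chat", "cat") = 0.9, the French chunk "le chat" is aligned to "the cat" rather than "the dog".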

1.3. Regarding language models, trigrams, and analogy, related research will be reported at the French Natural Language Processing Annual Conference on a new smoothing scheme for trigrams. This technique has been shown to beat even Kneser-Ney smoothing on relatively small corpora: 300,000 lines from the Europarl corpus in all 11 languages except Finnish.

2. The translation engine

2.1. The engine has been reimplemented from scratch in Python (511 lines of code).
Its main feature is the use of threads to allow concurrent computations of different kinds.
Each of the following tasks is performed in a different thread:
- generation of analogy equations,
- resolution of analogical equations,
- transfer from source language into target language, and
- linking between source text and translation.
This allows a clearer design.
Work on the design is still in progress.
In particular, the use of UML diagrams for class design allowed us to improve the code.
The engine is now in its 3rd version.
Two students are still working on the design of the engine through UML diagrams.
Their part-time salaries were charged to this budget.
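The threaded design described in 2.1 can be sketched as a pipeline of stages connected by queues, one thread per stage. The stage functions below are hypothetical placeholders for the engine's actual tasks, and the final linking of source chunks to translations is done here by pairing them in the main thread:

```python
import queue
import threading

def pipeline_stage(fn, inbox, outbox):
    """Apply fn to every item from inbox and pass results downstream;
    None is the end-of-stream signal, forwarded as-is."""
    while True:
        item = inbox.get()
        if item is None:
            outbox.put(None)
            break
        outbox.put(fn(item))

# Placeholder stage functions standing in for the engine's tasks.
def generate_equations(chunk): return f"eq({chunk})"
def solve_equation(eq):        return f"sol({eq})"
def transfer(sol):             return f"tgt({sol})"

def translate(chunks):
    """Chain the stages with queues, one thread per stage, then link
    each source chunk with its translation."""
    q = [queue.Queue() for _ in range(4)]
    stages = [generate_equations, solve_equation, transfer]
    threads = [threading.Thread(target=pipeline_stage, args=(f, q[i], q[i + 1]))
               for i, f in enumerate(stages)]
    for t in threads:
        t.start()
    for c in chunks:
        q[0].put(c)
    q[0].put(None)
    results = []
    while (item := q[3].get()) is not None:
        results.append(item)
    for t in threads:
        t.join()
    return list(zip(chunks, results))
```

Because each queue is consumed by a single thread, the order of chunks is preserved end to end, which keeps the source-to-translation linking trivial.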

2.2. The resolution of analogical equations, implemented as a C library, has been integrated into the Python translation engine using SWIG.
The same has been done for the efficient computation of distance or similarity between strings.
The use of the C library leads to a speed-up of 5 to 10 times, measured on small examples from formal language theory (translation of the context-free language a^n b^n into the regular language (ab)^n).
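For reference, the string-distance computation that the C library accelerates corresponds to the standard dynamic-programming edit distance, sketched here in pure Python (this is an illustration of the computation, not the C code itself):

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b by dynamic
    programming: O(len(a) * len(b)) time, one row of memory."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb))) # substitution
        prev = curr
    return prev[-1]
```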

3. The validation part of the work is ongoing.
The production of chunk alignments for all pairs of the 11 languages of the Europarl corpus is currently under way.
The next step will be a systematic assessment of translation by analogy of the chunks in each of these language pairs, using the standard scripts for assessment with various translation quality metrics: WER, BLEU, NIST, and TER.

4. The release of the translation engine on the example-based machine translation web site is unfortunately not yet possible. We hope it will become possible in the next few months.