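Title number: 2015A-063 Date: 2016/03/31
Research topic: Reducing the development time of statistical machine translation systems: a study of sampling methods
Researcher affiliation (at the time) Position Name
(Principal investigator) Faculty of Science and Engineering, Graduate School of Information, Production and Systems, Professor, Yves Lepage
Summary of research results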
Background: training a statistical machine translation (SMT) system is time-consuming. In 2013, a fast alignment method (Fast_align) was proposed for the probabilistic approach. It is 10 times as fast as the standard method (GIZA++).

Goal: this research project aimed to reduce the training time of SMT systems for the associative approach, 1/ in word-to-word association (Anymalign) and 2/ in hierarchical sub-sentential alignment (Cutnalign), while increasing translation accuracy.

Method: 1/ for word-to-word association, we studied two improvements in sampling: a/ sampling given knowledge of the test set to produce ad-hoc translation tables (two different techniques to estimate inverse translation probabilities were studied); b/ building and sampling sub-corpora according to whether a word is a hapax or not. 2/ For sub-sentential alignment, we accelerated segmentation decisions and reduced the search space. Core components were re-implemented in C, and we introduced multi-processing.
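
Illustration: the idea in 1/ b/ can be sketched as follows. This is only a minimal sketch under stated assumptions, not the project's implementation: the summary does not say how sub-corpora are built, so the function names, the use of the source side only, whitespace tokenization, and the two-way split are all hypothetical choices made for the example.

```python
import random
from collections import Counter


def find_hapaxes(sentences):
    """Return the set of words that occur exactly once in the given sentences."""
    counts = Counter(word for sentence in sentences for word in sentence.split())
    return {word for word, n in counts.items() if n == 1}


def split_by_hapax(sentence_pairs):
    """Split (source, target) pairs into two sub-corpora.

    Assumption for this sketch: a pair goes to the first sub-corpus when its
    source sentence contains at least one hapax, otherwise to the second.
    """
    hapaxes = find_hapaxes(src for src, _ in sentence_pairs)
    with_hapax, without_hapax = [], []
    for src, tgt in sentence_pairs:
        if hapaxes.intersection(src.split()):
            with_hapax.append((src, tgt))
        else:
            without_hapax.append((src, tgt))
    return with_hapax, without_hapax


def sample_subcorpus(pairs, size, rng):
    """Draw a random sample of at most `size` pairs, without replacement."""
    return rng.sample(pairs, min(size, len(pairs)))


# Toy Spanish-Portuguese corpus, invented for the example.
corpus = [
    ("el gato negro", "o gato preto"),
    ("el gato", "o gato"),
    ("el perro", "o cão"),
]
rare, frequent = split_by_hapax(corpus)
sample = sample_subcorpus(rare, 1, random.Random(0))
```

In this toy corpus, "negro" and "perro" each occur once on the source side, so the first and third pairs fall into the hapax sub-corpus and the second pair into the other one.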

Results: we report improvements in training time and translation accuracy on three language pairs: Spanish-Portuguese, French-English and Finnish-English. Compared with our previous methods, the improved methods increased translation accuracy by one confidence interval on average. Compared with Fast_align, they reach similar translation accuracy in the same or less training time on the two easiest language pairs.