表題番号:2015A-063
日付:2016/03/31
研究課題統計的機械翻訳システムの開発時間の減少:サンプリング手法の検討
研究者所属(当時) | 資格 | 氏名 | |
---|---|---|---|
(代表者) | 理工学術院 大学院情報生産システム研究科 | 教授 | ルパージュ イヴ |
- 研究成果概要
- Background: to train a statistical machine translation (SMT) system is time-consuming. In 2013, for the probabilistic approach, a fast alignment method (Fast_align) has been proposed. It is 10 times as fast as the standard method (GIZA++).
Goal: the present research project addressed the problem of reducing the training time of SMT systems for the associative approach 1/ in word-to-word associations (Anymalign) and 2/ in hierarchical sub-sentential alignment (Cutnalign), while increasing translation accuracy.
Method: 1/ for word-to-word association, we studied two improvements in sampling: a/ sampling given the knowledge of a test set to produce ad-hoc translation tables. Two different techniques to estimate inverse translation probabilities have been studied; b/ relying on whether a word is a hapax or not to build and sample sub-corpora. 2/ For sub-sentential alignment, we accelerated decisions in segmentation and reduced the search space. Core components have been re-implemented in C and we introduced multi-processing.
Results: we report improvements in time and translation accuracy using three different language pairs: Spanish-Portuguese, French-English and Finnish-English. Compared to our previous methods, our improved methods increased translation accuracy by one confidence interval in average. Compared with Fast_align, same or lower training times yield similar translation accuracy in the two easiest language pairs.