Special Research Project Report, Print Version (Print out of Special Research Projects)

Title number: 2015A-063 Date: 2016/03/31

Research project: Reducing the development time of statistical machine translation systems: a study of sampling methods

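	Researcher affiliation (at the time)	Position	Name
(Principal investigator)	Faculty of Science and Engineering, Graduate School of Information, Production and Systems	Professor	Yves Lepage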

Summary of research results: Background: training a statistical machine translation (SMT) system is time-consuming. In 2013, a fast alignment method (Fast_align) was proposed for the probabilistic approach. It is 10 times as fast as the standard method (GIZA++).

Goal: this research project aimed to reduce the training time of SMT systems built with the associative approach, 1/ in word-to-word association (Anymalign) and 2/ in hierarchical sub-sentential alignment (Cutnalign), while increasing translation accuracy.

Method: 1/ for word-to-word association, we studied two improvements in sampling: a/ sampling with knowledge of the test set to produce ad-hoc translation tables, for which we studied two different techniques to estimate inverse translation probabilities; b/ building and sampling sub-corpora according to whether a word is a hapax or not. 2/ For sub-sentential alignment, we accelerated segmentation decisions and reduced the search space. Core components were re-implemented in C and we introduced multi-processing.
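
Illustration: the hapax criterion in 1/ b/ can be pictured with a minimal Python sketch. It is not the project's implementation. It assumes a monolingual, whitespace-tokenized corpus, and it assumes that sub-corpora are formed by separating sentences that contain at least one hapax (a word occurring exactly once in the corpus) from those that contain none. The function names `find_hapaxes` and `split_by_hapax` are hypothetical and do not come from Anymalign.

```python
from collections import Counter


def find_hapaxes(sentences):
    """Return the set of words that occur exactly once in the whole corpus."""
    counts = Counter(word for sentence in sentences for word in sentence.split())
    return {word for word, n in counts.items() if n == 1}


def split_by_hapax(sentences):
    """Partition sentences by whether they contain at least one hapax.

    The partition criterion is an assumption made for illustration only.
    """
    hapaxes = find_hapaxes(sentences)
    with_hapax, without_hapax = [], []
    for sentence in sentences:
        if any(word in hapaxes for word in sentence.split()):
            with_hapax.append(sentence)
        else:
            without_hapax.append(sentence)
    return with_hapax, without_hapax


# Toy corpus: "dog" and "eats" each occur once, so they are hapaxes.
corpus = [
    "the cat sleeps",
    "the dog sleeps",
    "the cat eats",
]
with_hapax, without_hapax = split_by_hapax(corpus)
```

On this toy corpus, the second and third sentences land in `with_hapax` and the first in `without_hapax`. A real system would operate on a parallel corpus and would sample from the resulting sub-corpora, which this sketch leaves out.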

Results: we report improvements in training time and translation accuracy on three language pairs: Spanish-Portuguese, French-English and Finnish-English. Compared to our previous methods, our improved methods increased translation accuracy by one confidence interval on average. Compared with Fast_align, they reach similar translation accuracy with the same or lower training times on the two easiest language pairs.