Surveying word boundary factor in Chinese-Vietnamese statistical machine translation

  • Trần Thanh Phước
  • Đinh Điền

Abstract

In isolating languages such as Chinese and Vietnamese, word boundaries are not marked by spaces, and a word can consist of one or more syllables. Whether to segment words before training and translation is therefore a question that needs to be considered. In this paper, we survey the effect of the word boundary factor on the translation results of Chinese-Vietnamese statistical machine translation (SMT). The experimental results will serve as a basis for improving word segmentation in future research, with the aim of increasing machine translation performance. We ran two experiments, word segmentation (WS) and word un-segmentation (WUS), on corpora of 8,000 and 12,000 sentence pairs. Based on the experimental results, we found that both the WS corpus and the WUS corpus have their own advantages and drawbacks. We propose integrating the advantages of the two methods in SMT.
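
To make the WS/WUS distinction concrete, the following is a minimal sketch of how the same Vietnamese sentence might be tokenized under each setting. It is an illustration only: the toy lexicon, the function names, and the greedy longest-match strategy are assumptions made here and are not the segmentation method used in the paper.

```python
# Toy lexicon of two-syllable Vietnamese words (hypothetical, for illustration only).
LEXICON = {("học", "sinh"), ("sinh", "viên"), ("đại", "học")}
MAX_WORD_LEN = 2  # longest lexicon entry, counted in syllables


def unsegmented(sentence):
    """WUS setting: every space-separated syllable is its own token."""
    return sentence.split()


def segmented(sentence):
    """WS setting: greedy longest match against LEXICON.

    Syllables that form a lexicon word are joined with '_';
    anything else is kept as a single-syllable token.
    """
    syllables = sentence.split()
    tokens = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first, falling back to one syllable.
        for n in range(min(MAX_WORD_LEN, len(syllables) - i), 0, -1):
            chunk = tuple(syllables[i:i + n])
            if n == 1 or chunk in LEXICON:
                tokens.append("_".join(chunk))
                i += n
                break
    return tokens


sentence = "tôi là sinh viên đại học"
print(unsegmented(sentence))
print(segmented(sentence))
```

In the WUS output every syllable is a separate token, while in the WS output multi-syllable words become single tokens, so the two settings present the translation model with different vocabularies and sentence lengths.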

Published
2016-04-13
Section
ARTICLES