Research on the Application of Deep Learning to Domain Adaptation in Statistical Machine Translation

Ding Liang (丁亮); Yao Changqing (姚长青); He Yanqing (何彦青); Li Hui (李辉)

Article Abstract

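Ding Liang, Yao Changqing, He Yanqing, Li Hui. Application of Deep Learning in Statistical Machine Translation Domain Adaptation [J]. 情报工程 (Technology Intelligence Engineering), 2017, 3(3): 064-076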

Application of Deep Learning in Statistical Machine Translation Domain Adaptation

DOI: 10.3772/j.issn.2095-915X.2017.03.009

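Chinese keywords: 统计机器翻译 (statistical machine translation), 训练语料选取 (training corpus selection), 卷积神经网络 (convolutional neural network), 深度学习 (deep learning)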

English keywords: Statistical machine translation, training data selection, convolutional neural network, deep learning

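Funding: This work was supported by the National Natural Science Foundation of China (Grant Nos. 61303152, 71503240, and 71403257) and the Key Work Project of the Institute of Scientific and Technical Information of China (ZD2017-4).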

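Author	Affiliation
Ding Liang	Institute of Scientific and Technical Information of China; Key Laboratory of Rich-media Knowledge Organization and Service of Digital Publishing Content
Yao Changqing	Institute of Scientific and Technical Information of China; Key Laboratory of Rich-media Knowledge Organization and Service of Digital Publishing Content
He Yanqing	Institute of Scientific and Technical Information of China; Key Laboratory of Rich-media Knowledge Organization and Service of Digital Publishing Content
Li Hui	Beijing Institute of Science and Technology Information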

Chinese abstract (translated):

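Statistical machine translation often has to handle texts that come from diverse sources and inconsistent domains. To improve translation quality for texts from different domains, the training corpus must be filtered according to the text to be translated in order to achieve domain adaptation. Current domain adaptation methods for statistical machine translation take the target data as the reference and rely mainly on statistical techniques to adapt the training data or the translation model to the domain, but they lack explicit domain labels. Building on our group's earlier work, this study uses a convolutional neural network (CNN), a deep learning model, to model short texts. A suitable network structure is constructed for supervised learning to capture complete sentence-level semantic information. The training corpus is then classified and filtered according to the domain information of the text to be translated, yielding training data whose domain matches that text, and this data is used for statistical machine translation. Tests on Wanfang English abstracts with a statistical machine translation system show that using only part of the training data produces translations with a higher BLEU score than the full original training data, demonstrating the effectiveness and feasibility of the approach.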

English abstract:

Statistical machine translation often faces test data that come from diverse sources and span multiple domains. To improve translation quality for texts from different domains, the training corpus needs to be filtered according to the text to be translated so as to achieve domain adaptation. Current domain adaptation methods for statistical machine translation take the target texts as the reference and focus on selecting training data or adjusting translation models; they lack accurate, explicit domain labels for the texts or data. Building on our lab's previous research, this study models short texts with a Convolutional Neural Network (CNN) and constructs a suitable network structure for supervised learning in order to capture the semantic information of whole sentences. The training corpus is classified and filtered according to the domain information of the test corpus to obtain the portion of the training data that belongs to the same domain as the test data. We applied this method to an SMT system and tested it on English abstracts from Wanfang Data. The results show that using only part of the training data yields a higher BLEU score than using the original training data, which indicates that the method is effective and feasible.
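The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the trained CNN domain classifier is replaced here by a hypothetical keyword lookup (`toy_domain_classifier`) so that the example runs without a model, and the domain names, keyword sets, function names, and the choice to classify the source side of each sentence pair are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

SentencePair = Tuple[str, str]  # (source sentence, target sentence)

# Hypothetical stand-in for the trained CNN domain classifier.
# A real system would call the supervised CNN model here instead.
DOMAIN_KEYWORDS = {
    "medicine": {"patient", "clinical", "therapy"},
    "computing": {"algorithm", "network", "software"},
}


def toy_domain_classifier(sentence: str) -> str:
    """Return the domain whose keyword set overlaps the sentence most."""
    tokens = set(sentence.lower().split())
    scores = {d: len(tokens & kws) for d, kws in DOMAIN_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def dominant_domain(sentences: Sequence[str],
                    classify: Callable[[str], str]) -> str:
    """Most frequent predicted domain over the text to be translated."""
    if not sentences:
        raise ValueError("need at least one test sentence")
    counts = Counter(classify(s) for s in sentences)
    return counts.most_common(1)[0][0]


def select_training_pairs(corpus: Sequence[SentencePair],
                          test_sentences: Sequence[str],
                          classify: Callable[[str], str]) -> List[SentencePair]:
    """Keep only training pairs whose source side matches the test domain."""
    target = dominant_domain(test_sentences, classify)
    return [pair for pair in corpus if classify(pair[0]) == target]


# Toy data; the target strings are placeholders, not real translations.
corpus = [
    ("the patient received clinical therapy", "zh-1"),
    ("the algorithm trains a neural network", "zh-2"),
    ("a new therapy for each patient", "zh-3"),
]
test_sentences = ["clinical outcomes of the therapy", "the patient recovered"]

selected = select_training_pairs(corpus, test_sentences, toy_domain_classifier)
print(len(selected), "of", len(corpus), "training pairs kept")
```

Because `classify` is passed in as a callable, the keyword stand-in could be swapped for a trained classifier without changing the selection logic. The reduced corpus would then be used to train the translation model in place of the full one.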
