基于多翻译引擎的汉语复述平行语料构建方法

王雅松; 刘明童; 马彬彬; 张玉洁; 徐金安; 陈钰枫

Article abstract

王雅松,刘明童,马彬彬,张玉洁,徐金安,陈钰枫.基于多翻译引擎的汉语复述平行语料构建方法[J].情报工程,2020,6(5):027-040

The Construction of Chinese Paraphrase Parallel Corpus Based on Multiple Translation Engines

DOI：10.3772/j.issn.2095-915X.2020.05.003

Keywords: paraphrase corpus construction; Chinese paraphrase phenomenon classification; paraphrase generation; multi-task learning; auto-encoding task

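Funding: National Natural Science Foundation of China (61876198, 61976015, 61370130, 61473294), Beijing Natural Science Foundation (4172047), and the International Science and Technology Cooperation Program of the Ministry of Science and Technology (K11F100010).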

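Author	Affiliation
王雅松	School of Computer and Information Technology, Beijing Jiaotong University
刘明童	School of Computer and Information Technology, Beijing Jiaotong University
马彬彬	School of Computer and Information Technology, Beijing Jiaotong University
张玉洁	School of Computer and Information Technology, Beijing Jiaotong University
徐金安	School of Computer and Information Technology, Beijing Jiaotong University
陈钰枫	School of Computer and Information Technology, Beijing Jiaotong University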

Abstract:

Paraphrases are different expressions of the same meaning within one language, and paraphrase generation is the task of converting between such expressions. It is an indispensable basic technology for improving natural language processing tasks such as information retrieval, machine translation, and automatic question answering. At present, the performance of paraphrase generation models depends on large parallel paraphrase corpora, yet many languages have no available paraphrase resources, which prevents research on paraphrase generation from being carried out. To address this scarcity, we take Chinese as the object of study and propose a method for constructing a Chinese paraphrase parallel corpus based on multiple translation engines. The method transfers an English paraphrase parallel corpus to Chinese to build a large-scale, high-quality Chinese paraphrase parallel corpus. We also construct a Chinese paraphrase evaluation dataset with multiple reference paraphrases, providing basic data for research on Chinese paraphrase generation. Based on the constructed corpus, we further summarize and categorize Chinese paraphrase phenomena and study paraphrase generation. We build a Chinese paraphrase generation model on the neural encoder-decoder framework, adopting the attention, copy, and coverage mechanisms to address out-of-vocabulary words and repeated generation. To alleviate the low performance of neural paraphrase generation models caused by insufficient paraphrase data, we introduce a multi-task learning framework and design a Chinese paraphrase generation model that is trained jointly with an auto-encoding task; the joint learning strengthens the encoder's ability to learn semantic representations and improves generation quality. In Chinese paraphrase generation experiments, the joint model achieved good performance both on the evaluation metrics ROUGE-1, ROUGE-2, BLEU, and METEOR and in the analysis of generated Chinese paraphrase examples. The experimental results show that the constructed corpus can effectively train paraphrase generation models to produce high-quality Chinese paraphrases, and that joint training with the auto-encoding task further improves the quality of Chinese paraphrase generation.
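
The idea of transferring an English paraphrase corpus to Chinese through several translation engines can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the abstract does not specify how engines are combined or how pairs are filtered, so the function `transfer_pairs`, the helper `make_toy_engine`, the pairing of every engine with every other engine, and the rule of discarding identical or duplicate pairs are all assumptions made for illustration. The "engines" are hypothetical lookup tables standing in for real translation services.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# A translation engine is modelled as any function from English text to Chinese text.
Translator = Callable[[str], str]


def make_toy_engine(table: Dict[str, str]) -> Translator:
    """Build a hypothetical engine from a lookup table (stand-in for a real service)."""
    return lambda sentence: table.get(sentence, "")


def transfer_pairs(
    english_pairs: Iterable[Tuple[str, str]],
    engines: List[Translator],
) -> List[Tuple[str, str]]:
    """Translate both sides of each English paraphrase pair with every
    combination of engines and keep the resulting Chinese pairs.

    Assumed filtering (not taken from the paper): drop pairs with an empty
    side, pairs whose two sides are identical, and exact duplicates.
    """
    chinese_pairs: List[Tuple[str, str]] = []
    seen = set()
    for source, paraphrase in english_pairs:
        for engine_a in engines:
            for engine_b in engines:
                zh_source = engine_a(source).strip()
                zh_paraphrase = engine_b(paraphrase).strip()
                if not zh_source or not zh_paraphrase:
                    continue  # an engine failed to translate one side
                if zh_source == zh_paraphrase:
                    continue  # identical sentences are not a useful paraphrase pair
                pair = (zh_source, zh_paraphrase)
                if pair in seen:
                    continue  # skip exact duplicates produced by different engines
                seen.add(pair)
                chinese_pairs.append(pair)
    return chinese_pairs


# Toy data: two hypothetical engines that disagree on one sentence.
engine_one = make_toy_engine({
    "How old are you?": "你多大了？",
    "What is your age?": "你的年龄是多少？",
})
engine_two = make_toy_engine({
    "How old are you?": "你几岁了？",
    "What is your age?": "你的年龄是多少？",
})

toy_pairs = transfer_pairs(
    [("How old are you?", "What is your age?")],
    [engine_one, engine_two],
)
for zh_source, zh_paraphrase in toy_pairs:
    print(zh_source, "->", zh_paraphrase)
```

In this toy setting, disagreement between engines on the same English sentence yields additional Chinese variants, which is one plausible way that several engines could also supply multiple reference paraphrases for an evaluation set.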
