A Large-Scale Multi-Granularity Chinese Paraphrase Corpus

An Bo (安波)

Article Abstract

An Bo. A Large-Scale Multi-Granularity Chinese Paraphrase Corpus [J]. Technology Intelligence Engineering (情报工程), 2022, 8(2): 019-033

A Large-Scale Multi-Granularity Chinese Paraphrase Corpus

DOI: 10.3772/j.issn.2095-915X.2022.02.002

Keywords: Chinese paraphrase (中文复述); paraphrase detection (复述识别); paraphrase extraction (复述抽取)

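Funding: National Natural Science Foundation of China General Program, "Research on Key Technologies for Knowledge-Enhanced Chinese Paraphrase Recognition" (62076233); Chinese Academy of Social Sciences 2022 Innovation Project Young Scholars Funding Program (2022MZSQN001).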

Author	Affiliation
An Bo	Institute of Ethnology and Anthropology, Chinese Academy of Social Sciences, Beijing 100081

Abstract:

[Objective/Significance] Paraphrases are different expressions of the same meaning. They reflect the diversity of language and have long been a core problem in natural language processing (NLP). PPDB (the Paraphrase Database), an English paraphrase dataset, has been applied to many English NLP tasks and has advanced the field. The lack of a large-scale, multi-granularity Chinese paraphrase dataset hinders the application of paraphrasing technology in Chinese NLP and is a problem that urgently needs solving. [Methods/Process] This paper proposes and implements a Chinese paraphrase extraction system for multi-source data and uses it to construct a large-scale Chinese paraphrase corpus. The corpus is large and of high quality, and it contains paraphrases at three granularities: paraphrase phrases, paraphrase templates, and paraphrase sentences. [Results/Conclusions] Automatic and manual evaluation show that the extracted Chinese paraphrase data has high text diversity and semantic consistency.
