Research on Denoising Method of Training Corpus for Large-scale Text Classification

Gao Xiong (高雄); Han Hongqi (韩红旗); Wang Li (王力); Xue Shan (薛陕)

Article Abstract

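Gao Xiong, Han Hongqi, Wang Li, Xue Shan. Research on Denoising Method of Training Corpus for Large-scale Text Classification [J]. Technology Intelligence Engineering (情报工程), 2021, 7(4): 117-126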

Research on Denoising Method of Training Corpus for Large-scale Text Classification

DOI：10.3772/j.issn.2095-915X.2021.04.010

Chinese keywords: text classification; denoising; word embedding

English keywords: text classification; denoising; word embedding

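Funding: Innovation Research Fund Youth Project of the Institute of Scientific and Technical Information of China, "Research on Denoising Training Corpora for Large-scale Text Classification" (QN2020-10); China Knowledge Centre for Engineering Sciences and Technology (CKCEST) construction project "Knowledge Organization System Construction" (CKCEST-2021-2-6).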

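Author	Affiliation
Gao Xiong (高雄)	1. Institute of Scientific and Technical Information of China, Beijing 100038; 2. Key Laboratory of Rich-media Knowledge Organization and Service of Digital Publishing Content, Beijing 100038
Han Hongqi (韩红旗)	1. Institute of Scientific and Technical Information of China, Beijing 100038; 2. Key Laboratory of Rich-media Knowledge Organization and Service of Digital Publishing Content, Beijing 100038
Wang Li (王力)	1. Institute of Scientific and Technical Information of China, Beijing 100038; 2. Key Laboratory of Rich-media Knowledge Organization and Service of Digital Publishing Content, Beijing 100038
Xue Shan (薛陕)	1. Institute of Scientific and Technical Information of China, Beijing 100038; 2. Key Laboratory of Rich-media Knowledge Organization and Service of Digital Publishing Content, Beijing 100038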

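Chinese abstract (translated):

[Purpose/Significance] The quality of the training corpus is critical to mainstream text classification algorithms. Eliminating noise, especially "out-of-category noise", helps raise the quality of the training corpus and, in turn, the accuracy of text classification algorithms. [Method/Process] This paper focuses on using semantic information to eliminate "out-of-category noise". A "category - category keyword" knowledge base is built from the training corpus of each category, and word embeddings are used to compare semantic information automatically and judge whether a category contains noise. The method outputs a candidate list of "out-of-category noise" categories and a candidate list of documents, and the noise is finally eliminated through human-computer interaction. [Results/Conclusions] The proposed denoising method can effectively detect and eliminate noisy data in the training corpus for large-scale text classification and improve the quality of the training corpus.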

English abstract:

[Objective/Significance] The quality of the training corpus is critical to mainstream text classification algorithms. Removing noise, especially “out-of-category noise”, improves the quality of the training corpus and thereby the accuracy of text classification algorithms. [Methods/Process] This paper focuses on using semantic information to eliminate “out-of-category noise”. It builds a “category-category keywords” knowledge base from the training corpus of each category and uses “word embedding” to automatically compare semantic information and determine whether a category contains noise. It then produces candidate lists of noise categories and of noisy documents, and the noise is finally eliminated through human-computer interaction. [Results/Conclusions] The denoising method proposed in this paper can effectively detect and eliminate noisy data in the training corpus for large-scale text classification and improve the quality of the training corpus.
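The abstract does not give implementation details, so the following is only a minimal sketch of the general idea it describes: compare each document with the keyword embeddings of its labelled category and list low-similarity documents for human review. The toy embeddings, the keyword lists, the `noise_candidates` function, and the 0.8 threshold are all assumptions made for illustration, not the authors' actual data, parameters, or algorithm.

```python
import math

# Toy 3-dimensional "word embeddings"; the values are invented for illustration.
EMBEDDINGS = {
    "neural": [0.9, 0.1, 0.0],
    "network": [0.8, 0.2, 0.1],
    "training": [0.7, 0.3, 0.0],
    "protein": [0.0, 0.9, 0.2],
    "gene": [0.1, 0.8, 0.3],
    "cell": [0.1, 0.9, 0.1],
}

# Hypothetical "category -> category keywords" knowledge base.
CATEGORY_KEYWORDS = {
    "computer_science": ["neural", "network", "training"],
    "biology": ["protein", "gene", "cell"],
}


def mean_vector(words):
    """Average the embeddings of the words that have one; None if no word is known."""
    vectors = [EMBEDDINGS[w] for w in words if w in EMBEDDINGS]
    if not vectors:
        return None
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def noise_candidates(corpus, threshold=0.8):
    """Return (doc_id, labelled_category, closest_category) for suspicious documents.

    A document is a candidate when its similarity to the keyword centroid of its
    labelled category falls below the (assumed) threshold. The result is meant
    as a list for a human reviewer, not an automatic deletion list.
    """
    centroids = {c: mean_vector(kw) for c, kw in CATEGORY_KEYWORDS.items()}
    candidates = []
    for doc_id, label, tokens in corpus:
        doc_vec = mean_vector(tokens)
        if doc_vec is None:
            continue  # no known words, nothing to compare
        sims = {c: cosine(doc_vec, v) for c, v in centroids.items()}
        if sims[label] < threshold:
            closest = max(sims, key=sims.get)
            candidates.append((doc_id, label, closest))
    return candidates


corpus = [
    ("d1", "computer_science", ["neural", "network"]),
    ("d2", "computer_science", ["protein", "gene"]),  # likely mislabelled
    ("d3", "biology", ["cell", "gene"]),
]
print(noise_candidates(corpus))
```

In this toy corpus only `d2` is flagged, because its tokens sit far from the `computer_science` keyword centroid. In practice the embeddings would come from a trained word-embedding model, and the threshold would need tuning against reviewer feedback.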
