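Xie Linlei, Xiang Yi, Zhang Chengzhi. Research on Future Working Sentences Mining of Academic Papers for Frontier Topics Discovery of Integrated Publishing [J]. Technology Intelligence Engineering, 2023, 9(5): 123-138 |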
Research on Future Working Sentences Mining of Academic Papers for Frontier Topics Discovery of Integrated Publishing |
|
DOI:10.3772/j.issn.2095-915X.2023.05.010 |
Keywords: Integrated Publishing; Future Work Sentence; Machine Learning; Text Classification; Content Analysis |
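Funding: 2022 Open Fund Project of the Key Laboratory of Rich-media Knowledge Organization and Service of Digital Publishing Content, "Research on Frontier Technology Discovery for Integrated Publishing" (zd2022-10/02). |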
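Author | Affiliation |
Xie Linlei | 1. Department of Information Management, School of Economics and Management, Nanjing University of Science and Technology, Nanjing 210094 |
Xiang Yi | 1. Department of Information Management, School of Economics and Management, Nanjing University of Science and Technology, Nanjing 210094 |
Zhang Chengzhi | 1. Department of Information Management, School of Economics and Management, Nanjing University of Science and Technology, Nanjing 210094; 2. Key Laboratory of Rich-media Knowledge Organization and Service of Digital Publishing Content, Beijing 100038 |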
Abstract: |
[Objective/Significance] In recent years, the continuing convergence of traditional and digital publishing has produced an emerging paradigm known as integrated publishing. Grasping the future research trends of this field scientifically and accurately is of considerable research value. Sentences in academic papers that describe future research work ("future work sentences" for short) can help predict frontier topics that may emerge and can also guide researchers, especially beginners, in choosing research topics. [Methods/Process] Future work sentences in papers from the integrated publishing field are manually annotated and categorized to build a corpus for future work sentence recognition and classification. On this basis, three models (support vector machine, naive Bayes, and random forest) are combined with the SelectKBest feature selection method to train automatic recognition models for future work sentences. [Results/Conclusions] LinearSVC performs best on the automatic recognition task, reaching a weighted F1 of 92.08%. In addition, the paper analyzes the content and categories of the future work sentences in the classification corpus, obtaining the category distribution of future work sentences in the integrated publishing field and the patterns of change in that distribution. |
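The recognition setup described in the abstract can be illustrated with a minimal scikit-learn sketch. It is not the authors' implementation: the abstract names only the three model families, `SelectKBest`, `LinearSVC`, and the weighted F1 metric. The TF-IDF features, the chi-squared scoring function, the value of `k`, the helper names `build_pipeline` and `evaluate`, and the toy English sentences are all assumptions made for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy, made-up sentences (NOT the paper's corpus). Label 1 = future work sentence.
TRAIN = [
    ("In future work we will extend the model to other domains.", 1),
    ("Future research should examine larger datasets.", 1),
    ("We plan to explore deep learning methods in the future.", 1),
    ("Further studies will investigate user behavior.", 1),
    ("The experiment used a corpus of journal articles.", 0),
    ("Results show that the classifier performs well.", 0),
    ("Table 2 lists the features used in this study.", 0),
    ("We collected papers published between 2010 and 2020.", 0),
]
TEST = [
    ("Future work will explore additional feature sets.", 1),
    ("We will investigate other publishing domains in the future.", 1),
    ("The corpus contains journal articles on digital publishing.", 0),
    ("Results show the features used by the model.", 0),
]


def build_pipeline(classifier, k=10):
    """TF-IDF features -> top-k chi-squared features -> classifier.

    TF-IDF and chi2 are assumptions; the abstract only names SelectKBest.
    """
    return Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("select", SelectKBest(chi2, k=k)),
        ("clf", classifier),
    ])


def evaluate(train, test, k=10):
    """Fit each of the three model families and return weighted F1 per model."""
    x_train, y_train = map(list, zip(*train))
    x_test, y_test = map(list, zip(*test))
    models = {
        "LinearSVC": LinearSVC(),
        "MultinomialNB": MultinomialNB(),
        "RandomForest": RandomForestClassifier(n_estimators=50, random_state=0),
    }
    scores = {}
    for name, clf in models.items():
        pipe = build_pipeline(clf, k).fit(x_train, y_train)
        pred = pipe.predict(x_test)
        # "weighted" averages per-class F1, weighting each class by its support.
        scores[name] = f1_score(y_test, pred, average="weighted", zero_division=0)
    return scores


if __name__ == "__main__":
    for name, score in evaluate(TRAIN, TEST).items():
        print(f"{name}: weighted F1 = {score:.4f}")
```

Scores from this toy data say nothing about the paper's reported 92.08%. The sketch only shows how feature selection and the three classifiers could be chained and compared on the same metric.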