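The Research Frontier Identification by Integrating BERTopic and LLMs: Taking the U.S. NSF Funding in the Field of Artificial Intelligence as an Example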

Fan Xuhui; Mu Zhirui

Article Abstract

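Fan Xuhui, Mu Zhirui. The Research Frontier Identification by Integrating BERTopic and LLMs: Taking the U.S. NSF Funding in the Field of Artificial Intelligence as an Example [J]. Technology Intelligence Engineering (情报工程), 2025, 11(1): 018-028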

The Research Frontier Identification by Integrating BERTopic and LLMs: Taking the U.S. NSF Funding in the Field of Artificial Intelligence as an Example

DOI:

Keywords: Research Frontiers; Frontier Identification; BERTopic; Large Language Models; Topic Modeling; Research Funding

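Funding: China Knowledge Centre for Engineering Sciences and Technology project "Data Resource Collection and Processing" (CKCEST-2023-1-3); Chinese Academy of Engineering strategic research and consulting project "2024 Global Engineering Fronts Research" (2024-XBZD-20).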

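Author	Affiliation
Fan Xuhui	Center for Strategic Studies, Chinese Academy of Engineering, Beijing 100088
Mu Zhirui	Center for Strategic Studies, Chinese Academy of Engineering, Beijing 100088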

Abstract views: 1836

Full-text downloads: 2340

Abstract (translated from the Chinese):

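[Objective/Significance] Topic modeling for research frontier identification lacks semantically meaningful topic representations, topic naming based on keywords and manual judgment is subjective, and the content of topic-related documents is not taken into account. To address these problems, this study introduces large language models to semantically enhance the generated research topics, with the aim of improving the accuracy and objectivity of research frontier identification. [Methods/Processes] The study first reviews the concepts related to research frontiers and the main identification theories and methods. Then, using funded projects as the data source, it identifies topics with BERTopic and names them with a large language model, revealing the research topics implicit in the funded projects. It proposes an indicator system for measuring research frontiers and determines the indicator weights with the CRITIC objective weighting method. [Limitations] Some topic phrases generated by the large language model are semantically vague and are not expressed in domain-specific terminology, and the data used are limited to funded-project data. [Results/Conclusions] Taking NSF-funded research projects in artificial intelligence as an example, the study identifies research frontiers including robotics, machine learning algorithms, intelligent education, data management, and simulation. A comparison with U.S. artificial intelligence plans and with technology assessment and forecasting reports shows that the identified frontiers are reasonable and forward-looking to a certain degree.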
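
The CRITIC objective weighting step named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not list the indicators, so the sample values and the function name `critic_weights` are hypothetical. The sketch also assumes that all indicators are benefit-type (larger is better), that min-max normalisation is used, and that the sample standard deviation measures contrast. Each indicator's weight is proportional to its standard deviation multiplied by its total conflict with the other indicators (the sum of one minus the Pearson correlation).

```python
import math


def critic_weights(rows):
    """Return CRITIC weights for a list of samples.

    rows: list of samples, each a list of indicator values.
    Assumes every indicator is benefit-type and not constant.
    """
    n, m = len(rows), len(rows[0])
    cols = [[row[j] for row in rows] for j in range(m)]

    # Min-max normalise each indicator to [0, 1].
    norm = []
    for col in cols:
        lo, hi = min(col), max(col)
        if hi == lo:
            raise ValueError("constant indicator cannot be weighted")
        norm.append([(v - lo) / (hi - lo) for v in col])

    means = [sum(c) / n for c in norm]
    # Sample standard deviation: the contrast intensity of each indicator.
    stds = [
        math.sqrt(sum((v - mu) ** 2 for v in c) / (n - 1))
        for c, mu in zip(norm, means)
    ]

    def pearson(j, k):
        cov = sum(
            (a - means[j]) * (b - means[k]) for a, b in zip(norm[j], norm[k])
        ) / (n - 1)
        return cov / (stds[j] * stds[k])

    # Information content: contrast times conflict with all other indicators.
    info = [
        stds[j] * sum(1 - pearson(j, k) for k in range(m)) for j in range(m)
    ]
    total = sum(info)
    if total == 0:
        raise ValueError("all indicators are perfectly correlated")
    return [c / total for c in info]


# Hypothetical topic-level indicator values (one row per topic).
sample = [[1, 2, 3], [2, 1, 5], [3, 4, 4], [4, 3, 6]]
weights = critic_weights(sample)
```

The weights sum to one, and an indicator that varies more and correlates less with the others receives a larger weight, which is the property that makes the method "objective" in the sense used in the abstract.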

English abstract:

[Objective/Significance] The relationship between technology and the nodes of the publishing industry chain is of significant importance for constructing the technological spectrum of the publishing industry and monitoring its development. [Methods/Processes] This article designs the industrial chain and the technological spectrum for both traditional publishing and digital publishing. The design of the industrial technological spectrum includes six dimensions, among them industrial segments, industrial terms, technical terms, participating entities, and product services. After the entities of the technological spectrum in the publishing industry are obtained, relationship templates between entities are acquired using syntactic dependency analysis tools. Then, a relationship extraction model based on the quality of relationship templates is implemented using the Mean Teacher deep learning framework and a BiGRU+Attention neural network encoder. Furthermore, a semi-supervised deep learning method with partially manually annotated data is employed to train the relationship classification model based on relationship template classification. [Limitations] Future research is still needed on how to improve the accuracy of identifying relationship types in relationship templates and how to enhance model performance by improving deep learning model frameworks. [Results/Conclusions] Experimental results indicate that this model achieves 66% accuracy on actual corpus texts, and that categorizing templates can lead to a 1% increase in accuracy.
