Comparative Research on Annotation Performance of Ancient Texts Based on UIE Series Models

Wang Qiyang; Liu Zhongbao

Article Abstract

Wang Qiyang, Liu Zhongbao. Comparative Research on Annotation Performance of Ancient Texts Based on UIE Series Models [J]. 情报工程, 2025, (6): 028-046

DOI：

Keywords: UIE Series Models; Ancient Texts; Automatic Annotation; Comparative Research

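Funding: Key Project of the National Social Science Fund of China, "Research on the Theory, Methods, and Pathways by Which Revitalizing Ancient Books Empowers Cultural Confidence and Self-Strengthening in the Era of Big Data" (23AZD047).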

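Author	Affiliation
Wang Qiyang	School of Information Science, Beijing Language and Culture University, Beijing 100083
Liu Zhongbao	School of Information Science, Beijing Language and Culture University, Beijing 100084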


Abstract:

[Objective/Significance] Annotation is the foundation of ancient text information processing, yet traditional manual annotation is time-consuming and laborious, so annotating ancient texts efficiently and accurately by automatic means has become a pressing problem. This study draws on the strengths of the UIE (Unified Information Extraction) series models in text information extraction, takes the characteristics of ancient texts into account, introduces techniques targeted at ancient texts, and examines how effective these models are at automatic annotation of ancient texts and how they differ from one another. [Methods/Process] The analysis centers on two tasks, entity relation extraction and event argument extraction, and compares automatic annotation performance on ancient texts in terms of precision, recall, and F1 score, with the aim of clarifying the relationship among model size, domain adaptability, and annotation quality. [Results/Conclusions] Experiments on the Twenty-Four Histories corpus show that the annotation performance of the UIE series models tends to improve as model size and training sample size increase. With a training set of around 2,500 examples, the F1 scores for entity relation extraction and event argument extraction both reach their best values, 72.08% and 70.57% respectively.
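The abstract reports precision, recall, and F1 for extraction tasks without stating how predictions are matched against the gold annotations. The sketch below shows one common convention, exact-match scoring over sets of extracted items; this is an assumption, not a description of the authors' evaluation code. The function name `precision_recall_f1` and the sample triples are hypothetical and are not drawn from the paper's corpus.

```python
from typing import Hashable, Iterable, Tuple


def precision_recall_f1(
    gold: Iterable[Hashable], predicted: Iterable[Hashable]
) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 over two collections of items."""
    gold_set, pred_set = set(gold), set(predicted)
    # A prediction counts as correct only if it matches a gold item exactly.
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical (head, relation, tail) triples for illustration only.
gold_triples = [
    ("Liu Bang", "founder_of", "Han"),
    ("Xiang Yu", "rival_of", "Liu Bang"),
    ("Xiao He", "minister_of", "Liu Bang"),
    ("Han Xin", "general_of", "Liu Bang"),
]
predicted_triples = [
    ("Liu Bang", "founder_of", "Han"),
    ("Xiang Yu", "rival_of", "Liu Bang"),
    ("Han Xin", "minister_of", "Liu Bang"),  # wrong relation label
]

p, r, f1 = precision_recall_f1(gold_triples, predicted_triples)
print(f"precision={p:.4f} recall={r:.4f} F1={f1:.4f}")
```

Here two of the three predicted triples match the four gold triples, so precision is 2/3, recall is 2/4, and F1 is 4/7. The same function applies to event argument extraction if each item is encoded as a hashable tuple such as (event type, role, argument text), again as an assumed encoding.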
