陈翀, 高欣妍, 黄红. 基于 BLSTM-CRF 的自举式术语识别方法研究[J]. 情报工程, 2023, 9(5): 97-111.
基于 BLSTM-CRF 的自举式术语识别方法研究
Research on a BLSTM-CRF-based Bootstrapping Approach to Terminology Recognition
|
DOI:10.3772/j.issn.2095-915X.2023.05.008 |
Keywords: Term Extraction; Bootstrapping; BLSTM-CRF Model; Performance Evaluation of Term Recognition; Term Quality Criteria
Funding: 2022 Open Fund Project of the Key Laboratory of Rich-media Digital Publishing Content Organization and Knowledge Service.
Authors and affiliations:
Chen Chong (陈翀): 1. School of Government, Beijing Normal University, Beijing 100875; 2. Key Laboratory of Rich-media Digital Publishing Content Organization and Knowledge Service, Beijing 100038
Gao Xinyan (高欣妍): 1. School of Government, Beijing Normal University, Beijing 100875
Huang Hong (黄红): 1. School of Government, Beijing Normal University, Beijing 100875
|
Abstract:
[Objective/Significance] Automatic recognition of high-quality domain terms is a problem of wide concern across many fields. A prominent difficulty is the shortage of annotated domain corpora, which limits the application of neural network models to domain term extraction. To address this problem, this paper proposes a BLSTM-CRF-based bootstrapping approach to domain term recognition. [Methods/Processes] First, a small set of seed terms is used to annotate the corpus and train a BLSTM-CRF model that identifies candidate terms. Selection criteria built on term quality features are then applied to pick out high-quality new terms from the candidates, and these terms are added to the annotation vocabulary for the next round of training. The corpus is relabeled and the model retrained iteratively until the number of newly added terms falls below a threshold or a specified number of iterations is reached. The paper also examines the efficiency of iterative training and the transferability of the approach to other domains: a model trained on a computer science corpus is applied to recognizing technical terms in the emerging domain of fusion publishing. [Limitations] The quantification of term quality features still needs to be optimized by combining multiple indicators; the improved learning mechanism does not introduce negative examples, and the iteration does not converge easily. [Results/Conclusions] Experiments on annotation volume and annotation-context richness show that iterating with newly added annotation data is effective. Taking the model obtained after 50 rounds of iterative training as an example, the F1 scores for recognized terms and for all annotated sequences on the computer science test corpus are 0.43 and 0.59, and the new-term rate is 0.79, all better than the baseline BLSTM-CRF and BERT-BLSTM-CRF models. This confirms that the proposed method has a low start-up cost and good domain adaptability, and can effectively alleviate the lack of training corpora in term recognition. In the evaluation of model transfer, the average accuracy of term recognition judged on sampled results is 87.7%, indicating the application potential of the transfer learning approach.
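To make the iterative procedure above concrete, the following Python sketch outlines the bootstrapping loop: seed terms label the corpus, a tagger is trained and used to propose candidates, candidates passing the quality criteria are added to the term set, and the process repeats until few new terms appear or a round limit is reached. It is a minimal illustration under assumed names, not the authors' implementation: annotate_bio, bootstrap_terms, train_fn, predict_fn, and passes_quality are hypothetical placeholders, and the BLSTM-CRF training and prediction steps are passed in as callables rather than implemented here.

```python
# Minimal sketch of the bootstrapping loop described in the abstract (assumed
# helper names; the real system trains a BLSTM-CRF tagger and filters candidates
# with the paper's term-quality criteria).
from typing import Callable, Iterable, List, Set, Tuple

Sentence = List[str]             # a tokenized sentence
Tagged = List[Tuple[str, str]]   # (token, BIO tag) pairs


def annotate_bio(sentences: Iterable[Sentence], terms: Set[str]) -> List[Tagged]:
    """Label sentences with B/I/O tags by longest-match lookup of known terms."""
    labeled = []
    for sent in sentences:
        tags = ["O"] * len(sent)
        i = 0
        while i < len(sent):
            span = 0
            for j in range(len(sent), i, -1):  # try the longest span first
                if "".join(sent[i:j]) in terms or " ".join(sent[i:j]) in terms:
                    span = j - i
                    break
            if span:
                tags[i] = "B-TERM"
                for k in range(i + 1, i + span):
                    tags[k] = "I-TERM"
                i += span
            else:
                i += 1
        labeled.append(list(zip(sent, tags)))
    return labeled


def bootstrap_terms(
    corpus: List[Sentence],
    seed_terms: Set[str],
    train_fn: Callable[[List[Tagged]], object],                # fits a tagger (e.g. BLSTM-CRF)
    predict_fn: Callable[[object, List[Sentence]], Set[str]],  # extracts candidate terms
    passes_quality: Callable[[str], bool],                     # stand-in for the quality criteria
    min_new_terms: int = 5,
    max_rounds: int = 50,
) -> Set[str]:
    """Grow the term set until few new terms are found or max_rounds is reached."""
    terms = set(seed_terms)
    for _ in range(max_rounds):
        labeled = annotate_bio(corpus, terms)   # re-label the corpus with the current terms
        model = train_fn(labeled)               # retrain on the enlarged annotation set
        candidates = predict_fn(model, corpus)  # recognize candidate terms
        new_terms = {t for t in candidates if t not in terms and passes_quality(t)}
        if len(new_terms) < min_new_terms:      # stopping condition from the abstract
            break
        terms |= new_terms
    return terms
```

In the paper's setting, train_fn and predict_fn would wrap a BLSTM-CRF sequence tagger and passes_quality would combine the term-quality features used for screening; the min_new_terms and max_rounds defaults here are illustrative only.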