Automatic Identification of Academic Abbreviations Based on a Maximum Entropy Model

Zhang Qiuzi; Lu Wei; Cheng Qikai; Huang Yong

Article Abstract

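Zhang Qiuzi, Lu Wei, Cheng Qikai, Huang Yong. Automatic Identification of Academic Abbreviations Based on a Maximum Entropy Model [J]. 情报工程, 2015, 1(2): 064-072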


Study on Automatic Identification of Academic Abbreviations and their Definitions based on Maximum Entropy Model

DOI: 10.3772/j.issn.2095-915X.2015.02.008

Chinese keywords: academic text, abbreviation, machine learning, sequence labelling, information extraction

English keywords: Academic texts, abbreviations/acronyms, machine learning, sequence labelling, information extraction

Funding: "Research on Modeling and Framework Implementation of General Entity Retrieval Based on Language Models" (Project No. 71173164)

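Author	Affiliation
Zhang Qiuzi	Center for Studies of Information Resources, Wuhan University
Lu Wei	Center for Studies of Information Resources, Wuhan University
Cheng Qikai	Center for Studies of Information Resources, Wuhan University
Huang Yong	Center for Studies of Information Resources, Wuhan University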

Abstract views: 4124

Full-text downloads: 3166

Chinese abstract:

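To identify abbreviations and their corresponding definitions in large volumes of English academic text, this paper proposes an automatic abbreviation identification algorithm, MELearn-AI. Building on a manually annotated dataset and treating the task as sequence labelling, the algorithm uses a maximum entropy model to identify abbreviations automatically in English academic texts from the computer science domain. On the "Paren-sen" evaluation dataset constructed in this paper, MELearn-AI achieves 95.8% precision and 86.3% recall, a clear improvement over the two comparison experiments. The proposed method achieves satisfactory results on computer science academic texts and helps readers better understand and use the terminology of the field.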

English abstract:

To identify abbreviations and their corresponding definitions in large volumes of English academic text, this paper proposes an automatic identification algorithm called MELearn-AI. Treating the task as sequence labelling, MELearn-AI trains a maximum entropy model on a manually labelled dataset and then uses the model to identify abbreviations in computer science academic texts. The method achieves 95.8% precision and 86.3% recall on the "Paren-sen" evaluation dataset created in this paper, a clear improvement over the other two algorithms. Tested on English academic texts in computer science, the algorithm achieves satisfactory results, which helps in better understanding and using the terminology of this field.
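As a rough illustration of what treating abbreviation identification as sequence labelling can look like, the sketch below tags each token of a sentence and builds a per-token feature dictionary of the kind a maximum entropy classifier could consume. The abstract does not state the tag set or features used by MELearn-AI, so the tags (`DEF`, `ABBR`, `O`), the helper names, the feature choices, and the example sentence are all assumptions made for illustration only.

```python
from typing import Dict, List

# Hypothetical tag set (an assumption, not taken from the paper):
# DEF = token of a definition, ABBR = abbreviation token, O = other.
DEF, ABBR, OTHER = "DEF", "ABBR", "O"


def tokenize(sentence: str) -> List[str]:
    """Whitespace tokenizer that makes parentheses separate tokens."""
    return sentence.replace("(", " ( ").replace(")", " ) ").split()


def label_tokens(tokens: List[str], definition: str, abbreviation: str) -> List[str]:
    """Turn a gold (definition, abbreviation) pair into one tag per token."""
    labels = [OTHER] * len(tokens)
    def_tokens = definition.split()
    n = len(def_tokens)
    # Tag the first occurrence of the definition phrase.
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == def_tokens:
            labels[i:i + n] = [DEF] * n
            break
    # Tag every remaining token that equals the abbreviation.
    for i, tok in enumerate(tokens):
        if tok == abbreviation and labels[i] == OTHER:
            labels[i] = ABBR
    return labels


def token_features(tokens: List[str], i: int) -> Dict[str, object]:
    """Illustrative features for token i (assumed, not the paper's feature set)."""
    tok = tokens[i]
    in_parens = (
        0 < i < len(tokens) - 1 and tokens[i - 1] == "(" and tokens[i + 1] == ")"
    )
    return {
        "lower": tok.lower(),
        "is_upper": tok.isupper(),
        "initial": tok[0].lower(),
        "in_parens": in_parens,
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
    }


# Made-up example sentence, used only to show the data layout.
sentence = "We train a maximum entropy (ME) model on labelled sentences."
tokens = tokenize(sentence)
labels = label_tokens(tokens, "maximum entropy", "ME")
features = [token_features(tokens, i) for i in range(len(tokens))]

for tok, tag in zip(tokens, labels):
    print(f"{tok}\t{tag}")
```

Each `(features[i], labels[i])` pair would then serve as one training instance for a maximum entropy classifier (multinomial logistic regression); training itself is omitted here to keep the sketch dependency-free.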
