Chinese Health Question Classification and Corpus Construction

Guo Haihong (郭海红); Li Jiao (李姣); Dai Tao (代涛)

Article Abstract

Guo Haihong, Li Jiao, Dai Tao. Chinese Health Question Classification and Corpus Construction [J]. 情报工程, 2016, 2(6): 039-049

Question Classification and Corpus Construction of Chinese Health

DOI: 10.3772/j.issn.2095-915X.2016.06.005

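Chinese keywords: 健康问句 (health questions), 问句分类 (question classification), 语料构建 (corpus construction), 公众健康 (public health), 信息需求 (information needs)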

English keywords: health questions, question classification, corpus building, consumer health, information needs

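Funding: This work was supported by the Chinese Academy of Medical Sciences under the Basic Scientific Research Fund for Central-level Public Welfare Research Institutes, project "Research on Chinese Consumer Health Question Classification and Health Information Needs Mining" (2016ZX330011), and by the National Social Science Fund of China, project "Research on the Construction of a Health Knowledge Organization System for Knowledge Services" (14BTQ032).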

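Author	Affiliation
Guo Haihong	Institute of Medical Information, Chinese Academy of Medical Sciences
Li Jiao	Institute of Medical Information, Chinese Academy of Medical Sciences
Dai Tao	Institute of Medical Information, Chinese Academy of Medical Sciences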

Chinese abstract (translated):

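This study aims to build a classification scheme for Chinese health questions and, by manually classifying and annotating hypertension-related health questions, to analyze the public's hypertension-related health information needs and to provide a corpus foundation for developing intelligent Chinese question answering systems on hypertension. Drawing on clinical question classifications and a layered model of context for consumer health information searching, we built a four-level topic classification scheme for Chinese health questions. Five annotators independently classified 2,000 questions sampled at random from nearly 100,000 hypertension-related questions collected from a Chinese health website, in order to refine the scheme and test its reliability, build an annotated corpus, and analyze the public's hypertension-related health information needs. Inter-rater reliability among the five annotators, who applied the scheme independently, was kappa = 0.63 at the fourth level, indicating reliable classification results, and the first-level categories reached high agreement (kappa = 0.82), slightly better than comparable international studies. Questions in the six first-level categories of treatment, diagnosis, healthy lifestyle, clinical findings / condition management, epidemiology, and choice of healthcare provider accounted for 48.1%, 23.8%, 11.9%, 5.2%, 9.0%, and 1.9% of the sample, respectively. The scheme can be used to organize large collections of health questions and so improve retrieval efficiency; the annotated sample questions can serve as a corpus for research on automatic classification of hypertension-related health questions; and the resulting topic distribution can help guide the construction of knowledge resources for health websites. In addition, the approach to building the classification scheme, the corpus annotation workflow, and the method for measuring inter-rater reliability that were designed and used here may serve as a reference for user question classification and corpus construction in the open domain and in other restricted domains.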

English abstract:

This study aimed to build a classification schema for Chinese health questions and to manually annotate hypertension-related health questions, so as to understand and specify users' hypertension-related information needs and to lay a corpus foundation for an intelligent Chinese question answering (QA) system on hypertension. We built a four-level classification schema for health questions based on taxonomies of generic clinical questions and a layered model of context for consumer health information searching. Five annotators independently and manually classified 2,000 questions randomly selected from nearly 100,000 hypertension-related messages posted on a Chinese health website, in order to refine the schema and test its reliability, to build an annotated corpus for Chinese health QA systems, and to analyze the hypertension-related information needs of health consumers. The kappa statistic for the five annotators, who applied the schema independently, was 0.63 at the fourth level, indicating "substantial" reliability, and reached "almost perfect" reliability (kappa = 0.82) at the first level, slightly better than similar international studies. Questions in the categories of treatment, diagnosis, healthy lifestyle, clinical findings / condition management, epidemiology, and health provider choice accounted for 48.1%, 23.8%, 11.9%, 5.2%, 9.0%, and 1.9% of the sample, respectively. The schema can help organize large collections of health questions to improve retrieval efficiency, the annotated questions can be used to train automatic topic classifiers for hypertension-related questions posted by health consumers, and the topic distribution can guide knowledge base construction for health websites.
In addition, the methods for building the question classification schema, the corpus annotation procedure, and the methods for evaluating inter-rater reliability designed in this research can serve as a reference for studies of user question classification and corpus building in the open domain and in other restricted domains.
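The abstract reports kappa values for five annotators at two levels of the schema but does not name the kappa variant. As a minimal sketch, assuming Fleiss' kappa (a common choice when more than two raters label every item), the agreement figures could be computed as below. The function names, the representation of a label as a tuple path through the hierarchy, and the toy labels are all hypothetical and are not taken from the paper. The "substantial" and "almost perfect" wording matches the Landis and Koch interpretation bands, which the second helper encodes.

```python
from collections import Counter


def truncate(ratings, level):
    """Cut each hierarchical label (a tuple path) down to its first `level` parts.

    Hypothetical helper: assumes labels look like ("treatment", "drug", ...).
    """
    return [[label[:level] for label in item] for item in ratings]


def fleiss_kappa(ratings):
    """Fleiss' kappa for `ratings`: one list of labels per item, one label per rater."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    counts = [Counter(item) for item in ratings]

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(c * c for c in cnt.values()) - n_raters) / (n_raters * (n_raters - 1))
        for cnt in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement: sum of squared overall category proportions.
    totals = Counter()
    for cnt in counts:
        totals.update(cnt)
    p_chance = sum((n / (n_items * n_raters)) ** 2 for n in totals.values())

    if p_chance == 1.0:
        # Only one category was ever used; kappa is undefined, report full agreement.
        return 1.0
    return (p_observed - p_chance) / (1.0 - p_chance)


def landis_koch(kappa):
    """Map a kappa value to the Landis and Koch (1977) agreement band."""
    if kappa < 0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Toy data (invented): 2 questions, 5 annotators, two-part labels.
toy = [
    [("treatment", "drug")] * 4 + [("treatment", "diet")],
    [("diagnosis", "test")] * 5,
]
first_level_kappa = fleiss_kappa(truncate(toy, 1))
full_label_kappa = fleiss_kappa(toy)
```

In this toy data the annotators agree perfectly on first-level categories but not on the full label, which mirrors the pattern in the abstract: agreement is higher at the first level (0.82) than at the fourth (0.63), since finer categories leave more room for disagreement.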
