A Statistical Study of Character and Word Attributes Based on the Dunhuang Ancient Tibetan Corpus

三智多杰; 祁坤钰; 久仙加

Article Abstract

三智多杰, 祁坤钰, 久仙加. A Statistical Study of Word Attributes Based on the Dunhuang Ancient Tibetan Corpus [J]. 情报工程, 2023, 9(2): 117-127.

A Statistical Study of Word Attributes Based on the Dunhuang Ancient Tibetan Corpus

DOI: 10.3772/j.issn.2095-915X.2023.02.011

Keywords: Dunhuang ancient Tibetan literature; ancient Tibetan corpus; character statistics; comparison between ancient and modern Tibetan

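Funding: National Natural Science Foundation of China project "Research on Han-Tibetan Cultural Exchange in the Tang Dynasty in Dunhuang Ancient Tibetan Documents" (Z21100); Fundamental Research Funds for the Central Universities project "Research on Tibetan Syntactic Treebank Construction and Syntactic Analysis Models" (31920190113); Gansu Province Outstanding Graduate Student "Innovation Star" project "Research on Character Frequency Statistics of the Dunhuang Tibetan Document Corpus in the Context of Big Data" (2022CXZX-186).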

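Author	Affiliation
三智多杰	1. Chinese National Information Technology Research Institute, Northwest Minzu University, Lanzhou 730030
祁坤钰	1. Chinese National Information Technology Research Institute, Northwest Minzu University, Lanzhou 730030
久仙加	2. Faculty of Chinese Language and Literature, Northwest Minzu University, Lanzhou 730030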


Abstract:

[Purpose/Significance] Character statistics for ancient Tibetan support machine translation and the rapid location of core content in massive text collections, and are therefore of considerable value to intelligence gathering. Existing Tibetan character statistics rely mainly on modern Tibetan corpora and have neglected ancient Tibetan corpora. [Method/Process] Drawing primarily on Dunhuang Tibetan documents, this paper constructs an annotated corpus of ancient Tibetan literature. On this basis, frequency-statistics software for ancient Tibetan is designed in Python and used to compare ancient and modern Tibetan in terms of vowels, consonants, and Tibetan syllable frequencies, among other aspects. [Results/Conclusion] The distribution characteristics of ancient Tibetan characters are summarized, with the aim of providing a reference for the construction of annotated ancient Tibetan corpora and for research on the features of the Tibetan script.
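
The abstract does not describe the internals of the authors' software, so the following is only a minimal sketch of how syllable, consonant, and vowel-sign frequencies might be counted in Python. It assumes Unicode-encoded Tibetan text, treats the tsheg and shad family of marks (U+0F0B to U+0F14) as syllable boundaries, and counts only the four common vowel signs plus the reversed i (U+0F80) that occurs in Dunhuang manuscripts. The names `count_units`, `relative`, `VOWEL_SIGNS`, and `SAMPLE` are hypothetical and not taken from the paper.

```python
import re
from collections import Counter

# Assumption: tsheg, shad and related marks (U+0F0B..U+0F14) plus
# whitespace are treated as syllable boundaries.
BOUNDARY = re.compile(r"[\u0f0b-\u0f14\s]+")

# Assumption: only these written vowel signs are counted; the inherent
# vowel "a" is unwritten and therefore not counted here.
VOWEL_SIGNS = {
    "\u0f72": "i",
    "\u0f74": "u",
    "\u0f7a": "e",
    "\u0f7c": "o",
    "\u0f80": "reversed i",
}


def count_units(text):
    """Return (syllable, consonant, vowel-sign) frequency Counters."""
    syllables = Counter(s for s in BOUNDARY.split(text) if s)
    consonants = Counter()
    vowels = Counter()
    for ch in text:
        cp = ord(ch)
        if 0x0F40 <= cp <= 0x0F6C:
            # Base consonant letters.
            consonants[ch] += 1
        elif 0x0F90 <= cp <= 0x0FB9:
            # Subjoined consonants sit 0x50 above their base letters,
            # so fold them onto the base form before counting.
            consonants[chr(cp - 0x50)] += 1
        elif ch in VOWEL_SIGNS:
            vowels[VOWEL_SIGNS[ch]] += 1
    return syllables, consonants, vowels


def relative(counter):
    """Convert raw counts to relative frequencies (empty input gives {})."""
    total = sum(counter.values())
    return {k: v / total for k, v in counter.items()} if total else {}


# Illustrative input only: bod / kyi / skad / bod, ending with a shad.
SAMPLE = (
    "\u0f56\u0f7c\u0f51\u0f0b"
    "\u0f40\u0fb1\u0f72\u0f0b"
    "\u0f66\u0f90\u0f51\u0f0b"
    "\u0f56\u0f7c\u0f51\u0f0d"
)

if __name__ == "__main__":
    syl, cons, vow = count_units(SAMPLE)
    print(syl.most_common())
    print(relative(vow))
```

Running the same function over an ancient and a modern corpus and comparing the `relative` outputs would give the kind of side-by-side frequency comparison the abstract mentions. Folding subjoined consonants onto their base letters is a design choice of this sketch; a study interested in stacking behavior would count them separately.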
