Article Abstract
Wei Chao, Zhao Wei, Wang Yibo, Wu Xinyu. Policy Text Mining Method Based on Local Linear AutoEncoder [J]. 情报工程, 2025, 11(3): 027-040
Policy Text Mining Method Based on Local Linear AutoEncoder
  
DOI:
Keywords: Policy Text Mining; Autoencoder Networks; Local Linearity; Policy Clustering and Classification
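Funding: Key Work Project of the Institute of Scientific and Technical Information of China, "Construction of an Intelligent Intelligence Analysis Platform for Strategic Research" (ZD2024-01).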
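Author  Affiliation
Wei Chao (魏超)  Institute of Scientific and Technical Information of China, Beijing 100038
Zhao Wei (赵伟)  Institute of Scientific and Technical Information of China, Beijing 100038
Wang Yibo (王弋波)  Institute of Scientific and Technical Information of China, Beijing 100038
Wu Xinyu (吴欣雨)  Institute of Scientific and Technical Information of China, Beijing 100038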
Abstract:
      [Objective/Significance] As the volume of policy text grows rapidly and its structure becomes more complex, policy documents remain highly complex, strongly domain-specific, and dense with specialized terminology. Existing methods based on global feature representations fall short in capturing semantic detail and in generalizing across domains. [Methods/Process] Taking a local perspective, this paper proposes a policy text mining method based on a local linear autoencoder, which extracts locally smooth low-dimensional representations of policy texts to improve policy clustering and classification. First, the local neighbors of each policy text are determined by the cosine distance between BERT representations. The autoencoder is then trained by jointly minimizing the weighted mean squared error over those local neighbors and the reconstruction error. [Limitations] The method targets Chinese policy data, so its applicability to policy data in other languages is limited. [Results/Conclusions] Compared with the baseline methods, the proposed method reaches 85.78 with spectral clustering, an improvement of 10%, and raises the average result with the SVM classification algorithm to 86.88%, an improvement of 7.5%. The results indicate that the method yields more discriminative low-dimensional representations of policy texts and thereby improves policy clustering and classification.
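The two steps named in the abstract (neighbor selection by cosine distance, then a joint objective) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The abstract does not give the weighting scheme, the network architecture, or the balance between the two loss terms, so the distance-based weights, the single-layer tanh encoder, the linear decoder, and the `lam` coefficient below are all assumptions. Random vectors stand in for BERT representations.

```python
import numpy as np

def cosine_neighbors(X, k):
    """Indices and cosine distances of the k nearest neighbors of each row."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    dist = 1.0 - Xn @ Xn.T            # pairwise cosine distance
    np.fill_diagonal(dist, np.inf)    # a text is not its own neighbor
    idx = np.argsort(dist, axis=1)[:, :k]
    return idx, np.take_along_axis(dist, idx, axis=1)

def neighbor_weights(nbr_dist):
    """Assumed weighting: closer neighbors weigh more; each row sums to 1."""
    w = np.exp(-nbr_dist)
    return w / w.sum(axis=1, keepdims=True)

def joint_loss(X, W_enc, W_dec, idx, w, lam=1.0):
    """Reconstruction error plus weighted squared error against neighbors."""
    Z = np.tanh(X @ W_enc)            # low-dimensional representation (assumed encoder)
    X_hat = Z @ W_dec                 # reconstruction (assumed linear decoder)
    recon = np.mean(np.sum((X - X_hat) ** 2, axis=1))
    diff = X[idx] - X_hat[:, None, :]             # shape (n, k, d)
    local = np.mean(np.sum(w * np.sum(diff ** 2, axis=2), axis=1))
    return recon + lam * local        # lam is a hypothetical trade-off weight

# Toy data standing in for BERT embeddings of 20 policy texts.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 8))
idx, nbr_dist = cosine_neighbors(X, k=3)
w = neighbor_weights(nbr_dist)
W_enc = rng.normal(scale=0.1, size=(8, 2))
W_dec = rng.normal(scale=0.1, size=(2, 8))
loss = joint_loss(X, W_enc, W_dec, idx, w)
```

In practice `W_enc` and `W_dec` would be fitted by gradient descent on `joint_loss`. The sketch only evaluates the objective once to show how the two terms combine.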