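ZHANG Wei, WANG Hao, DENG Sanhong, ZHANG Baolong. Sentiment Term Extraction and Application of Chinese Ancient Poetry Text for Digital Humanities [J]. Journal of Library Science in China, 2021, 47(4): 113-131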
Sentiment Term Extraction and Application of Chinese Ancient Poetry Text for Digital Humanities
面向数字人文的古诗文本情感术语抽取与应用研究
Received: August 2, 2020  Revised: November 7, 2020
Key words:Digital humanities  Ancient poetry  Sentiment term extraction  Chinese character linguistics featureChar2Vec  BERT
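Funding: This paper is one of the research outcomes of the National Natural Science Foundation of China General Program project "Semantic Parsing and Humanities Computing of China's Intangible Cultural Heritage Texts Driven by Linked Data" (No. 72074108) and the Fundamental Research Funds for the Central Universities project "Semantic Analysis and Knowledge Graph Research of Local Chronicle Texts for Humanities Computing" (No. 010814370113).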
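Author Name    Affiliation
ZHANG Wei    School of Information Management, Nanjing University, Nanjing, Jiangsu 210023
WANG Hao    School of Information Management, Nanjing University, Nanjing, Jiangsu 210023
DENG Sanhong    School of Information Management, Nanjing University, Nanjing, Jiangsu 210023
ZHANG Baolong    School of Information Management, Nanjing University, Nanjing, Jiangsu 210023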
Abstract:
In recent years, the application of digital technologies such as digital libraries, information visualization, multimedia publishing, and geographic information systems in the humanities has broadened the scope of digital humanities. Using key technologies to parse the sentiment knowledge in humanistic objects is of great significance for regaining the discipline's “humanity” and “computability”. Ancient poetry carries sentiment knowledge about political background, historical events, folk customs, and more, and occupies an extremely important position in literature. Existing studies mainly classify the overall sentiment tendency of ancient poetry automatically, while the mining of fine-grained sentiment knowledge remains insufficient. To achieve more accurate sentiment analysis of ancient poetry, this paper focuses on the automatic extraction of sentiment terms from the domain text and on its application.
Firstly, this paper introduces modern appreciation texts to extend the sentiment knowledge and humanistic connotations of the target ancient poems. This addresses the limited number of terms, coarse sentiment granularity, and insufficient learning of text features caused by the condensed language of ancient poetry. Secondly, a “cold start” automatic labeling method for character sequences is proposed to obtain the learning corpus. Thirdly, based on the character vector mapping (Char2Vec) of the BERT model, we explore how introducing Chinese character linguistic features affects sentiment term extraction with a CRFs model as the benchmark, compare it with the BERT-BiLSTM-CRFs model, and define a new-term recognition rule from the perspective of knowledge discovery. Finally, based on the resulting set of humanistic sentiment terms for ancient poetry, this paper explores digital applications in term retrieval, granularity mining, and poet portraits.
It was found that: 1) Integrating modern appreciation into ancient poetry significantly improves the breadth and depth of sentiment knowledge, and the domain terms were effectively labeled by the proposed method. 2) The trained BERT-BiLSTM-CRFs model outperformed the CRFs model, with the best F1 and F1_distinct reaching 95.63% and 85.43% respectively. The introduction of Chinese character features also improves the traditional CRFs, among which the domain feature and the radical constraint feature (“shuxinpang” and “xinzidi”) perform best. 3) Compared with the long new terms extracted by machine learning, deep learning yields more new imagery words that convey sentiment. These new terms are integrated with the domain terms to form 14,599 distinct terms, laying the foundation for sentiment dictionary construction and sentiment analysis in the field of ancient poetry.
The contribution of this paper lies in two aspects. The sentiment terms derived from poetry and appreciation provide a reference for sentiment analysis and knowledge services of literary information resources (humanity), and the extraction scheme based on linguistic knowledge provides inspiration for deepening natural language processing technology in the Chinese domain (computability). 11 figs. 6 tabs. 30 refs.
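The labeling and radical-feature steps summarized in the abstract can be illustrated with a minimal sketch. The abstract does not specify the tag scheme, the seed lexicon, or the matching strategy, so everything here is an assumption made for illustration: a B/I/O scheme, forward maximum matching against a tiny hypothetical lexicon (`SEED_TERMS`), and small hand-written character sets standing in for a real radical dictionary.

```python
# Illustrative only: lexicon, tag scheme and radical sets are assumptions,
# not the resources or rules used in the paper.
SEED_TERMS = {"思乡", "愁", "离别"}   # hypothetical seed sentiment terms

# Tiny hand-picked samples of characters carrying the two "heart" radicals.
SHUXINPANG = set("情恨忆怀惜怜悔")    # left-side heart radical (忄)
XINZIDI = set("思悲愁怨念恋感")       # bottom heart radical (心)


def bio_label(text, lexicon):
    """Label each character B/I/O by forward maximum matching on the lexicon."""
    labels = ["O"] * len(text)
    max_len = max(map(len, lexicon), default=0)
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + n] in lexicon:
                labels[i] = "B"
                for j in range(i + 1, i + n):
                    labels[j] = "I"
                i += n
                break
        else:
            i += 1  # no term starts here; leave the character as "O"
    return labels


def heart_radical_feature(ch):
    """Return a coarse radical feature for one character."""
    if ch in SHUXINPANG:
        return "shuxinpang"
    if ch in XINZIDI:
        return "xinzidi"
    return "none"


def to_crf_rows(text, lexicon):
    """Build (character, radical feature, label) rows, one per character."""
    labels = bio_label(text, lexicon)
    return [(ch, heart_radical_feature(ch), lab) for ch, lab in zip(text, labels)]


if __name__ == "__main__":
    for row in to_crf_rows("游子思乡愁", SEED_TERMS):
        print(*row, sep="\t")
```

Rows of this shape (one character per line with feature columns and a label) are a common input layout for CRF toolkits. A real pipeline would replace the sample sets with a full radical dictionary and a much larger lexicon.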
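Chinese abstract (translated):
      Under the interdisciplinary knowledge paradigm, the research scope of digital humanities keeps broadening as its disciplinary system expands. Using key semantic technologies to parse the humanistic connotations and sentiment knowledge in cultural objects is of great significance for regaining the discipline's “humanity” and “computability”. Taking ancient poetry text as an example, this paper achieves large-scale automatic extraction and analysis of humanistic sentiment terms from Chinese poems and their appreciations. First, in the absence of an annotated dataset, a “cold start” automatic labeling method for character sequences is proposed to obtain the learning corpus. Then, guided by character vectors (Char2Vec), Chinese character features (radical, pinyin, etc.) and the BERT language model are introduced into machine learning and deep learning models respectively, and a new-term recognition rule is defined from the perspective of knowledge discovery. The study finds that integrating modern appreciation into the original poems significantly improves the breadth and depth of sentiment knowledge, and that domain terms can be effectively labeled. The trained BERT-BiLSTM-CRFs deep learning model clearly outperforms the CRFs machine learning model, with the best F1 and F1_distinct reaching 95.63% and 85.43% respectively. Introducing Chinese character features also effectively improves the traditional CRFs, with the domain feature and the radical constraint feature based on “shuxinpang” and “xinzidi” performing best. Compared with the long new terms extracted by machine learning, deep learning yields more new imagery words that carry sentiment knowledge. The sentiment terms derived from poems and appreciations provide a reference for sentiment analysis and knowledge services of literary information resources (humanity), and the extraction scheme based on Chinese character linguistic features provides inspiration for deepening natural language processing technology in the Chinese domain (computability). 11 figures. 6 tables. 30 references.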