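XU Haiyun, WU Huawei, LUO Rui, DONG Kun, LI Jing. Topic Identification Based on Multi Semantic Relation Fusion[J]. Journal of Library Science in China, 2019, 45(1): 82-94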
Topic Identification Based on Multi Semantic Relation Fusion
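Chinese title (translated): Research on a Topic Identification Method for Scientific and Technical Texts Based on Multi-Relation Fusion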
Received: May 12, 2018  Revised: October 1, 2018
DOI:
Key words: Topic recognition based on text; Multiple relations; Data fusion; Relational fusion; Topic clustering
Chinese key words (translated): Text topic identification; Multiple relations; Data fusion; Relation fusion; Topic clustering
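Funding: This paper is one of the research outcomes of the National Natural Science Foundation of China project "Research on Methods for Identifying Innovation Evolution Paths Based on Science-Technology Topic Association Analysis" (No. 71704170) and of the Chinese Academy of Sciences special program for intellectual property information services, "Research Informatization Applications for Knowledge Discovery in the Stem Cell Field" (No. KFJ EW STS-032).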
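Author Name | Affiliation | E-mail
XU Haiyun | Chengdu Library and Information Center, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Chengdu, Sichuan 610041 | xuhy@clas.ac.cn
WU Huawei | Chengdu Library and Information Center, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Chengdu, Sichuan 610041 |
LUO Rui | Institute of Scientific and Technical Information, Shandong University of Technology; Zibo, Shandong 255200 |
DONG Kun | Chengdu Library and Information Center, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Chengdu, Sichuan 610041 |
LI Jing | Chengdu Library and Information Center, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Chengdu, Sichuan 610041 |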
Abstract:
Processing multivariate data relations is a typical characteristic of big data analysis. Multi-relationship analysis of topics examines the relationships between topics and other measurable entities (MEs). Scientific and technological documents contain many MEs, which relate directly or indirectly to knowledge units. However, current topic acquisition methods for such documents rely mostly on a single type of association analysis, which makes it difficult to identify the topics of scientific and technological development accurately. Finding the multiple relationships among the entities of a document is therefore a key technology for accurate topic identification in massive scientific and technological literature.
This paper first reviews the state of research on multi-relation fusion in topic identification and summarizes the measurable relationships of topic terms in scientific and technical literature. The review finds that topic terms, authors and citations in this literature are semantically related through topic content, and that their co-occurrence relations each reveal topic associations from a different perspective. Based on the semantic distance between topic terms, we divide the topic term associations used in topic identification into basic relations, strengthened relations and additional relations. For strengthened and additional relations, any type of ME can serve as the intermediate node of a topic term association, so choosing appropriate intermediate MEs is especially important for fully establishing the semantic association between topic terms. This paper chooses authors, references and citing literature as the intermediate MEs for strengthened and additional relations. Together with the topic terms, these MEs form seven types of topic associations. Fusing these relations compensates for the limited information in any single association and yields more accurate topic associations.
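As a rough illustration of how an intermediate ME can link topic terms, the sketch below derives a direct term association (co-occurrence in the same document) and an indirect one (terms that share an author) from a toy corpus. The corpus, the function names and the simple counting scheme are all assumptions made for illustration; they are not the paper's matrix acquisition algorithm, nor its weight definitions.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy corpus: each record lists a document's topic terms and authors.
docs = [
    {"terms": {"vaccine", "antigen"}, "authors": {"A"}},
    {"terms": {"adjuvant"}, "authors": {"A", "B"}},
    {"terms": {"antigen", "vector"}, "authors": {"C"}},
]

def cooccurrence(docs):
    """Direct association: number of documents in which both terms appear."""
    counts = defaultdict(int)
    for d in docs:
        for t1, t2 in combinations(sorted(d["terms"]), 2):
            counts[(t1, t2)] += 1
    return dict(counts)

def via_intermediate(docs, field):
    """Indirect association: number of distinct intermediate entities
    (here, authors) that are linked to both terms."""
    linked = defaultdict(set)  # term -> set of intermediate entities
    for d in docs:
        for t in d["terms"]:
            linked[t] |= d[field]
    result = {}
    for t1, t2 in combinations(sorted(linked), 2):
        shared = len(linked[t1] & linked[t2])
        if shared:
            result[(t1, t2)] = shared
    return result

direct = cooccurrence(docs)
indirect = via_intermediate(docs, "authors")
```

In this toy data, "adjuvant" and "vaccine" never co-occur in a document but share author A, so the pair appears only in `indirect`. Whether such a pair would count as a strengthened or an additional relation under the paper's definitions is not something the abstract settles.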
Acquiring multiple topic associations is the basis of multi-relation fusion. Whether the fusion algorithm can strengthen meaningful semantic associations between topics and weaken noisy correlations is equally critical to multi-relationship topic clustering. With reference to Morris's definition of association weights for multi-relational MEs, this study gives a method for calculating both the direct and the indirect association weights of MEs, and on that basis proposes a multi-relationship extraction and relation fusion method for topic identification. Finally, taking genetic engineering vaccines as the experimental field, we extracted seven types of topic correlation matrices with a self-programmed implementation of the proposed relation-matrix acquisition algorithm and performed multi-relation fusion clustering on them with the PathSelClus algorithm. A comparative analysis shows that multi-relation fusion effectively improves topic clustering.
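To make the idea of relation fusion concrete, here is a minimal sketch that combines two relation types into a single association score per term pair. It uses a fixed weighted sum over max-normalized weights, which is a deliberately simple stand-in and not the PathSelClus algorithm used in the paper; the relation names, the sample counts and the hand-picked weights are all hypothetical.

```python
def normalize(relation):
    """Scale weights so the largest is 1.0, making relation types comparable."""
    peak = max(relation.values(), default=0)
    return {pair: w / peak for pair, w in relation.items()} if peak else {}

def fuse(relations, weights):
    """Weighted sum of normalized relation dictionaries keyed by term pair."""
    fused = {}
    for name, relation in relations.items():
        for pair, w in normalize(relation).items():
            fused[pair] = fused.get(pair, 0.0) + weights[name] * w
    return fused

# Hypothetical association counts for two relation types.
relations = {
    "cooccurrence": {("antigen", "vaccine"): 4, ("antigen", "vector"): 2},
    "shared_author": {("antigen", "vaccine"): 1, ("adjuvant", "vaccine"): 2},
}
# Hypothetical, hand-picked weights (an assumption, not values from the paper).
weights = {"cooccurrence": 0.6, "shared_author": 0.4}

fused = fuse(relations, weights)
```

A pair supported by both relation types, such as ("antigen", "vaccine"), accumulates evidence from each, while a pair seen in only one relation keeps a lower score. The fused scores could then feed any standard clustering step.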
The PathSelClus relation fusion used in this paper is only one of several existing multi-relation fusion methods, and it depends heavily on expert knowledge: the quality of the annotations directly affects the clustering results, and there is no effective way to determine the number of clusters. Much work remains. How well do other fusion methods perform in identifying topics from text, and how do they compare with one another? In line with the research objective, we will also explore more fusion methods and integrate them to obtain fusion results that carry more information. 4 figs. 6 tabs. 19 refs.
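Chinese abstract (translated):
Most current text topic acquisition methods rely on a single type of association analysis; they cannot comprehensively analyze the available information and struggle to identify scientific and technological development topics accurately. The topic terms, authors and citations of scientific and technical literature carry semantic associations linked by research topic content, and topic term co-occurrence, citation and co-authorship relations each reveal topic associations from a different perspective. Accordingly, based on the semantic distance between topic terms, this paper divides the topic term associations used in topic identification into basic relations, strengthened relations and additional relations, and on this basis proposes a multi-relation extraction and relation fusion method for topic identification. Taking the research, development and preparation of genetically engineered vaccines as an example, it conducts an empirical domain analysis, implementing topic clustering based on multi-relation fusion with the PathSelClus algorithm. Comparative experiments show that multi-relation fusion effectively improves text topic clustering in the empirical domain, and that multi-relation fusion for topic identification is a key issue for future work. 4 figures. 6 tables. 19 references.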