GUO Hongmei, ZHANG Zhixiong. Methods of Text Theme Identification Based on Graph Mining [J]. Journal of Library Science in China, 2015, 41(6): 97-108
Methods of Text Theme Identification Based on Graph Mining
Chinese title (translated): A Review of Text Theme Identification Methods Based on Graph Mining
Received: July 31, 2015
DOI:10.13530/j.cnki.jlis.156008
Key words: Text theme identification; Graph mining; Centrality; Clique sub-group
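Funding: This paper is one of the research outcomes of the National Natural Science Foundation of China project "Research on Text Theme Centrality Computation Methods Based on Language Networks" (No. 61075047).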
Author Name      Affiliation                                   E-mail
GUO Hongmei      Chinese Academy of Sciences, Beijing 100190   zhangzhx@mail.las.ac.cn
ZHANG Zhixiong   Chinese Academy of Sciences, Beijing 100190
Abstract:
With the growth of the internet, the volume of electronic text is expanding rapidly. These text resources, especially scientific journal papers, contain rich semantic and link information. How to present their core topics quickly and accurately, so as to assist researchers and improve research efficiency, has become an urgent problem in text mining. Because the nodes and edges of a graph can represent the terms of a text and the relations between them, many researchers have combined graph mining with natural language processing to identify text themes. This paper surveys and analyzes these studies and summarizes their advantages and disadvantages to provide a reference for further research.
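The idea of representing terms as nodes and their relations as edges can be illustrated with a minimal Python sketch. It is not taken from the surveyed studies: the sentence-level co-occurrence rule, the toy sentences, and the function names `build_cooccurrence_graph` and `degree_centrality` are all assumptions chosen for illustration, and degree centrality stands in for the more elaborate centrality measures the literature uses.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_graph(sentences):
    """Return {term: set(neighbors)}; an edge links two terms that share a sentence."""
    graph = defaultdict(set)
    for terms in sentences:
        unique = set(terms)
        for term in unique:
            graph[term]  # touch the key so isolated terms still become nodes
        for a, b in combinations(sorted(unique), 2):
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)


def degree_centrality(graph):
    """Degree of each node divided by (n - 1), the largest possible degree."""
    n = len(graph)
    if n < 2:
        return {node: 0.0 for node in graph}
    return {node: len(neighbors) / (n - 1) for node, neighbors in graph.items()}


# Toy, made-up input: each inner list holds the terms of one sentence.
sentences = [
    ["graph", "mining", "theme"],
    ["graph", "centrality"],
    ["graph", "clique", "theme"],
]
graph = build_cooccurrence_graph(sentences)
scores = degree_centrality(graph)
top_term = max(scores, key=scores.get)  # candidate core term under this simple score
```

In this toy input the term that co-occurs with every other term receives the highest score. Real systems would first extract terms with a natural language processing pipeline and might weight edges by co-occurrence frequency, which this sketch omits.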
At present, studies focus on representing texts as relation graphs, on theme identification based on centrality, and on subgraph detection or clustering. Theme identification based on cohesive subgraph detection mainly recognizes clique or quasi-clique subgraphs that represent the core content of a text. Theme identification based on graph clustering uses two kinds of methods: one relies on the topological structure of the graph alone, and the other considers topological structure and node attributes simultaneously. For these, we mainly analyzed the clustering models, the algorithms, and the criteria for evaluating clustering results. Methods based on frequency statistics and external dictionaries are relatively mature and are often used as benchmarks. Centrality methods have improved considerably, but their algorithmic efficiency still needs work. Methods based on graph mining have already shown advantages and merit deeper exploration.
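To make the notion of a clique subgraph concrete, the following Python sketch enumerates the maximal cliques of a small term graph with the basic Bron-Kerbosch recursion (without pivoting). It is an illustrative assumption rather than the procedure of any surveyed study: the edge list is made up, `maximal_cliques` is a hypothetical helper name, and quasi-clique detection would need a relaxed density criterion that is not shown here.

```python
def maximal_cliques(graph):
    """Yield each maximal clique of an undirected graph {node: set(neighbors)}."""

    def expand(current, candidates, excluded):
        # No node can extend the clique and none was set aside: it is maximal.
        if not candidates and not excluded:
            yield sorted(current)
            return
        for node in list(candidates):
            yield from expand(
                current | {node},
                candidates & graph[node],
                excluded & graph[node],
            )
            # The node has been fully explored; move it to the excluded set.
            candidates = candidates - {node}
            excluded = excluded | {node}

    yield from expand(set(), set(graph), set())


# Toy, made-up term graph built from an undirected edge list.
edges = [
    ("graph", "mining"), ("graph", "theme"), ("mining", "theme"),
    ("graph", "clique"), ("theme", "clique"), ("graph", "centrality"),
]
graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

# Largest cliques first; ties are broken alphabetically.
cliques = sorted(maximal_cliques(graph), key=lambda c: (-len(c), c))
```

Each maximal clique is a group of terms that are all pairwise related, which is why such groups are treated as candidates for the core content of a text. Maximal clique enumeration is exponential in the worst case, so this basic recursion is only suitable for small graphs.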
The language network of a text has its own characteristics. Various relations exist between terms, such as co-occurrence, syntactic, and semantic relations. How to construct a complex text network that reveals these relations simultaneously is one direction for future research. Further studies need to address how to identify cohesive subgraphs in a complex text network according to the relations between terms and the topological structure of the graph. In addition, the criterion by which these subgraphs are clustered to reveal the core sub-themes of a text and the relations among themes also needs to be discussed. 1 tab. 50 refs.
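Chinese Abstract:
      Based on a literature survey and analysis, this paper groups graph-mining-based text theme identification methods into three types: centrality methods, cohesive subgraph detection, and graph clustering. The latter two are further subdivided into methods based on cliques or quasi-cliques and methods that cluster on graph topological structure or node attributes. Centrality methods identify text themes by comparing the importance of term nodes in the text network, whereas cohesive subgraph detection and graph clustering identify the core themes of a text from the attribute similarity of term nodes and edges in the text graph. Given the characteristics of language text networks, how to construct a complex text relation graph that simultaneously reveals the syntactic, co-occurrence, and semantic relations between terms, how to identify the cohesive sub-groups within it based on term associations and graph topology, and what criterion to use in clustering those sub-groups to reveal the core themes of a text are all questions that need further in-depth study. 1 tab.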