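LIN Zefei, OU Shiyan. Large Scale People Social Network Extraction and Analysis Based on Online Encyclopedia [J]. Journal of Library Science in China, 2019, 45(6): 100-118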
Large Scale People Social Network Extraction and Analysis Based on Online Encyclopedia
Received:February 09, 2019  Revised:April 02, 2019
DOI:
Key words:Social network extraction  Social network analysis  People social network  Online encyclopedia  Digital humanities
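Funding: This paper is one of the research outcomes of the Key Project of the National Social Science Fund of China, "Research on the Semantic Publishing of Academic Literature Content Based on Linked Data and Its Applications" (No. 17ATQ001).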
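Author Name | Affiliation | E-mail
LIN Zefei | School of Information Management, Nanjing University, Nanjing, Jiangsu 210023 |
OU Shiyan | School of Information Management, Nanjing University, Nanjing, Jiangsu 210023 | oushiyan@nju.edu.cn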
Abstract:
Social Network Extraction (SNE) is an emerging research field that focuses on automatically extracting hidden social networks from a wide variety of information sources. The articles of online encyclopedias contain massive information about persons and their interpersonal relationships, from which a people social network can be extracted and used for research in digital humanities and social computing. The extracted people social network involves both real persons, who may span thousands of years, and virtual persons, who may come from a large number of literary works. However, most existing people social network extraction methods ignore the types and spatio-temporal characteristics of persons, and consider only text similarity or other related features when measuring the degree of relevance between persons. This may limit both the accuracy and the application fields of the extracted people social networks.
This study explored, for the first time, the automatic extraction of a large-scale people social network from a Chinese online encyclopedia, taking Baidu Encyclopedia as an example. It proposed a new social network extraction method that distinguishes the types and spatio-temporal characteristics of the extracted persons and measures the weight of interpersonal relationships more accurately based on multiple relevance features. The method contains three phases: generating an initial people social network, computing the relationship strength between persons, and analyzing the spatio-temporal characteristics of persons. In the first phase, the articles on persons (hereinafter referred to as "person articles") were identified in Baidu Encyclopedia, and an initial undirected and unweighted people social network containing more than 0.54 million nodes and 2.22 million edges was generated from the links between person articles. In the second phase, computing the strength of the relationships between persons in the initial network was treated as a ranking task and solved with a supervised learning-to-rank (L2R) method that combined five similarity features to measure the degree of relevance between persons. In this way, the initial unweighted people network was transformed into a weighted network whose person nodes span time and space. In the third phase, the living time-space of each person in the people network was estimated. For a real person, the living time-space was estimated from the years (including reign titles) occurring in the article on that person, whereas for a virtual person, the living time-space was the one or more works depicting that person. In this way, a time-space coupling network containing about 0.45 million nodes and 1.70 million edges was derived from the cross-time-space weighted people network.
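The first and third phases can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy link data, the person labels and the rule that two real persons are "coupled" when their estimated year intervals overlap are all assumptions made for illustration. The abstract does not state the exact coupling criterion, and virtual persons (whose time-space is a set of works) are not handled here.

```python
def build_initial_network(links, person_ids):
    """Phase 1 (sketch): undirected, unweighted edges between person articles.

    links: dict mapping an article id to the article ids it links to.
    person_ids: set of article ids identified as person articles.
    Assumption: a link in either direction creates one undirected edge.
    """
    edges = set()
    for source, targets in links.items():
        if source not in person_ids:
            continue
        for target in targets:
            if target in person_ids and target != source:
                edges.add(frozenset((source, target)))
    return edges


def spans_overlap(a, b):
    """True if two (start_year, end_year) intervals share at least one year."""
    return a[0] <= b[1] and b[0] <= a[1]


def time_space_coupled(edges, spans):
    """Phase 3 (sketch): keep edges whose two persons have overlapping spans.

    spans: dict mapping a person id to an estimated (start_year, end_year).
    Assumption: interval overlap stands in for the paper's coupling test.
    """
    kept = set()
    for edge in edges:
        u, v = tuple(edge)
        if u in spans and v in spans and spans_overlap(spans[u], spans[v]):
            kept.add(edge)
    return kept


# Hypothetical toy data: three person articles and one non-person article.
links = {"A": ["B", "C", "Nanjing"], "B": ["A"], "C": [], "Nanjing": ["A"]}
person_ids = {"A", "B", "C"}
spans = {"A": (701, 762), "B": (712, 770), "C": (1037, 1101)}

initial = build_initial_network(links, person_ids)
coupled = time_space_coupled(initial, spans)
print(len(initial), len(coupled))
```

In this toy example the initial network keeps both person-to-person edges, while the coupled network drops the edge between A and C because their estimated intervals do not overlap. This mirrors how the time-space coupling network in the abstract is smaller than the initial network.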
The characteristics of the two extracted people social networks were investigated with social network analysis. The results showed that both networks were small-world and scale-free networks and had a clear community structure. Furthermore, three types of visual analysis were performed on the two people networks: point analysis was used to detect persons related to a central person; chain analysis was used to discover the path between two persons (i.e. their direct or indirect relationships); and network analysis was used to reveal highly central persons and person communities in a specific historical or documental time-space. These analyses indicated that large-scale social networks extracted from online encyclopedias have great value for supporting digital humanities research and for improving researchers' perception of historical personages in reality and of virtual characters in literary and artistic works. 8 figs. 6 tabs. 39 refs.
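As an illustration of chain analysis and of one statistic commonly used when checking small-world behaviour, the following Python sketch finds a shortest path between two persons with breadth-first search and computes a local clustering coefficient. The adjacency data, node labels and function names are hypothetical, and the abstract does not say which algorithms or tools the authors used.

```python
from collections import deque


def shortest_chain(adjacency, start, goal):
    """Breadth-first search for a shortest path between two persons.

    adjacency: dict mapping a person id to the set of directly related persons.
    Returns the path as a list of ids, or None if the two are not connected.
    """
    if start == goal:
        return [start]
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour in parents:
                continue
            parents[neighbour] = node
            if neighbour == goal:
                # Walk back through the parent links to rebuild the path.
                path = [goal]
                while parents[path[-1]] is not None:
                    path.append(parents[path[-1]])
                return path[::-1]
            queue.append(neighbour)
    return None


def local_clustering(adjacency, node):
    """Fraction of a node's neighbour pairs that are themselves connected."""
    neighbours = list(adjacency.get(node, ()))
    k = len(neighbours)
    if k < 2:
        return 0.0
    connected_pairs = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if neighbours[j] in adjacency.get(neighbours[i], ())
    )
    return 2.0 * connected_pairs / (k * (k - 1))


# Hypothetical toy network: a triangle A-B-C with a tail C-D-E.
adjacency = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C", "E"},
    "E": {"D"},
}

print(shortest_chain(adjacency, "A", "E"))
print(local_clustering(adjacency, "C"))
```

A small-world network combines short chains of this kind with high average clustering. On a network with hundreds of thousands of nodes, a dedicated graph library would be a more practical choice than this plain-Python sketch.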
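Chinese abstract:
      Online encyclopedia entries contain a massive amount of information on relationships between persons, from which large-scale social networks can be extracted to provide data support for digital humanities and social computing research. Taking Baidu Encyclopedia as an example, this study explores large-scale social network extraction from Chinese online encyclopedias for the first time and proposes a new method for extracting people social networks. The method uses learning to rank to combine multiple features when computing the weights of interpersonal relationships, and discovers time-space coupling relationships between persons by estimating each person's living time-space. In this way, a weighted cross-time-space people social network and a time-space coupled people network were extracted from Baidu Encyclopedia. Both networks show good small-world and scale-free properties and have a clear community structure. Finally, visual analysis demonstrates the application patterns and application value of encyclopedia-based people networks in digital humanities research. 8 figs. 6 tabs. 39 refs.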