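WANG Fang, ZHAO Hong, MA Jiayue, LI Xiaoyang, ZHANG Xiaoyue. Research and Practice Progress of Data Provenance from the Perspective of Data Science[J]. Journal of Library Science in China, 2019, 45(5): 79-100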
Research and Practice Progress of Data Provenance from the Perspective of Data Science
Original Chinese title: 数据科学视角下数据溯源研究与实践进展
Received: June 25, 2019  Revised: August 20, 2019
Key words:Data science  Data provenance  Blockchain  Data quality  Big data platform
Chinese keywords: 数据科学; 数据溯源; 区块链; 数据质量; 大数据平台
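Funding: This paper is one of the research outcomes of the Major Project of the National Social Science Fund of China, "Research on the Governance of China's Network Society" (No. 14ZDA063), and of the key open-fund project "Research on Intelligent Processing Technology for Large-Scale Government Documents Based on NLP and Deep Learning" supported by the National Engineering Laboratory for Big Data Application Technologies for Improving Government Governance Capabilities.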
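Authors (name; affiliation; e-mail):
WANG Fang; Department of Information Resources Management, Business School, Nankai University, Tianjin 300071; wangfangnk@nankai.edu.cn
ZHAO Hong; Department of Information Resources Management, Business School, Nankai University, Tianjin 300071
MA Jiayue; Department of Information Resources Management, Business School, Nankai University, Tianjin 300071
LI Xiaoyang; Department of Information Resources Management, Business School, Nankai University, Tianjin 300071
ZHANG Xiaoyue; Department of Information Resources Management, Business School, Nankai University, Tianjin 300071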
Abstract:
In the data age, authenticity and reliability are fundamental requirements for data in many fields. Achieving data quality control and trusted data management through data provenance therefore has considerable research value and practical significance. Data provenance is not only a technical problem but also a management problem, and it deserves more attention from scholars in data science and information resources management.
Data provenance is widely applied in scientific data curation, e-commerce, food safety, culture and art, medical treatment, digital libraries, electronic document management and many other fields, and it has been studied extensively. From the perspective of data science, this paper reviews research and practice progress in data provenance on the basis of 136 domestic and foreign research papers. After reviewing the concepts, models, computation methods and practical applications of data provenance, it introduces related studies and practice in the field of information resources management. Finally, it discusses future research trends in data provenance.
This paper systematically traces the development of the concept of data provenance and introduces five types of data provenance models, classified by their functional level and application characteristics in data management: the information description model, the general expression model, the domain application model, the security management model and the blockchain provenance management model. A model is an abstract framework representing the strategy and process of data provenance, while computation comprises the techniques and algorithms that implement it. Data provenance computation follows two basic approaches: tag-based provenance and non-tag-based provenance. Specific computing methods have been developed for different application scenarios. This paper mainly introduces the computing methods used in typical application scenarios such as relational databases, scientific workflows, big data platforms, cloud computing and blockchain. It also focuses on the research and practice of data provenance in digital libraries, archival information management, online information resources management, scientific data sharing and curation, and e-commerce information systems.
On the whole, research on data provenance still has limitations in technology, standards and specifications, information security, blockchain integration, and model extension and verification. More in-depth research and practical exploration are needed in these areas. This paper is intended to give scholars of information resources management and data science a comprehensive understanding of data provenance. Its limitation is that, because of length constraints, it does not formulate a data provenance model for information resources management; that work is left to future research. 136 refs.
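Chinese abstract (translated):
Authenticity and reliability are now fundamental requirements for data in every field, and achieving data quality control and trusted data management through data provenance has significant research value and practical importance. Data provenance is not only a technical problem but also a management problem, and under the data science paradigm it deserves the attention of information resources management research. In view of this, and drawing on the latest research progress in related fields, this paper systematically describes the conceptual development and meaning of data provenance; surveys the information description models, general expression models, domain application models, security management models and blockchain provenance management models used for data provenance management; and describes the data provenance computation methods used in typical application environments such as relational databases, scientific workflows, big data platforms, cloud computing and blockchain. In addition, this paper analyzes the application value and related practice of data provenance in areas of information resources management research, including digital libraries, archival information management, online information resources management, scientific data sharing and management, and e-commerce information systems. It also discusses prospects for data provenance in technical methods, standards and specifications, information security, blockchain integration, and model extension and verification, with the aim of providing a reference for researchers in data management and data science. 136 references.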