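YANG Jianliang, LIU Yuenan, QI Tianjiao. Documents Datafication: Concept, Framework and Methods[J]. Journal of Library Science in China (中国图书馆学报), 2022, 48(3): 63-78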
Documents Datafication: Concept, Framework and Methods
Received: September 13, 2021
DOI:
Key words:Document  Datafication  Unstructured data  Structurization  Quantification
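Funding: This paper is one of the research outcomes of the China Postdoctoral Science Foundation General Program (First-Class Grant) project "Research on Value Appraisal of Digital Documentary Archives Based on Deep Learning and Event Knowledge Graphs" (No. 2020M680029).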
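Author Name: Affiliation
YANG Jianliang: School of Information Resource Management, Renmin University of China; Key Laboratory of Data Engineering and Knowledge Engineering, Ministry of Education; Electronic Records Management Research Center, Renmin University of China, Beijing 100872
LIU Yuenan: School of Information Resource Management, Renmin University of China; Key Laboratory of Data Engineering and Knowledge Engineering, Ministry of Education; Electronic Records Management Research Center, Renmin University of China, Beijing 100872
QI Tianjiao: School of Information Resource Management, Renmin University of China; Key Laboratory of Data Engineering and Knowledge Engineering, Ministry of Education; Electronic Records Management Research Center, Renmin University of China, Beijing 100872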
Abstract:
The value of data is increasingly recognized. However, according to statistics, most data takes the form of unstructured documents, such as books, periodicals, documents and archives, that machines cannot directly analyze or compute on, so their value is difficult to release in full. To make further use of big data, artificial intelligence and related technologies in releasing the value of data, the concept of documents datafication has been put forward and has attracted growing attention. Datafication is becoming a new field of digital transformation in library science, information science and archival science. However, the concept of documents datafication remains vague, and a systematic framework has not yet been formed.
To consolidate the conceptual foundation of the discipline and effectively promote the theory and practice of documents datafication, the authors systematically study its conceptual connotation, content framework and core methods by synthesizing and deducing from multidisciplinary concepts and related methods.
This paper defines documents datafication as a process that transforms documents into data that machines can recognize, analyze and compute, for the purpose of developing and utilizing information resources. Intelligent technologies allow machines to participate in the decision-making process of documents datafication, so that it exhibits four characteristics: human-machine cooperation, utilization-driven, refined granularity and computing-oriented. Based on these findings, the authors further put forward a task framework for documents datafication, which comprises four tasks: transcription recognition, description enhancement, linkage construction and vectorization processing. The four tasks present a machine-oriented evolution mechanism simultaneously along three dimensions: structuration, semantics and intelligentization. A review of the fundamental and key methods involved in the four tasks shows that methods centered on deep learning, natural language processing and related technologies are playing an increasingly important role in documents datafication. 6 figs. 6 tabs. 36 refs.
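Chinese abstract (translated):
The value of data is now highly recognized across society. To make further use of big data, artificial intelligence and related technologies in releasing the value of data, the concept of documents datafication has been proposed and is receiving growing attention; it has also become a new field in the digital transformation of library, information and archival science. Through the synthesis and deduction of multidisciplinary concepts and methods, this paper systematically studies the conceptual connotation, content framework and key methods of documents datafication. The study finds that documents datafication is a process, oriented toward the development and utilization of documents, that transforms documents into data that machines can recognize, analyze and compute; intelligent technologies allow machines to take part in the decision-making process of documents datafication, giving it the characteristics of human-machine cooperation, utilization-driven, refined granularity and computing orientation. On this basis, the paper proposes a task framework for documents datafication comprising four tasks (transcription recognition, description enhancement, linkage construction and vectorization processing), which show a machine-oriented evolution mechanism along three dimensions: structuration, semantics and intelligentization. A review of the basic and key methods involved in each task shows that documents datafication methods centered on deep learning, natural language processing and related technologies are playing an increasingly important role. 6 figures. 6 tables. 36 references.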