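ZHANG Zhixiong, ZHAO Yang, LIU Huan. Construction of a Practical Application-Oriented Automatic Classification Engine for Scientific Literature [J]. Journal of Library Science in China, 2022, 48(4): 104-115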
Construction of a Practical Application-Oriented Automatic Classification Engine for Scientific Literature
Received: September 28, 2021  Revised: May 06, 2022
DOI:
Key words:Literature classification  Scientific literature  Automatic classification  Classification engine  Hierarchical classification  Classifier cluster
中文关键词:  文献分类  科技文献  自动分类  分类引擎  层次分类法  分类器集群
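Funding: This paper is one of the research outcomes of the Chinese Academy of Sciences special project on literature and information capacity building, "Construction of an Artificial Intelligence (AI) Engine Based on Scientific Literature Knowledge" (No. E0290906).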
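Author Name    Affiliation
ZHANG Zhixiong    National Science Library, Chinese Academy of Sciences; Department of Library, Information and Archives Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190
ZHAO Yang    National Science Library, Chinese Academy of Sciences; Department of Library, Information and Archives Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190
LIU Huan    National Science Library, Chinese Academy of Sciences; Department of Library, Information and Archives Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190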
Abstract:
Literature classification is a traditional research problem in library and information science. One of the most important requirements for a practical automatic classification system based on the Chinese Library Classification is the ability to assign literature automatically to its third- or fourth-level categories. This means accurately and automatically classifying a given document into one of thousands of Chinese Library Classification categories. To build an automatic classification engine for scientific literature that is oriented toward practical application, we adopt the idea of hierarchical classification to design the engine's logical and technical framework, and we implement an automatic classification engine for scientific documents based on a multi-layer classifier cluster.
This paper focuses on four key issues in constructing an automatic classification engine for scientific literature: 1) how to obtain and construct large-scale, high-quality classification training data to improve automatic classification performance; 2) how to design and implement a multi-layer classifier cluster that effectively addresses the accuracy of automatic classification across thousands of categories; 3) how to optimize the processing flow in light of practical requirements to improve classification speed; 4) how to design and open an interface that supports open calls to the engine. We propose a solution to each issue: 1) We build large-scale, high-quality classification training data from the Chinese Science Citation Database (CSCD) to improve classification performance, and we design category screening rules to reduce the imbalance of categories across levels. 2) We construct 26 automatic classifiers based on the BERT pre-trained language model and assemble them into a two-layer automatic classification cluster framework to achieve accurate automatic classification of scientific literature. 3) We compress the classification model within a microservice-based architecture and optimize the batch literature classification workflow to improve classification speed. 4) We develop an HTTP-based Application Programming Interface (API) so that the classification engine can be called openly.
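The two-layer routing idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the 26 classifiers are divided between the two layers, so the sketch assumes one top-level classifier that picks a coarse class and one second-layer classifier per coarse class. The names `keyword_classifier` and `TwoLayerCluster` are hypothetical, the keyword rules stand in for the BERT-based models so that the example runs without dependencies, and the labels are illustrative placeholders styled after Chinese Library Classification notation.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# A classifier maps a text to a label. In the paper these are BERT-based models;
# here they are toy keyword functions (an assumption made for illustration only).
Classifier = Callable[[str], str]


def keyword_classifier(rules: Dict[str, str], default: str) -> Classifier:
    """Return a toy classifier that yields the label of the first keyword found."""
    def classify(text: str) -> str:
        lowered = text.lower()
        for keyword, label in rules.items():
            if keyword in lowered:
                return label
        return default
    return classify


@dataclass
class TwoLayerCluster:
    """Hypothetical two-layer cluster: one coarse model routes to fine-grained models."""
    top: Classifier                  # layer 1: chooses a coarse class
    children: Dict[str, Classifier]  # layer 2: one classifier per coarse class

    def classify(self, text: str) -> str:
        coarse = self.top(text)
        child = self.children.get(coarse)
        # Fall back to the coarse label when no second-layer classifier is registered.
        return child(text) if child is not None else coarse


# Illustrative labels only; they are placeholders, not taken from the paper.
cluster = TwoLayerCluster(
    top=keyword_classifier(
        {"neural": "TP", "algorithm": "TP", "tumor": "R", "quantum": "O"},
        default="TP",
    ),
    children={
        "TP": keyword_classifier({"neural": "TP18", "retrieval": "TP391"}, default="TP"),
        "R": keyword_classifier({"tumor": "R73"}, default="R"),
    },
)

print(cluster.classify("Neural network pruning"))
print(cluster.classify("An algorithm for text retrieval"))
```

Splitting the decision this way means each second-layer model only has to separate the categories under one coarse class, which is the usual motivation for hierarchical classification when the flat label space runs to thousands of categories.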
The resulting automatic classification engine for scientific literature covers 2118 categories and draws on a corpus of about 18 million high-quality scientific documents, and it can automatically classify literature into the third or fourth level of the Chinese Library Classification. We select journal papers and dissertations to evaluate classification performance; the complete discrepancy rate between the classification results and the original classification numbers is only about 1%. Beyond classification performance, the engine's classification speed and other indicators all meet practical requirements, which indicates that a practical-level automatic classification system based on the Chinese Library Classification has been initially realized. 4 figs. 7 tabs. 16 refs.
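The abstract does not define how the complete discrepancy rate is computed. As a rough sketch under a stated assumption, the example below treats a prediction as completely discrepant when it shares no leading character with any of the document's original classification numbers, that is, when it disagrees even at the top level. The function names and sample class numbers are hypothetical and the data is made up for illustration.

```python
from typing import Iterable, List, Tuple


def shared_prefix_len(a: str, b: str) -> int:
    """Count how many leading characters two class numbers have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def complete_discrepancy_rate(pairs: Iterable[Tuple[str, List[str]]]) -> float:
    """Fraction of predictions that match none of the original numbers at any level.

    Assumption: "complete discrepancy" means an empty common prefix with every
    original class number; the paper may define the metric differently.
    """
    total = 0
    discrepant = 0
    for predicted, originals in pairs:
        total += 1
        if all(shared_prefix_len(predicted, o) == 0 for o in originals):
            discrepant += 1
    return discrepant / total if total else 0.0


# Made-up (predicted, original numbers) pairs for illustration.
sample = [
    ("TP18", ["TP181"]),     # agrees down to the predicted level
    ("TP391", ["TP18"]),     # agrees on "TP" only, so a partial match
    ("R73", ["TP391"]),      # no common prefix: completely discrepant
    ("O4", ["O41", "TB"]),   # matches one of two original numbers
]
print(complete_discrepancy_rate(sample))
```

A prefix-based comparison fits hierarchical notations, where a shared prefix indicates agreement at the upper levels even when the deepest level differs.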
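Chinese abstract (English translation):
      Literature classification is a traditional research problem in library and information science. One of the most important requirements for a practical automatic classification system based on the Chinese Library Classification is the ability to assign literature accurately and automatically to third- or fourth-level categories, which means classifying a given document with reasonable accuracy into one of more than a thousand categories. To build an automatic classification engine for scientific literature based on the Chinese Library Classification and oriented toward practical application, this paper adopts the idea of hierarchical classification to design and implement an engine based on a multi-layer classifier cluster, and focuses on four key issues in building such an engine: (1) how to obtain and construct large-scale, high-quality classification training data to improve automatic classification performance; (2) how to design and implement a multi-layer classifier cluster to effectively address the accuracy of automatic classification across more than a thousand categories; (3) how to optimize the processing flow in light of practical requirements to improve classification speed; (4) how to design and open interfaces to support open calls to the engine. The resulting automatic classification engine for scientific literature meets practical requirements on all indicators and represents an initial realization of the practical application of an automatic classification system based on the Chinese Library Classification. 4 figs. 7 tabs. 16 refs.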