Page 104 - JOURNAL OF LIBRARY SCIENCE IN CHINA 2018 Vol. 42
WU Wenna & BAO Xiulin / The architecture and data model of the National Thesauri Warehouse
2.1 Data acquisition and transformation
Most Chinese thesauri are influenced by CT and share common characteristics in their structures
and description patterns. This facilitates uniform and normative description of thesauri, as well as
data storage in a consistent format. However, because the thesauri belong to different scientific
disciplines and were designed and developed by different institutions, it is almost impossible to
ensure that their macrostructures and microstructures are always consistent. Therefore, the NTW
project must investigate and analyze the commonalities and differences in the structures and
description patterns of the Chinese thesauri in order to resolve inconsistencies in the description
of thesaurus knowledge structures.
The major functions of the data acquisition and transformation layer include thesaurus data
acquisition, normative data description, and format transformation. This layer has three specific
modules: thesaurus metadata registration, thesaurus data import and verification, and uniform
thesaurus description and format transformation.
Thesaurus metadata registration is the operation of recording the metadata of thesauri. Thesaurus
metadata include thesaurus names, developers, publication dates, disciplines, copyright, etc.
Registration assembles basic information about Chinese thesauri developed in different periods
and helps users discover and locate useful thesaurus resources. Thesaurus data import and
verification are the operations of importing terms and relations from the registered thesauri
into the system and of verifying and controlling data quality. Thesauri in print need to be
digitized first. Digitized thesauri usually come in different formats since they are obtained
from different sources. Therefore, the system should support data import in different formats.
Thesauri constructed in earlier years were constrained by the technical conditions of the time
and usually have some logical problems, such as conflicting, redundant, and circular relations
(Wu & Wang, 2012). The verification module finds and resolves these logical problems
automatically to ensure valid logic while keeping the loss of term information as low as
possible. Verified thesaurus data can then be described according to a uniform metadata scheme
and stored in a uniform format.
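The kinds of logical checks named above can be illustrated with a minimal sketch. It is not the NTW verification module: the relation codes (BT, NT, RT), the function names, and the sample terms are assumptions made for illustration, and the working definitions (redundancy as a repeated or inverse-duplicated relation, conflict as a pair of terms linked both hierarchically and associatively, circularity as a cycle among broader-term links) are one plausible reading of the three problem types. The sketch only detects problems; resolving them automatically is a separate step that it does not attempt.

```python
from collections import defaultdict

# Assumed relation codes: "BT" = broader term, "NT" = narrower term,
# "RT" = related term. Each relation is a triple (term, code, other_term).


def normalize(relations):
    """Rewrite NT as the inverse BT, order RT pairs, and separate duplicates."""
    seen, unique, redundant = set(), [], []
    for a, code, b in relations:
        if code == "NT":
            a, code, b = b, "BT", a      # "a NT b" states the same fact as "b BT a"
        elif code == "RT" and b < a:
            a, b = b, a                  # RT is symmetric, so store one orientation
        triple = (a, code, b)
        if triple in seen:
            redundant.append(triple)
        else:
            seen.add(triple)
            unique.append(triple)
    return unique, redundant


def find_conflicts(unique):
    """Return term pairs linked both hierarchically (BT) and associatively (RT)."""
    hier = {frozenset((a, b)) for a, code, b in unique if code == "BT"}
    assoc = {frozenset((a, b)) for a, code, b in unique if code == "RT"}
    return sorted(tuple(sorted(pair)) for pair in hier & assoc)


def find_cycle(unique):
    """Return one cycle among BT links as a list of terms, or None."""
    broader = defaultdict(list)
    for a, code, b in unique:
        if code == "BT":
            broader[a].append(b)
    UNSEEN, ACTIVE, DONE = 0, 1, 2
    state = defaultdict(int)

    def visit(term, path):
        # Depth-first search; meeting an ACTIVE term again means a cycle.
        state[term] = ACTIVE
        path.append(term)
        for parent in broader.get(term, []):
            if state[parent] == ACTIVE:
                return path[path.index(parent):] + [parent]
            if state[parent] == UNSEEN:
                found = visit(parent, path)
                if found:
                    return found
        path.pop()
        state[term] = DONE
        return None

    for term in list(broader):
        if state[term] == UNSEEN:
            cycle = visit(term, [])
            if cycle:
                return cycle
    return None


# Hypothetical sample data containing one problem of each kind.
relations = [
    ("cat", "BT", "mammal"),
    ("mammal", "NT", "cat"),     # redundant: inverse of the relation above
    ("mammal", "BT", "animal"),
    ("animal", "BT", "cat"),     # circular: cat -> mammal -> animal -> cat
    ("cat", "RT", "mammal"),     # conflict: pair is already linked by BT
]

unique, redundant = normalize(relations)
conflicts = find_conflicts(unique)
cycle = find_cycle(unique)
```

Normalizing narrower-term statements into broader-term ones before checking means each hierarchical fact is stored once, so duplicates can be dropped without losing term information; conflicts and cycles, by contrast, would normally need a rule or an editor's decision about which relation to keep.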
2.2 Storage and semantic integration
2.2.1 Top classification and ontology
A thesaurus usually has a category system or a classification to categorize its terms, and
different thesauri usually have different classifications. For an integration system, a top
classification provides a uniform navigation system for concepts from its source thesauri, which
facilitates the implementation of thesaurus semantic integration. The NTW semantic integration
system establishes two top classification schemes, based on subjects and ontology respectively:
a top classification and a top ontology.