Page 188 - Journal of Library Science in China, Vol.47, 2021

ZHANG Wei, WANG Hao, DENG Sanhong & ZHANG Baolong / Sentiment term extraction and application of Chinese ancient poetry text for digital humanities


essays were obtained, and the three-level fields "article title", "poetry essay" and "appreciation" were selected as the corpus. The marks ".", "?", "!" and ";" were then used as sentence segmentation boundaries, yielding 51,545 paragraphs, and the initial category of each paragraph was set to "0" to form the corpus to be labeled, which this paper calls the cold corpus. In addition, because the poems in this corpus are relatively short, and in order to verify its validity, the author introduces a Tang poetry corpus (excluding the poems used in this article) from the open-source "GitHub" website (https://github.com/Werneror/Poetry) and constructs a new training set from those poems and texts, so that the text size of the new training set matches that of the training set of this paper. The difference in term-extraction performance of the models on the two training sets is then compared.
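The corpus-construction step above can be sketched as follows. This is a minimal illustration, not the authors' code; it assumes the four segmentation marks are the full-width Chinese sentence-final punctuation marks 。？！；, and the function name is hypothetical.

```python
import re

# Split each appreciation text into sentence segments on the four assumed
# Chinese sentence-final marks, keeping the mark attached to its segment,
# then pair every segment with the initial category "0" to form the
# unlabeled ("cold") corpus described in the paper.
SENT_END = re.compile(r"(?<=[。？！；])")

def build_cold_corpus(texts):
    """Return (segment, "0") pairs for every non-empty sentence segment."""
    corpus = []
    for text in texts:
        for seg in SENT_END.split(text):
            seg = seg.strip()
            if seg:
                corpus.append((seg, "0"))
    return corpus
```

The zero-width lookbehind keeps each punctuation mark with the segment it closes, so no text is lost during splitting.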


               2.3 Automatic generation of emotion learning corpus of ancient poems based on cold
               start

Cold start refers to a method that, when there is little or no learning corpus, generates one by automatically labeling a cold corpus with relevant domain knowledge, so that a model for the target domain can be trained. In this paper, the core mechanism of the cold start is to match terms in the cold corpus against a sentiment word set, map text fragments into character sequences with the help of the character tagging model, and label the character roles at the same time, thereby generating the learning corpus automatically.
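The matching-and-labeling mechanism can be sketched as below. This is an illustrative assumption, not the paper's implementation: the BIO-style role tags ("B" for term-initial characters, "I" for term-internal, "O" for other) are a common convention for character tagging and are not specified in the text above.

```python
# Match entries from a sentiment word set against one cold-corpus fragment
# and emit a character-level role sequence (hypothetical BIO tag scheme).
def auto_label(fragment, emotion_words):
    tags = ["O"] * len(fragment)
    # Try longer dictionary entries first so a long term is not fragmented
    # by one of its shorter substrings.
    for word in sorted(emotion_words, key=len, reverse=True):
        start = 0
        while True:
            idx = fragment.find(word, start)
            if idx == -1:
                break
            # Only label spans whose characters are still untagged.
            if all(t == "O" for t in tags[idx:idx + len(word)]):
                tags[idx] = "B"
                for j in range(idx + 1, idx + len(word)):
                    tags[j] = "I"
            start = idx + 1
    return list(zip(fragment, tags))
```

For example, with the word set {"艰险"}, the fragment "岂不惮艰险" would be tagged O O O B I, keeping "艰险" intact as one term.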
  (1) Sentiment word set, which is used to match terms in domain-specific texts and to label their roles automatically, and is derived mainly from general sentiment dictionaries and domain-specific sentiment vocabulary. The sentiment dictionaries commonly used in the Chinese field include the CNKI HowNet sentiment dictionary, Tsinghua University's Li Jun sentiment dictionary, the Dalian University of Technology sentiment vocabulary ontology, the National Taiwan University sentiment dictionary, etc.[30] The evaluation vocabulary in the vocabulary and appreciation fields was obtained by the author by searching web resources. After combining the two sources and removing duplicate words, a sentiment word set for the field of ancient poetry, with expanded and extended knowledge, is formed.
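The combination-and-deduplication step amounts to a set union. A minimal sketch, with placeholder contents standing in for the actual dictionaries:

```python
# Union the general sentiment dictionaries with the domain-specific
# appreciation vocabulary; the set type removes duplicates automatically.
def merge_word_sets(*dictionaries):
    merged = set()
    for d in dictionaries:
        merged |= set(d)
    return merged

# Placeholder entries, not the real dictionary contents.
general_dictionary = {"悲", "喜", "艰险"}
domain_vocabulary = {"艰险", "苍凉"}
emotion_word_set = merge_word_sets(general_dictionary, domain_vocabulary)
```

A word such as "艰险" appearing in both sources survives only once in the merged set.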
  (2) Character tagging model. There are three considerations for defining the character tagging model as a feature map. First, Chinese uses characters as its smallest knowledge unit; compared with words, character tagging can mine finer-grained semantic features. Second, because Chinese text is written without delimiters, word-level role tagging requires segmenting the domain-specific texts first; however, existing Chinese word segmentation technology is still immature, especially for ancient Chinese texts, and can easily produce wrong segmentation of text sequences and fragmentation of long terms (such as "岂不/惮/艰险"). Third, BERT is a language model for word embedding, and its Char2Vec mapping also suggests how Chinese character features can be introduced into the CRFs model. For this reason, the author adopts character role