Page 188 - Journal of Library Science in China, Vol.47, 2021

ZHANG Wei, WANG Hao, DENG Sanhong & ZHANG Baolong / Sentiment term extraction and application of Chinese ancient poetry text for digital humanities


essays were obtained, and the three-level fields "article title", "poetry essay" and "appreciation" were selected as the corpus. The marks ".", "?", "!" and ";" were then used as sentence segmentation boundaries, yielding 51,545 paragraphs, and the initial category of each paragraph was set to "0" to form the corpus to be labeled, which this paper calls the cold corpus. In addition, because the poems in this corpus are relatively short, and in order to verify its validity, the author introduces a Tang poetry corpus (excluding the poems used in this article) from the open-source "GitHub" website (https://github.com/Werneror/Poetry) and constructs a new training set from those poems and texts, so that the text size of the new training set matches that of the training set of this paper. The difference in term-extraction performance of the models on the two training sets is then compared.
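The corpus-construction step above can be sketched as follows. This is a minimal illustration, not the authors' code; it assumes the four segmentation marks are the full-width Chinese sentence-final punctuation marks 。？！；, and the function name is hypothetical.

```python
import re

# Split each appreciation text into sentence segments on the four assumed
# Chinese sentence-final marks, keeping the mark attached to its segment,
# then pair every segment with the initial category "0" to form the
# unlabeled ("cold") corpus described in the paper.
SENT_END = re.compile(r"(?<=[。？！；])")

def build_cold_corpus(texts):
    """Return (segment, "0") pairs for every non-empty sentence segment."""
    corpus = []
    for text in texts:
        for seg in SENT_END.split(text):
            seg = seg.strip()
            if seg:
                corpus.append((seg, "0"))
    return corpus
```

The zero-width lookbehind keeps each punctuation mark with the segment it closes, so no text is lost during splitting.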


               2.3 Automatic generation of emotion learning corpus of ancient poems based on cold
               start

Cold start refers to a method that, when there is little or no learning corpus, generates one by automatically labeling a cold corpus with relevant domain knowledge, so that a model for the target domain can be trained. In this paper, the core mechanism of the cold start is to match terms in the cold corpus against a sentiment word set, map text fragments into character sequences with the help of the character tagging model, and label the character roles at the same time, thereby generating the learning corpus automatically.
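The matching-and-labeling mechanism can be sketched as below. This is an illustrative assumption, not the paper's implementation: the BIO-style role tags ("B" for term-initial characters, "I" for term-internal, "O" for other) are a common convention for character tagging and are not specified in the text above.

```python
# Match entries from a sentiment word set against one cold-corpus fragment
# and emit a character-level role sequence (hypothetical BIO tag scheme).
def auto_label(fragment, emotion_words):
    tags = ["O"] * len(fragment)
    # Try longer dictionary entries first so a long term is not fragmented
    # by one of its shorter substrings.
    for word in sorted(emotion_words, key=len, reverse=True):
        start = 0
        while True:
            idx = fragment.find(word, start)
            if idx == -1:
                break
            # Only label spans whose characters are still untagged.
            if all(t == "O" for t in tags[idx:idx + len(word)]):
                tags[idx] = "B"
                for j in range(idx + 1, idx + len(word)):
                    tags[j] = "I"
            start = idx + 1
    return list(zip(fragment, tags))
```

For example, with the word set {"艰险"}, the fragment "岂不惮艰险" would be tagged O O O B I, keeping "艰险" intact as one term.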
  (1) Sentiment word set, which is used to match terms in domain-specific texts and to label their roles automatically, and is derived mainly from general sentiment dictionaries and domain-specific sentiment vocabulary. The sentiment dictionaries commonly used in the Chinese field include the CNKI HowNet sentiment dictionary, Tsinghua University's Li Jun sentiment dictionary, the Dalian University of Technology sentiment vocabulary ontology, the National Taiwan University sentiment dictionary, etc.[30] The evaluation vocabulary in the vocabulary and appreciation fields was obtained by the author by searching web resources. After combining the two sources and removing duplicate words, a sentiment word set for the field of ancient poetry, with expanded and extended knowledge, is formed.
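The combination-and-deduplication step amounts to a set union. A minimal sketch, with placeholder contents standing in for the actual dictionaries:

```python
# Union the general sentiment dictionaries with the domain-specific
# appreciation vocabulary; the set type removes duplicates automatically.
def merge_word_sets(*dictionaries):
    merged = set()
    for d in dictionaries:
        merged |= set(d)
    return merged

# Placeholder entries, not the real dictionary contents.
general_dictionary = {"悲", "喜", "艰险"}
domain_vocabulary = {"艰险", "苍凉"}
emotion_word_set = merge_word_sets(general_dictionary, domain_vocabulary)
```

A word such as "艰险" appearing in both sources survives only once in the merged set.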
  (2) Character tagging model. There are three considerations for defining the character tagging model as a feature map. First, Chinese uses characters as its smallest knowledge unit; compared with words, character tagging can mine finer-grained semantic features. Second, because Chinese text is written without delimiters, word-level role tagging requires segmenting the domain-specific texts first; however, existing Chinese word segmentation technology is still immature, especially for ancient Chinese texts, and can easily produce wrong segmentation of text sequences and fragmentation of long terms (such as "岂不/惮/艰险"). Third, BERT is a language model for word embedding, and its Char2Vec mapping also suggests how Chinese character features can be introduced into the CRFs model. For this reason, the author adopts character role