Page 123 - Journal of Library Science in China, Vol.45, 2019
2.2 Data acquisition
The metadata of the books studied in this paper and the metadata of the literature citing them
were collected from Amazon.cn① and Baidu Scholar②, respectively③. Currently,
mainstream Chinese full-text databases, such as CNKI④, WanFang⑤ and WeiPu⑥, each miss part of the
literature. Compared with using a single Chinese full-text database as the retrieval entrance, Baidu
Scholar is more likely to cover all of the citing literature. To find all the citing
literature of each book, this study therefore took Baidu Scholar as the retrieval entrance and used
the metadata of Chinese books as the retrieval keywords to obtain information on the literature
citing them. To identify book disciplines, we first matched the first-class categories
of Chinese books provided by Amazon against the Chinese discipline categories, while also
taking into account the differences between the natural sciences and the humanities and social
sciences. In the end, we identified five disciplines: computer science, law, literature, medicine
and sport science. We then obtained the citation content corpus of Chinese books from the
full-text databases through the following two steps.
(1) We selected books according to three rules: 1) more than one review on Amazon.cn; 2) more
than one citation in Baidu Scholar; 3) a table of contents is available. This yielded 6,006 books
in the five disciplines.
(2) To ensure the accuracy of the citation contents, we obtained the citation sentences and
their contexts (i.e., the two sentences preceding and the two sentences following each citation
sentence) for these books by manual annotation. In doing so, we took into account the high cost of
manual annotation and the uneven citation distribution of the 6,006 books. For example, most books
were cited between 0 and 5 times, while relatively few were cited more than 15 times.
Hence, to make the citation content data more representative, we sampled books from each citation
interval in proportion to the distribution of book citations (i.e., the share of the
6,006 books falling in each citation interval). After analyzing the distribution of citations,
we chose 0-5, 6-10, 11-15, 16-20 and more than 20 as the intervals. Since the full
texts of some citing literature could not be obtained and some full texts contained no citation
marks, we finally obtained 399 Chinese books and their citation contents in the citing literature.
The specific citation distributions of books are shown in Table 1.
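
The two steps above can be illustrated with a minimal Python sketch. It is not the procedure actually used in this study, where annotation was manual and the final 399 books also reflect full texts that were unavailable or lacked citation marks. The record fields (reviews, citations, toc), the function names, the sample size and the random seed are all hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict

# Citation intervals named in the text: 0-5, 6-10, 11-15, 16-20, more than 20.
INTERVALS = [(0, 5), (6, 10), (11, 15), (16, 20), (21, None)]


def passes_rules(book):
    """Step (1): more than one review, more than one citation, has a table of contents."""
    return book["reviews"] > 1 and book["citations"] > 1 and bool(book["toc"])


def interval_label(citations):
    """Return the label of the citation interval that contains the given count."""
    for low, high in INTERVALS:
        if high is None:
            if citations >= low:
                return "more than 20"
        elif low <= citations <= high:
            return f"{low}-{high}"
    raise ValueError(f"invalid citation count: {citations}")


def proportional_sample(books, sample_size, seed=0):
    """Step (2): sample each interval in proportion to its share of all books.

    Quotas are rounded per interval, so the total drawn can differ
    slightly from sample_size.
    """
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    strata = defaultdict(list)
    for book in books:
        strata[interval_label(book["citations"])].append(book)
    total = len(books)
    sample = []
    for members in strata.values():
        quota = min(len(members), round(sample_size * len(members) / total))
        sample.extend(rng.sample(members, quota))
    return sample


# Hypothetical usage with made-up records (not data from the study):
catalog = [
    {"reviews": 4, "citations": 3, "toc": True},
    {"reviews": 1, "citations": 9, "toc": True},    # rejected: only one review
    {"reviews": 7, "citations": 25, "toc": False},  # rejected: no table of contents
]
eligible = [b for b in catalog if passes_rules(b)]
```

Allocating quotas by each interval's share keeps the sample's citation distribution close to that of the full collection, which is the stated purpose of sampling by interval.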
① Available at: https://www.amazon.cn/.
② Available at: http://xueshu.baidu.com/.
③ The data collection was completed in November 2016.
④ Available at: http://www.cnki.net/.
⑤ Available at: http://www.wanfangdata.com.cn.
⑥ Available at: http://www.cqvip.com/.