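| Shen Si, Jia Zhi, Zhao Wenhua, Wang Dongbo. The Construction of Large Language Model for Humanities and Social Sciences Academia [J]. Journal of Library Science in China, 2026, 52(2): 107-125 |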
| The Construction of Large Language Model for Humanities and Social Sciences Academia |
| Submitted: 2026-01-29 Revised: 2026-03-15 |
| DOI: |
| Keywords: humanities and social sciences; large language model; continued pre-training; new liberal arts; library and information science |
| Funding: |
|
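| Abstract (translated from Chinese): |
"Digital intelligence + humanities" is driving the construction of the new liberal arts and supporting interdisciplinary work that extends the scope of these fields, and building domain-specific large language models is a necessary precondition for combining the two. This study constructs a humanities and social sciences academic corpus of 11.6 billion tokens. On this basis, and taking the characteristics of the humanities and social sciences into account, it develops a dedicated evaluation framework for large language models and, using the comparatively strong Qwen3 as the base model, builds HssaLLM, an academic large language model for the humanities and social sciences. The model lays a foundation for applying large language models to the humanities and social sciences and offers data support and methodological guidance for interdisciplinary research, helping the new liberal arts become more data-driven, information-oriented, and intelligent. 6 figs. 5 tabs. 21 refs.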
|
| English abstract: |
The integration of artificial intelligence and digital technologies into the humanities and social sciences (HSS) provides essential methodological support for the advancement of the "new liberal arts." However, existing general-purpose large language models (LLMs) often lack specialized domain knowledge in these fields. Furthermore, current academic models frequently rely on titles and abstracts, failing to capture the comprehensive semantic information contained in full-text academic literature. This study aims to address this gap by constructing a domain-specific LLM tailored to HSS research, thereby facilitating data-driven and computationally assisted research paradigms. The research first addresses the data scarcity issue by compiling an extensive specialized HSS academic corpus totaling approximately 11.6 billion tokens. This dataset integrates full-text resources from major Chinese and English databases (e.g., CSSCI, SSCI, A&HCI, and Project MUSE), covering both journals and classic books to ensure balanced disciplinary representation. Building upon this corpus, the study establishes a domain-specific evaluation framework comprising five core tasks: chapter title generation, conclusion generation, literature review generation, entity recognition, and automatic book classification. Qwen3-8B and Qwen3-32B were selected as the base models. The model construction followed a two-stage process: first, continued pre-training was conducted on the 11.6-billion-token corpus to incorporate domain-specific knowledge and reasoning patterns (creating HssaLLM-Base); second, multi-task instruction fine-tuning was performed using a curated dataset of 15,697 instructions (creating HssaLLM) to enhance instruction-following capabilities in academic scenarios. Comparative evaluations demonstrate that the HssaLLM series significantly outperforms general open-source models (such as Llama 3.1, GLM-4, and the original Qwen3 base) across all tested tasks.
Specifically, HssaLLM-32B achieved measurable improvements in complex tasks such as entity recognition (F1 score reaching 61.98% in Chinese contexts) and literature review generation, reducing common issues such as logical fragmentation and content hallucination. Both the HssaLLM models and the instruction fine-tuning datasets have been made publicly available. This contribution provides a technical framework for intelligent information retrieval, automated academic writing assistance, and quantitative analysis in HSS. These resources aim to advance HSS research methodologies from traditional empirical description to AI-assisted knowledge discovery, offering methodological support for the broader integration of AI within the humanities. 6 figs. 5 tabs. 21 refs.