CBooks
Source: arXiv. Indexed: 2025-09-30
Download link:
https://github.com/FudanNLPLAB/CBook-150K
Description:
CBooks is a large-scale Chinese book corpus collected via open-source MD5 book links. It provides long-range contextual information for language modeling and is recognized for improving coherent narrative generation and long-range context modeling. With more than 100,000 books, it is intended for pre-training large language models.
Provided by:
FudanNLPLAB



