ByteSpanTokenisers/common-corpus
Hugging Face | Updated 2025-06-24 | Indexed 2025-11-01
Download link:
https://hf-mirror.com/datasets/ByteSpanTokenisers/common-corpus
Description:
Common Corpus 25 is a corpus intended primarily for building and training language models. It is published in several configurations, including bytelevel and bytelevel-llm-data, as well as subsets. The data is mainly in English, and the dataset falls in the 10B to 100B size range.
Provider:
ByteSpanTokenisers



