jdpressman/comma_v0.1_training_dataset_sample_10B
Source: Hugging Face. Updated: 2025-08-23. Indexed: 2025-10-25.
Download link:
https://hf-mirror.com/datasets/jdpressman/comma_v0.1_training_dataset_sample_10B
Description:
This is a 10 billion token subset of the Comma v0.1 Training Set, intended for small deep learning experiments. It is similar in spirit to the RedPajama sample. The README includes the script used to subset the data, along with details about the process and the tools used, such as AutoTokenizer and the datasets library. However, the README does not describe the contents of the dataset itself.
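The subsetting script itself is not reproduced on this page. As a rough illustration only, cutting a corpus down to a fixed token budget could be sketched as below. The names `take_token_budget` and `whitespace_count` are hypothetical, and the whitespace counter is a stand-in for a real tokenizer; this is not the README's actual script.

```python
from typing import Callable, Iterable, Iterator


def take_token_budget(
    docs: Iterable[str],
    count_tokens: Callable[[str], int],
    budget: int,
) -> Iterator[str]:
    """Yield documents in order until the running token count reaches `budget`.

    Hypothetical helper: the last document yielded may push the total
    slightly over the budget, since documents are never split.
    """
    used = 0
    for text in docs:
        if used >= budget:
            break
        used += count_tokens(text)
        yield text


def whitespace_count(text: str) -> int:
    # Stand-in for a real tokenizer (assumption): counts whitespace-separated words.
    return len(text.split())


if __name__ == "__main__":
    corpus = ["a b c", "d e", "f g h i", "j"]
    sample = list(take_token_budget(corpus, whitespace_count, budget=5))
    print(sample)
```

In practice, `count_tokens` could wrap a tokenizer loaded with `AutoTokenizer.from_pretrained`, and `docs` could come from a dataset opened with `datasets.load_dataset(..., streaming=True)`. Which tokenizer and source dataset identifiers the README uses is not stated on this page, so they are left out here.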
Provider:
jdpressman



