jdpressman/comma_v0.1_training_dataset_sample_10B

Name: jdpressman/comma_v0.1_training_dataset_sample_10B
Creator: jdpressman
Published: 2025-08-23 18:22:53
License: No description available

Source: Hugging Face
Updated: 2025-08-23
Indexed: 2025-10-25

Download link:

https://hf-mirror.com/datasets/jdpressman/comma_v0.1_training_dataset_sample_10B

Description:

This is a 10-billion-token subset of the Comma v0.1 training set, intended for small deep learning experiments. It is similar in spirit to the RedPajama sample. The README includes the script used to subset the data, along with details about the process and the tools involved, such as AutoTokenizer and the datasets library. The README does not, however, describe the contents of the dataset itself.
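As a rough illustration of what subsetting a corpus to a token budget can look like, here is a minimal sketch. It is not the README's script: the function name `take_token_budget`, its parameters, and the whitespace tokenizer used in the demo are all assumptions made for this example.

```python
from typing import Callable, Iterable, Iterator


def take_token_budget(
    texts: Iterable[str],
    count_tokens: Callable[[str], int],
    budget: int,
) -> Iterator[str]:
    """Yield documents in order until the running token count reaches `budget`.

    Hypothetical helper for illustration only. The last document yielded may
    push the total slightly past the budget, since documents are kept whole.
    """
    used = 0
    for text in texts:
        if used >= budget:
            break
        used += count_tokens(text)
        yield text


if __name__ == "__main__":
    # Stand-in tokenizer: counts whitespace-separated words (an assumption;
    # a real run would count tokens with the model's own tokenizer).
    docs = ["a b c", "d e", "f g h i", "j"]
    sample = list(take_token_budget(docs, lambda t: len(t.split()), budget=5))
    print(sample)
```

In a real setting, `texts` could be the text field of a dataset loaded with `datasets.load_dataset(..., streaming=True)`, and `count_tokens` could be `lambda t: len(tokenizer(t)["input_ids"])` with `tokenizer` obtained from `AutoTokenizer.from_pretrained(...)`. Which tokenizer and source repository the original script uses is not stated here, so those details are left open.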

Provided by:

jdpressman
