chengjunyan1/smollm-10
Source: Hugging Face | Updated: 2024-08-05 | Indexed: 2024-12-14
Download link:
https://hf-mirror.com/datasets/chengjunyan1/smollm-10
Description:
The cosmopedia-v2 subset has the features prompt, text, token_length, audience, format, and seed_data; it is intended primarily for training and contains 4,537,737 samples. The fineweb-edu-dedup subset has the features text, id, and metadata, where metadata holds the sub-features dump, url, date, file_path, language, language_score, token_count, score, and int_score; it is intended primarily for training and contains 22,701,367 samples. The python-edu subset has the features blob_id, repo_name, path, length_bytes, score, and int_score; it is intended primarily for training and contains 864,074 samples.
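The subset sizes and feature lists above can be captured in a small Python sketch, which may help when deciding which subset to pull first. The counts and feature names are copied from the description; the `load_subset` helper is hypothetical and rests on two assumptions not confirmed by this page: that each subset is exposed as a configuration of the same name, and that its split is called `train`. It uses the Hugging Face `datasets` library in streaming mode so that nothing is downloaded in full.

```python
# Sample counts per subset, as stated in the description.
SUBSETS = {
    "cosmopedia-v2": 4_537_737,
    "fineweb-edu-dedup": 22_701_367,
    "python-edu": 864_074,
}

# Top-level features per subset, as listed in the description.
FEATURES = {
    "cosmopedia-v2": ["prompt", "text", "token_length", "audience", "format", "seed_data"],
    "fineweb-edu-dedup": ["text", "id", "metadata"],
    "python-edu": ["blob_id", "repo_name", "path", "length_bytes", "score", "int_score"],
}


def subset_shares(counts):
    """Return each subset's fraction of the total sample count."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def load_subset(name, streaming=True):
    """Hypothetical helper: open one subset with the `datasets` library.

    Assumptions (not confirmed by the listing): the configuration name
    equals the subset name, and the split is named "train".
    """
    if name not in SUBSETS:
        raise ValueError(f"unknown subset: {name!r}")
    from datasets import load_dataset  # imported lazily; requires `pip install datasets`
    return load_dataset("chengjunyan1/smollm-10", name, split="train", streaming=streaming)


if __name__ == "__main__":
    print("total samples:", sum(SUBSETS.values()))
    for name, share in subset_shares(SUBSETS).items():
        print(f"{name}: {share:.1%}")
```

If the assumptions hold, `next(iter(load_subset("python-edu")))` would yield a single record whose keys can be compared against `FEATURES["python-edu"]`.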
Provided by:
chengjunyan1



