Trelis/smollm-corpus-2percent
Source: Hugging Face | Updated: 2024-09-17 | Indexed: 2024-12-14
Download link:
https://hf-mirror.com/datasets/Trelis/smollm-corpus-2percent
Description:
The dataset provides several configurations: cosmopedia, default, fineweb, and eight fineweb chunks (fineweb_chunk_0 through fineweb_chunk_7). The cosmopedia and default configurations have the features prompt, text, token_length, audience, format, and seed_data. The fineweb configuration and its chunks have the features text, id, and metadata, where metadata records the date, dump, file path, integer score, language, language score, score, token count, and URL. Each configuration also lists the size in bytes and the number of examples of its training split.
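The configuration layout described above can be sketched in Python as a small schema check. This is an illustrative sketch, not part of the dataset itself: the helper names (expected_fields, missing_fields, load_train_split) are hypothetical, the exact spellings of the metadata keys (for example file_path, int_score) are assumptions inferred from the description, and the loader assumes the Hugging Face `datasets` library is installed and the repository is reachable.

```python
# Config names as listed in the description: three base configs plus 8 fineweb chunks.
CONFIGS = ["cosmopedia", "default", "fineweb"] + [
    f"fineweb_chunk_{i}" for i in range(8)
]

# Top-level features named in the description.
COSMOPEDIA_FIELDS = {"prompt", "text", "token_length", "audience", "format", "seed_data"}
FINEWEB_FIELDS = {"text", "id", "metadata"}

# ASSUMPTION: key spellings below are guesses based on the prose description
# ("file path", "integer score", ...); check the dataset card for the real names.
FINEWEB_METADATA_FIELDS = {
    "date", "dump", "file_path", "int_score", "language",
    "language_score", "score", "token_count", "url",
}


def expected_fields(config):
    """Return the top-level feature names expected for a configuration."""
    if config not in CONFIGS:
        raise ValueError(f"unknown configuration: {config!r}")
    return FINEWEB_FIELDS if config.startswith("fineweb") else COSMOPEDIA_FIELDS


def missing_fields(config, record):
    """List the expected fields that are absent from one record (a dict)."""
    missing = expected_fields(config) - set(record)
    if config.startswith("fineweb") and isinstance(record.get("metadata"), dict):
        absent = FINEWEB_METADATA_FIELDS - set(record["metadata"])
        missing |= {f"metadata.{key}" for key in absent}
    return sorted(missing)


def load_train_split(config):
    """Stream the train split of one configuration (needs `datasets` and network)."""
    from datasets import load_dataset  # imported lazily so the checks above work offline

    return load_dataset(
        "Trelis/smollm-corpus-2percent", config, split="train", streaming=True
    )
```

For example, missing_fields("default", {"prompt": "p", "text": "t"}) reports the four cosmopedia-style features that the record lacks. Streaming is used in the loader so that a single record can be inspected without downloading a whole configuration.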
Provider:
Trelis



