kanishka/babylm2-sentence-tokenized
Hugging Face | Updated 2024-08-07 | Indexed 2024-12-14
Download link:
https://hf-mirror.com/datasets/kanishka/babylm2-sentence-tokenized
Resource description:
The dataset has a single field, text, of type string. It is divided into a training split and a validation split. The training split contains 13,060,876 examples (590,245,654 bytes), and the validation split contains 1,317,808 examples (61,742,239 bytes). The download size is 382,795,379 bytes, and the total size is 651,987,893 bytes, which is the sum of the two splits. The dataset's configuration section lists the data file paths for each split.
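The figures above can be cross-checked, and the splits read, with a short Python sketch. It is illustrative only. The helper names (`total_bytes`, `mean_bytes_per_example`, `stream_examples`) are hypothetical and not part of the dataset. The loader assumes the Hugging Face `datasets` library is installed and that the dataset is reachable on the Hub under the ID shown in the title; nothing in the listing confirms either.

```python
# Split sizes exactly as stated in the resource description.
SPLITS = {
    "train": {"num_examples": 13_060_876, "num_bytes": 590_245_654},
    "validation": {"num_examples": 1_317_808, "num_bytes": 61_742_239},
}
STATED_TOTAL_BYTES = 651_987_893


def total_bytes(splits=SPLITS):
    """Sum the byte sizes of all splits."""
    return sum(s["num_bytes"] for s in splits.values())


def mean_bytes_per_example(split, splits=SPLITS):
    """Average size in bytes of one `text` value in the given split."""
    s = splits[split]
    return s["num_bytes"] / s["num_examples"]


def stream_examples(split="train", limit=3):
    """Yield up to `limit` values of the `text` field from one split.

    Assumption: requires `pip install datasets` and network access.
    Streaming avoids downloading the full dataset up front.
    """
    from datasets import load_dataset  # imported lazily; optional dependency

    ds = load_dataset(
        "kanishka/babylm2-sentence-tokenized", split=split, streaming=True
    )
    for i, row in enumerate(ds):
        if i >= limit:
            break
        yield row["text"]


if __name__ == "__main__":
    # Offline checks only; call stream_examples() separately to hit the Hub.
    print("total matches stated size:", total_bytes() == STATED_TOTAL_BYTES)
    for name in SPLITS:
        print(name, "mean bytes per example:", round(mean_bytes_per_example(name), 2))
```

To read a few rows, one might call `for text in stream_examples("validation", limit=5): print(text)`. The mean of roughly 45 to 47 bytes per example is consistent with each example being a single sentence, as the dataset name suggests.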
Provider:
kanishka



