deatos/mixtraltoken_fineweb_edu_mini_combined
Source: Hugging Face. Updated 2024-07-02. Indexed 2024-07-06.
Download link:
https://hf-mirror.com/datasets/deatos/mixtraltoken_fineweb_edu_mini_combined
Overview:
The dataset has twelve features: id, dump, url, file_path, language, language_score, token_count, score, int_score, input_ids, attention_mask, and labels. It is split into a training set of 2,182,000 samples and a validation set of 182,101 samples. The download size is 4,550,891,871 bytes and the total dataset size is 16,514,852,838 bytes.
Provider:
deatos
Summary of original information
Dataset overview
Dataset features
- id: string
- dump: string
- url: string
- file_path: string
- language: string
- language_score: float
- token_count: integer
- score: float
- int_score: integer
- input_ids: sequence of integers
- attention_mask: sequence of integers
- labels: sequence of integers
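The feature list can be turned into a simple per-record check. The following is a minimal sketch using only the Python standard library. The function name `schema_errors` and the example record are hypothetical and chosen for illustration; only the column names and their types come from the listing.

```python
from typing import Any

# Column names and types as given in the feature list.
SCALAR_TYPES = {
    "id": str,
    "dump": str,
    "url": str,
    "file_path": str,
    "language": str,
    "language_score": float,
    "token_count": int,
    "score": float,
    "int_score": int,
}
SEQUENCE_COLUMNS = ("input_ids", "attention_mask", "labels")


def _is_int(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def schema_errors(record: dict[str, Any]) -> list[str]:
    """Return human-readable schema problems; an empty list means the record fits."""
    errors = []
    for name, expected in SCALAR_TYPES.items():
        if name not in record:
            errors.append(f"missing column: {name}")
            continue
        value = record[name]
        ok = _is_int(value) if expected is int else isinstance(value, expected)
        if not ok:
            errors.append(f"{name}: expected {expected.__name__}")
    for name in SEQUENCE_COLUMNS:
        if name not in record:
            errors.append(f"missing column: {name}")
        elif not isinstance(record[name], list) or not all(_is_int(v) for v in record[name]):
            errors.append(f"{name}: expected a list of integers")
    return errors


# Hypothetical record, invented for illustration only (not real dataset content).
example = {
    "id": "example-id", "dump": "example-dump", "url": "https://example.com/",
    "file_path": "example/path", "language": "en", "language_score": 0.98,
    "token_count": 3, "score": 3.2, "int_score": 3,
    "input_ids": [1, 2, 3], "attention_mask": [1, 1, 1], "labels": [1, 2, 3],
}
print(schema_errors(example))
```

A check like this is strict about floats: an integer stored in `score` would be reported, which may or may not be desirable depending on how the records are decoded.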
Dataset splits
- train:
  - Bytes: 15243206570
  - Samples: 2182000
- validation:
  - Bytes: 1271646268
  - Samples: 182101
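One plausible way to read either split is the Hugging Face `datasets` library in streaming mode, which avoids downloading every shard up front. This sketch assumes that `datasets` is installed and that the repository is still published on the Hub under the ID shown in the download link. The helper name `stream_split` is hypothetical.

```python
REPO_ID = "deatos/mixtraltoken_fineweb_edu_mini_combined"
SPLITS = ("train", "validation")  # split names from the listing


def stream_split(split: str):
    """Return a streaming dataset for one split (needs `pip install datasets`)."""
    if split not in SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {SPLITS}")
    # Imported lazily so the module loads even without the third-party package.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split=split, streaming=True)


# Usage (requires network access):
#   first = next(iter(stream_split("validation")))
#   print(sorted(first.keys()))
```

Streaming is a reasonable default here because the training split alone is listed at about 15 GB once decoded.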
Dataset size
- Download size: 4550891871 bytes
- Total dataset size: 16514852838 bytes
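The listed figures can be cross-checked with a little arithmetic. The sketch below confirms that the two split sizes add up to the stated total, then derives the ratio of total size to download size and the average bytes per sample. Reading the download size as the compressed files and the total as the decoded data follows the usual Hugging Face convention and is an assumption here, not something the listing states.

```python
# Figures copied from the listing.
TRAIN_BYTES = 15_243_206_570
VALIDATION_BYTES = 1_271_646_268
TOTAL_BYTES = 16_514_852_838
DOWNLOAD_BYTES = 4_550_891_871
TRAIN_SAMPLES = 2_182_000
VALIDATION_SAMPLES = 182_101

# The per-split byte counts should sum exactly to the stated total.
splits_match_total = TRAIN_BYTES + VALIDATION_BYTES == TOTAL_BYTES

# How much larger the total is than the download.
size_ratio = TOTAL_BYTES / DOWNLOAD_BYTES

# Average size of one sample in each split.
train_bytes_per_sample = TRAIN_BYTES / TRAIN_SAMPLES
validation_bytes_per_sample = VALIDATION_BYTES / VALIDATION_SAMPLES

# Share of all samples that fall in the validation split.
validation_fraction = VALIDATION_SAMPLES / (TRAIN_SAMPLES + VALIDATION_SAMPLES)

print(f"splits sum to total: {splits_match_total}")
print(f"total / download:    {size_ratio:.2f}x")
print(f"train bytes/sample:  {train_bytes_per_sample:.0f}")
print(f"valid bytes/sample:  {validation_bytes_per_sample:.0f}")
print(f"validation share:    {validation_fraction:.1%}")
```

The sum matches exactly, and both splits average just under 7 KB per sample, which suggests the two splits hold similarly sized records.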
Configuration
- config_name: default
- data_files:
  - train: data/train-*
  - validation: data/validation-*
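The `data_files` patterns map shard files to splits by glob. As a rough illustration, the sketch below applies the two patterns with Python's `fnmatch`. The shard file names are hypothetical, and the actual Hub resolver may use slightly different glob semantics, so treat this as an approximation of the idea rather than the library's behavior.

```python
from fnmatch import fnmatch

# Patterns copied from the configuration.
DATA_FILES = {
    "train": "data/train-*",
    "validation": "data/validation-*",
}


def assign_splits(paths: list[str]) -> dict[str, list[str]]:
    """Group file paths by the split whose glob pattern they match."""
    result: dict[str, list[str]] = {split: [] for split in DATA_FILES}
    for path in paths:
        for split, pattern in DATA_FILES.items():
            if fnmatch(path, pattern):
                result[split].append(path)
    return result


# Hypothetical repository listing; the real shard names are not given in the source.
repo_files = [
    "README.md",
    "data/train-00000-of-00002.parquet",
    "data/train-00001-of-00002.parquet",
    "data/validation-00000-of-00001.parquet",
]
print(assign_splits(repo_files))
```

Files that match neither pattern, such as `README.md` in this example, are simply left out of both splits.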