EliMC/fineweb-edu-10BT-mincols
Hugging Face · Updated 2025-12-05 · Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/EliMC/fineweb-edu-10BT-mincols
Resource description:
---
dataset_info:
  features:
  - name: text
    dtype: string
  - name: id
    dtype: string
  - name: url
    dtype: string
  splits:
  - name: train
    num_bytes: 47250994990
    num_examples: 9672101
  download_size: 28314342419
  dataset_size: 47250994990
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
source_datasets: HuggingFaceFW/fineweb-edu
license: odc-by
task_categories:
- text-generation
language:
- en
tags:
- 10BT
size_categories:
- 1M<n<10M
---
# fineweb-edu: 10BT sample
This is the "10BT-sample" config of `HuggingFaceFW/fineweb-edu` with most of the redundant columns removed for efficiency. Only the `text`, `id`, and `url` columns are kept.
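A minimal sketch of reading this dataset in streaming mode with the `datasets` library, assuming the repo id from the page title and the `train` split from the card metadata. The `keep_columns` helper is hypothetical: it only illustrates the kind of column pruning described above and is not the author's actual preprocessing.

```python
from typing import Any, Dict, Iterable, Iterator, Tuple

# Columns listed in this card's metadata.
MIN_COLUMNS: Tuple[str, ...] = ("text", "id", "url")


def keep_columns(
    records: Iterable[Dict[str, Any]],
    columns: Tuple[str, ...] = MIN_COLUMNS,
) -> Iterator[Dict[str, Any]]:
    """Yield each record restricted to `columns` (hypothetical helper)."""
    for record in records:
        yield {name: record[name] for name in columns}


def stream_train_split(repo_id: str = "EliMC/fineweb-edu-10BT-mincols"):
    """Open the train split lazily; needs `pip install datasets` and network access."""
    # Imported here so that keep_columns stays usable without the dependency.
    from datasets import load_dataset

    # streaming=True avoids downloading the full ~28 GB of data files up front.
    return load_dataset(repo_id, split="train", streaming=True)


# Example usage (not run here because it requires network access):
#   for row in stream_train_split().take(3):
#       print(row["id"], row["url"], len(row["text"]))
```

Streaming is a reasonable default for a first look, since the card reports a download size of about 28 GB.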
## token counts
Per-example token counts, measured with the GPT-4 tiktoken tokenizer:
```
        token_count
count  9.672101e+06
mean   1.001188e+03
std    1.834986e+03
min    3.800000e+01
25%    3.380000e+02
50%    6.090000e+02
75%    1.054000e+03
max    1.649670e+05
```
- Total count: 9683.59 M tokens
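The total above is consistent with the summary statistics, since count × mean ≈ 9.672101e6 × 1001.188 ≈ 9683.59 M. Below is a minimal sketch of how such counts could be reproduced, assuming the `tiktoken` package and its GPT-4 encoding. The function names are illustrative, and the author's exact counting settings are not stated in the card.

```python
from typing import Callable, Iterable, List, Sequence


def tiktoken_encoder(model: str = "gpt-4") -> Callable[[str], Sequence[int]]:
    """Return an encode function for `model`; needs `pip install tiktoken`."""
    import tiktoken

    enc = tiktoken.encoding_for_model(model)
    # Assumption: special-token strings found in web text are encoded as plain text.
    return lambda text: enc.encode(text, disallowed_special=())


def token_counts(
    texts: Iterable[str], encode: Callable[[str], Sequence]
) -> List[int]:
    """Number of tokens per document under the given encode function."""
    return [len(encode(text)) for text in texts]


def total_from_summary(count: float, mean: float) -> float:
    """Total tokens, in millions, implied by a count and a mean."""
    return count * mean / 1e6


# Sanity check against the summary statistics above.
print(f"{total_from_summary(9.672101e6, 1.001188e3):.2f} M tokens")  # 9683.59 M tokens
```

Passing the encoder in as a callable keeps `token_counts` independent of any one tokenizer, so the same helper can be reused with a different vocabulary.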
Provider:
EliMC



