melaniaghirda/fineweb-edu-subset
Hugging Face · Updated 2026-03-05 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/melaniaghirda/fineweb-edu-subset
Description:
---
task_categories:
- text-generation
language:
- en
dataset_info:
  config_name: data
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 21893418334
    num_examples: 4020432
  download_size: 12544429047
  dataset_size: 21893418334
configs:
- config_name: data
  data_files:
  - split: train
    path: data/train-*
  default: true
---
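As a rough sanity check on the metadata above, the sketch below derives a few summary figures from the listed sizes. The three constants are copied from the `dataset_info` block; the helper name `summarize` and the reading of `download_size` as the on-disk Parquet size are assumptions made for illustration.

```python
# Values copied from the dataset_info block of this card.
NUM_BYTES = 21_893_418_334      # num_bytes / dataset_size
DOWNLOAD_SIZE = 12_544_429_047  # download_size (assumed: Parquet files as downloaded)
NUM_EXAMPLES = 4_020_432        # num_examples in the train split


def summarize(num_bytes: int, download_size: int, num_examples: int) -> dict:
    """Derive simple size statistics from dataset card metadata."""
    return {
        "avg_bytes_per_example": num_bytes / num_examples,
        "download_to_dataset_ratio": download_size / num_bytes,
        "dataset_gib": num_bytes / 2**30,
    }


stats = summarize(NUM_BYTES, DOWNLOAD_SIZE, NUM_EXAMPLES)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

These numbers only describe storage size; the token count will depend on the tokenizer chosen later in the pipeline.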
A small slice of the original HuggingFaceFW/fineweb-edu dataset, taken from `data/CC-MAIN-2025-26`.<br>
Files: the first 10 `.parquet` files, split `train`, keeping only the `text` column and only rows with `language_score >= 0.9`.<br>
Pipeline: the dataset will be tokenized and then used to train a 124M-parameter GPT model.
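The filtering described above can be sketched roughly as follows. This is a minimal illustration, not the script actually used to build the subset: `filter_rows`, `stream_subset`, and the sample rows are hypothetical, and the only details taken from the card are the `language_score >= 0.9` threshold, the `text` column, the `data` config, and the `train` split.

```python
from typing import Iterable, Iterator

MIN_LANGUAGE_SCORE = 0.9  # threshold stated on the card


def filter_rows(rows: Iterable[dict],
                min_score: float = MIN_LANGUAGE_SCORE) -> Iterator[dict]:
    """Keep rows at or above the score threshold, retaining only the text column."""
    for row in rows:
        if row["language_score"] >= min_score:
            yield {"text": row["text"]}


def stream_subset():
    """Stream the published subset from the Hub (requires the `datasets` package)."""
    from datasets import load_dataset  # imported lazily so the filter runs without it

    return load_dataset(
        "melaniaghirda/fineweb-edu-subset", "data", split="train", streaming=True
    )


# Hypothetical rows shaped like fineweb-edu records, for illustration only.
sample = [
    {"text": "first document", "language_score": 0.97},
    {"text": "second document", "language_score": 0.85},
    {"text": "third document", "language_score": 0.90},
]
kept = list(filter_rows(sample))
print(kept)
```

Because the published subset already contains only the `text` column, `filter_rows` would apply to the upstream fineweb-edu files rather than to the output of `stream_subset`.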
Provider:
melaniaghirda



