blythet/fineweb-edu-top1m
Source: Hugging Face. Updated 2026-02-17. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/blythet/fineweb-edu-top1m
Description:
---
license: odc-by
task_categories:
- text-generation
- text-classification
size_categories:
- n<1M
tags:
- fineweb-edu
- educational
- high-quality
- filtered
dataset_info:
  features:
  - name: text
    dtype: string
  - name: score
    dtype: float64
  - name: url
    dtype: string
  - name: id
    dtype: string
---
# FineWebEdu Top 1M (Score >= 4.0)
The 1 million documents with the highest educational-quality scores from [HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu), all with score >= 4.0 on a 0-5 scale.
## Quick Use
```python
from datasets import load_dataset
ds = load_dataset("blythet/fineweb-edu-top1m", split="train")
```
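To look at a few rows without downloading the full ~1.9GB of Parquet, the split can also be opened in streaming mode. The following is a minimal sketch: `peek` and `stream_rows` are hypothetical helper names introduced here for illustration, not part of the dataset or of the `datasets` library.

```python
from itertools import islice


def peek(rows, n=3):
    """Return the first n rows of any iterable of row dicts."""
    return list(islice(rows, n))


def stream_rows(repo_id="blythet/fineweb-edu-top1m"):
    """Open the train split lazily (needs the `datasets` package and network access)."""
    # Imported inside the function so that `peek` stays usable without `datasets`.
    from datasets import load_dataset

    return load_dataset(repo_id, split="train", streaming=True)


# Usage (requires network access):
# for row in peek(stream_rows()):
#     print(row["score"], row["url"])
```

Streaming yields rows in stored order, so if the card's ordering holds, the first rows should be among the highest-scoring ones.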
## Details
- **Source**: HuggingFaceFW/fineweb-edu
- **Filter**: `score >= 4.0` (top ~2.2% of FineWebEdu)
- **Size**: 1M documents, ~1.9 GB compressed Parquet
- **Average score**: 4.182
- **Average length**: ~5,079 characters per document
- **Columns**: text, score, url, id
- **Sorted by**: score descending (highest quality first)
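The filter and ordering listed above can be sketched in plain Python. This is an illustration under stated assumptions, not the pipeline actually used to build the dataset: `select_top` and `is_valid_shard` are hypothetical names, and the records are assumed to be dicts with the four columns from the card.

```python
import heapq

MIN_SCORE = 4.0  # threshold stated in the card
EXPECTED_COLUMNS = {"text", "score", "url", "id"}


def select_top(records, limit, min_score=MIN_SCORE):
    """Keep records with score >= min_score and return at most `limit`, highest score first."""
    eligible = (r for r in records if r["score"] >= min_score)
    # heapq.nlargest returns its result sorted from largest to smallest key.
    return heapq.nlargest(limit, eligible, key=lambda r: r["score"])


def is_valid_shard(rows, min_score=MIN_SCORE):
    """Check the expected columns, the score threshold, and descending score order."""
    scores = [r["score"] for r in rows]
    return (
        all(set(r) == EXPECTED_COLUMNS for r in rows)
        and all(s >= min_score for s in scores)
        and all(a >= b for a, b in zip(scores, scores[1:]))
    )
```

A check like `is_valid_shard` can be run on a small slice of the loaded data to confirm that the documented properties hold before relying on them.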
## License
ODC-By 1.0 (same as FineWebEdu), subject to CommonCrawl Terms of Use.
Provider:
blythet



