blythet/diverse-2.5m
Source: Hugging Face. Updated 2026-02-22; indexed 2026-04-12.
Download link: https://hf-mirror.com/datasets/blythet/diverse-2.5m
Resource description:
---
license: odc-by
task_categories:
- text-generation
language:
- en
size_categories:
- 1M<n<10M
tags:
- diverse
- curated
- deduplication
- multi-domain
- stem
- legal
- scientific
- encyclopedic
- source-text
configs:
- config_name: default
  data_files:
  - split: train
    path: cot_diverse_2.5m.parquet
pretty_name: Diverse Source Text Dataset (2.5M)
dataset_info:
  features:
  - name: text
    dtype: string
  - name: id
    dtype: string
  - name: url
    dtype: string
  - name: source
    dtype: string
  - name: quality_score
    dtype: float64
  splits:
  - name: train
    num_examples: 2500000
---
# Diverse Source Text Dataset (2.5M)
A curated, deduplicated, multi-domain English text dataset that blends seven sources across STEM, legal, scientific, encyclopedic, Q&A, and general knowledge domains. It is designed as high-quality, diverse source material for downstream NLP tasks such as synthetic data generation, fine-tuning, and text analysis.
## Dataset Summary
| | |
|---|---|
| **Total samples** | 2,500,000 |
| **Estimated tokens** | ~2.8B (GPT-2) / ~2.4B (modern tokenizers) |
| **Language** | English |
| **Format** | Parquet (ZSTD compressed) |
| **File size** | 4.28 GB |
| **Text length** | 200 - 50,000 characters |
| **Mean length** | 4,656 characters (~1,107 tokens) |
| **Median length** | 2,439 characters |
## Source Breakdown
| Source | Samples | Share | Avg Chars | Avg Tok/Doc | Quality Score | Domain |
|--------|--------:|------:|----------:|------------:|--------------:|--------|
| FineWeb EDU (broad, 3.0-4.0) | 750,000 | 30% | 4,997 | 1,063 | 3.39 | General educational |
| DCLM-baseline | 500,000 | 20% | 2,295 | 572 | 0.89 | Commonsense / explanatory |
| FineWeb EDU (high, >= 4.0) | 375,000 | 15% | 4,923 | 1,023 | 4.18 | STEM / high-quality educational |
| Pile - FreeLaw | 250,000 | 10% | 14,458 | 3,781 | N/A | Legal (court opinions, filings) |
| Pile - PubMed Abstracts | 250,000 | 10% | 1,335 | 292 | N/A | Biomedical / scientific |
| Pile - StackExchange | 200,000 | 8% | 2,190 | 761 | N/A | Technical Q&A |
| Pile - Wikipedia (en) | 175,000 | 7% | 2,923 | 685 | N/A | Encyclopedic |
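
The shares and the overall mean length can be recomputed from the per-source rows as a quick consistency check. The sketch below hard-codes the sample counts and average character lengths from the table; the variable names are illustrative only.

```python
# (source, samples, avg chars) copied from the Source Breakdown table
rows = [
    ("FineWeb EDU (broad, 3.0-4.0)", 750_000, 4_997),
    ("DCLM-baseline", 500_000, 2_295),
    ("FineWeb EDU (high, >= 4.0)", 375_000, 4_923),
    ("Pile - FreeLaw", 250_000, 14_458),
    ("Pile - PubMed Abstracts", 250_000, 1_335),
    ("Pile - StackExchange", 200_000, 2_190),
    ("Pile - Wikipedia (en)", 175_000, 2_923),
]

total = sum(n for _, n, _ in rows)

# Share of each source in the final dataset
shares = {name: n / total for name, n, _ in rows}

# Sample-weighted mean of the per-source average lengths
weighted_mean_chars = sum(n * chars for _, n, chars in rows) / total

print(total, round(weighted_mean_chars))
```

The weighted mean rounds to 4,656 characters, which agrees with the mean length reported in the Dataset Summary.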
## Schema
```
text: string # The document text (200-50,000 chars)
id: string # Unique document identifier from source
url: string # Source URL (null for Pile sources)
source: string # One of 7 source labels
quality_score: float64 # Source-specific quality score (null for Pile sources)
```
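
A row-level check against this schema might look like the following. It is a minimal sketch using plain Python types rather than any particular validation library; `SCHEMA` and `validate_row` are hypothetical names, not part of the dataset tooling.

```python
from typing import Any, Mapping

# Field name -> (expected Python type, nullable), following the schema above.
SCHEMA = {
    "text": (str, False),
    "id": (str, False),
    "url": (str, True),              # null for Pile sources
    "source": (str, False),
    "quality_score": (float, True),  # null for Pile sources
}


def validate_row(row: Mapping[str, Any]) -> list:
    """Return a list of problems; an empty list means the row conforms."""
    problems = []
    for name, (expected, nullable) in SCHEMA.items():
        if name not in row:
            problems.append(f"missing field: {name}")
            continue
        value = row[name]
        if value is None:
            if not nullable:
                problems.append(f"{name} must not be null")
        elif not isinstance(value, expected):
            problems.append(f"{name} should be {expected.__name__}")
    text = row.get("text")
    if isinstance(text, str) and not 200 <= len(text) <= 50_000:
        problems.append("text length outside 200-50,000 chars")
    return problems
```

Running `validate_row` over a sample of rows is a cheap way to confirm that null handling for `url` and `quality_score` matches expectations before building filters on those fields.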
## Methodology
### Collection
- **FineWeb EDU**: Streamed from [HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) across 12 Common Crawl dumps, filtered by educational quality score
- **DCLM-baseline**: Streamed from [mlfoundations/dclm-baseline-1.0](https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0) with fasttext quality score >= 0.65
- **Pile subsets**: Streamed from [monology/pile-uncopyrighted](https://huggingface.co/datasets/monology/pile-uncopyrighted), filtered by subset name
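
The score thresholds above can be written as small predicates. This is a sketch under stated assumptions: the function names are hypothetical, the label `fineweb_edu_broad` is invented for illustration (only `fineweb_edu_high` appears in this card), and the broad band is assumed to be the half-open range from 3.0 up to but not including 4.0.

```python
from typing import Optional


def fineweb_edu_bucket(score: float) -> Optional[str]:
    """Map a FineWeb EDU educational score to a bucket label, or None to skip."""
    if score >= 4.0:
        return "fineweb_edu_high"   # label used in the Usage section
    if 3.0 <= score < 4.0:
        return "fineweb_edu_broad"  # hypothetical label for the 3.0-4.0 band
    return None


def keep_dclm(fasttext_score: float) -> bool:
    """DCLM-baseline rows are kept at a fastText quality score of 0.65 or above."""
    return fasttext_score >= 0.65
```

In a streaming setup, each predicate would be applied to the corresponding score field of the upstream row; the upstream field names are not specified in this card and would need to be checked against each source dataset.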
### Filtering
- Minimum 200 characters, maximum 50,000 characters
- 20% over-fetch to absorb deduplication losses
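
The two rules above amount to a length predicate and a fetch budget. The sketch below uses hypothetical helper names and integer arithmetic for the 20% margin so that the result does not depend on floating-point rounding.

```python
MIN_CHARS, MAX_CHARS = 200, 50_000
OVERFETCH_PERCENT = 20  # extra rows fetched to absorb deduplication losses


def within_length_bounds(text: str) -> bool:
    """True when the document length falls inside the accepted range."""
    return MIN_CHARS <= len(text) <= MAX_CHARS


def fetch_target(final_count: int) -> int:
    """Rows to fetch for a given final count, rounded up to a whole row."""
    return (final_count * (100 + OVERFETCH_PERCENT) + 99) // 100


print(fetch_target(2_500_000))
```

For the full 2,500,000-row target this gives 3,000,000 fetched rows, the denominator used in the deduplication totals.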
### Deduplication (3-stage)
1. **Exact text dedup**: MD5 hash of normalized text (lowercased, whitespace-collapsed) - removed 70,433 (2.3%)
2. **URL dedup**: Normalized URL matching - removed 19,283
3. **Near-dedup (anchor pairs)**: Three passes using MD5 hashes of text start/mid/end 500-char anchors - removed 3,353

Total removed: 93,069 / 3,000,000 (3.1%)
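
The three stages can be sketched roughly as follows. This is an illustrative reading, not the pipeline that produced the dataset: it folds the stages into a single pass that keeps first occurrences, the URL normalization (drop scheme, query, and fragment; lowercase the host; strip a trailing slash) is an assumption, and "anchor pairs" is interpreted as treating two documents as near-duplicates when they share any two of the three anchor hashes.

```python
import hashlib
import re
from itertools import combinations
from urllib.parse import urlsplit

ANCHOR = 500  # anchor length in characters


def md5(s: str) -> str:
    return hashlib.md5(s.encode("utf-8")).hexdigest()


def normalize_text(text: str) -> str:
    # Lowercase and collapse every whitespace run to a single space.
    return re.sub(r"\s+", " ", text.lower()).strip()


def normalize_url(url: str) -> str:
    # Assumed normalization; the card does not specify the exact rules.
    parts = urlsplit(url)
    return parts.netloc.lower() + parts.path.rstrip("/")


def anchors(text: str) -> tuple:
    # Start, middle, and end windows of the normalized text.
    mid = max(0, len(text) // 2 - ANCHOR // 2)
    return (text[:ANCHOR], text[mid:mid + ANCHOR], text[-ANCHOR:])


def dedup(rows):
    """Yield rows that survive all three checks, keeping first occurrences."""
    seen_text, seen_url, seen_pairs = set(), set(), set()
    for row in rows:
        norm = normalize_text(row["text"])
        text_key = md5(norm)
        if text_key in seen_text:          # stage 1: exact text
            continue
        url_key = normalize_url(row["url"]) if row.get("url") else None
        if url_key is not None and url_key in seen_url:  # stage 2: URL
            continue
        hashes = [md5(a) for a in anchors(norm)]
        # One key per anchor pair: (start, mid), (start, end), (mid, end).
        pair_keys = {
            (i, j, hashes[i], hashes[j]) for i, j in combinations(range(3), 2)
        }
        if pair_keys & seen_pairs:         # stage 3: shared anchor pair
            continue
        seen_text.add(text_key)
        if url_key is not None:
            seen_url.add(url_key)
        seen_pairs |= pair_keys
        yield row
```

Requiring two matching anchors rather than one keeps documents that merely share a common header or footer, at the cost of missing near-duplicates that differ in two of the three windows.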
### Final Assembly
- Each source trimmed to its exact target count, prioritizing the highest quality scores (Pile subsets carry no quality score)
- Globally shuffled via deterministic hash (seed=42)
- Written as single Parquet file with ZSTD compression
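
One way to realize the trim and the deterministic shuffle is to rank rows by score and then sort by a hash of the seed and the document id. The sketch below assumes that construction; the card does not say which hash or key was used, and the helper names are hypothetical.

```python
import hashlib

SEED = 42


def trim_to_target(rows, target):
    """Keep the `target` rows with the highest quality_score; nulls sort last."""
    ranked = sorted(
        rows,
        key=lambda r: (r["quality_score"] is None, -(r["quality_score"] or 0.0)),
    )
    return ranked[:target]


def shuffle_key(row, seed=SEED):
    # Hashing seed and id together gives a stable pseudo-random sort key.
    return hashlib.md5(f"{seed}:{row['id']}".encode("utf-8")).hexdigest()


def assemble(rows_by_source, targets):
    """Trim every source to its target count, then shuffle globally."""
    kept = []
    for source, rows in rows_by_source.items():
        kept.extend(trim_to_target(rows, targets[source]))
    return sorted(kept, key=shuffle_key)
```

Because the order depends only on the seed and the ids, rerunning the assembly on the same inputs reproduces the same row order without storing any random state.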
## Usage
```python
from datasets import load_dataset

ds = load_dataset("blythet/diverse-2.5m", split="train")
print(ds)
# Dataset({
#     features: ['text', 'id', 'url', 'source', 'quality_score'],
#     num_rows: 2500000
# })

# Filter by source
stem = ds.filter(lambda x: x["source"] == "fineweb_edu_high")

# Filter by quality
high_quality = ds.filter(lambda x: x["quality_score"] is not None and x["quality_score"] >= 4.0)
```
## Intended Use
This dataset provides high-quality, diverse English text suitable for:
- Synthetic data generation (e.g., chain-of-thought, instruction tuning)
- Fine-tuning language models across multiple domains
- Text analysis and NLP research
- Domain-specific data extraction (legal, scientific, educational, technical)
The domain diversity covers STEM, legal reasoning, scientific literature, technical Q&A, encyclopedic knowledge, and general commonsense explanations.
## Limitations
- Quality scores are only available for FineWeb EDU and DCLM sources; Pile subsets have `null` quality scores
- URLs are only available for FineWeb EDU and DCLM sources
- Text is English-only
- The dataset inherits any biases present in the upstream sources
## License
This dataset is released under **ODC-By** (Open Data Commons Attribution License), consistent with the upstream source licenses:
- FineWeb EDU: ODC-By
- DCLM-baseline: ODC-By
- Pile (uncopyrighted subsets): Public domain / permissive
## Citation
```bibtex
@dataset{diverse_2.5m,
title={Diverse Source Text Dataset},
author={blythet},
year={2025},
url={https://huggingface.co/datasets/blythet/diverse-2.5m},
note={2.5M curated, deduplicated multi-domain English texts}
}
```
Provider:
blythet



