blythet/diverse-source-3m
Hugging Face, updated 2026-02-23, indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/blythet/diverse-source-3m
Description:
---
license: odc-by
task_categories:
- text-generation
- text-classification
language:
- en
size_categories:
- 1M<n<10M
tags:
- diverse
- curated
- deduplication
- multi-domain
- quality-scored
- fineweb
- pretraining
- web-text
- multi-source
configs:
- config_name: default
  data_files:
  - split: train
    path: data/diverse_text_3m.parquet
pretty_name: Diverse Quality-Scored English Text (2.9M)
dataset_info:
  features:
  - name: text
    dtype: string
  - name: id
    dtype: string
  - name: url
    dtype: string
  - name: source
    dtype: string
  - name: quality
    dtype: float32
  - name: edu_score
    dtype: float32
  - name: reasoning_score
    dtype: float32
  - name: topic
    dtype: string
  - name: doc_type
    dtype: string
  - name: word_count
    dtype: int32
  splits:
  - name: train
    num_examples: 2887868
---
# Diverse Quality-Scored English Text (2.9M)
**2.9 million** quality-scored, deduplicated English documents from **12 sources** spanning STEM, law, medicine, math, code, news, philosophy, and general web text — each tagged with topic, document type, and three quality signals.
## The problem with existing datasets
Most large-scale text datasets have a quality-diversity tradeoff:
- **FineWeb EDU >= 4.0** scores high on benchmarks but is [heavily STEM/textbook biased](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier). The team themselves warn it *"might overfit to academic looking content."* Training on it improves MMLU/ARC but **degrades** commonsense benchmarks (HellaSwag, PIQA).
- **Generic web crawls** (C4, DCLM) have breadth but no quality signal — you're filtering blind.
- **Domain-specific corpora** (PubMed, FreeLaw) are high quality but narrow.
This dataset addresses that tradeoff by blending all three types and scoring every document on the same scale (**unified quality scoring**), so you can filter by quality without giving up diversity.
## What you get
- **12 sources** with intentional balance — STEM is capped, not dominant
- **3 quality signals** on every row: `quality` (combined 0-1), `edu_score` (0-5, Nemotron-4), `reasoning_score` (0-1)
- **12 topic labels** and **7 document type labels** for precise filtering
- **3-stage deduplication** (exact hash + URL + near-duplicate) across all sources
- **Single parquet file**, ZSTD compressed, works with DuckDB/Pandas/HF datasets out of the box
## Quick start
```python
from datasets import load_dataset
ds = load_dataset("blythet/diverse-source-3m", split="train")
# Top 25% by quality
good = ds.filter(lambda x: x["quality"] >= 0.35)
# Just medical text
medical = ds.filter(lambda x: x["topic"] == "medicine")
# Q&A format documents with strong reasoning
qa_reasoning = ds.filter(lambda x: x["doc_type"] == "q_and_a" and x["reasoning_score"] >= 0.3)
```
DuckDB for fast analytics. The `hf://` path lets DuckDB read the parquet file straight from the Hub, so no manual download is required:
```python
import duckdb
df = duckdb.query("""
SELECT * FROM 'hf://datasets/blythet/diverse-source-3m/data/*.parquet'
WHERE quality >= 0.35 AND topic = 'science'
""").df()
```
## Columns
| Column | Type | Description |
|---|---|---|
| `text` | string | The document text |
| `id` | string | Unique document identifier |
| `url` | string | Source URL (null for domain-specific corpora such as PubMed and FreeLaw) |
| `source` | string | Which of the 12 source datasets this came from |
| `quality` | float (0-1) | **Overall quality score.** Combines educational value and reasoning structure. Higher = better. Use this for filtering. |
| `edu_score` | float (0-5) | Educational quality from [Nemotron-4 classifier](https://huggingface.co/nvidia/nemocurator-fineweb-nemotron-4-edu-classifier) — coherence, informativeness, writing quality. |
| `reasoning_score` | float (0-1) | Density of reasoning structure (causal, logical, procedural markers). |
| `topic` | string | Subject area — one of 12 categories (see below). |
| `doc_type` | string | Document structure: `expository`, `q_and_a`, `tutorial`, `argument`, `explanation`, `narrative`, or `reference`. |
| `word_count` | int | Word count |
### How `quality` is calculated
```
quality = 0.70 * (edu_score / 5.0) + 0.30 * reasoning_score
```
A single 0-1 number balancing educational value (70%) with reasoning structure (30%). The raw scores are included if you want to weight them differently.
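If you want a different balance, the score is easy to recompute from the raw columns. The sketch below is a minimal illustration; `combined_quality` and its `edu_weight` parameter are names introduced here, not part of the dataset or any library.

```python
def combined_quality(edu_score: float, reasoning_score: float,
                     edu_weight: float = 0.70) -> float:
    """Blend edu_score (0-5) and reasoning_score (0-1) into one 0-1 score.

    With the default edu_weight of 0.70 this reproduces the formula above;
    the remaining weight (1 - edu_weight) goes to reasoning_score.
    """
    if not 0.0 <= edu_weight <= 1.0:
        raise ValueError("edu_weight must be between 0 and 1")
    return edu_weight * (edu_score / 5.0) + (1.0 - edu_weight) * reasoning_score


# Check against the source-level averages listed for fineweb_edu_broad
# (edu 2.46, reasoning 0.22) in the Sources table.
print(round(combined_quality(2.46, 0.22), 2))  # 0.41
```

To add a reweighted column with HF datasets, something like `ds.map(lambda x: {"quality_50_50": combined_quality(x["edu_score"], x["reasoning_score"], 0.5)})` should work; the column name `quality_50_50` is only an example.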
## Sources
| Source | Count | Avg Quality | Avg Edu | Avg Reasoning | Avg Words | What it is |
|---|---|---|---|---|---|---|
| `fineweb_edu_broad` | 818,836 | 0.41 | 2.46 | 0.22 | 634 | FineWeb EDU score 3.0-3.99 — broad web text at the dev-recommended quality threshold |
| `dclm_baseline` | 575,000 | 0.27 | 1.47 | 0.21 | 349 | DCLM-baseline — commonsense and explanatory text (ELI5+OpenHermes quality signal) |
| `fineweb_edu_high` | 291,751 | 0.47 | 2.87 | 0.22 | 595 | FineWeb EDU score >= 4.0 — high-quality STEM, **capped** to prevent over-representation |
| `pile_pubmed` | 242,501 | 0.28 | 1.50 | 0.23 | 184 | PubMed abstracts — hypothesis, evidence, conclusion format |
| `pile_freelaw` | 200,000 | 0.09 | 0.24 | 0.18 | 1,763 | Court opinions — natural chains of legal reasoning |
| `pile_wikipedia` | 164,775 | 0.27 | 1.59 | 0.16 | 376 | Wikipedia EN — history, arts, social science, geography |
| `pile_stackexchange` | 162,083 | 0.22 | 1.05 | 0.26 | 251 | StackExchange — problem, diagnosis, solution across 170+ communities |
| `open_web_math` | 150,000 | 0.28 | 1.49 | 0.24 | 845 | Mathematical content, proofs, and derivations |
| `ccnews` | 125,000 | 0.18 | 0.82 | 0.21 | 432 | CC-News — journalism and current events |
| `the_stack_code` | 123,533 | 0.12 | 0.75 | 0.05 | 280 | Source code in Python, JavaScript, Rust, and Go |
| `philpapers` | 20,246 | 0.21 | 1.26 | 0.13 | 3,874 | Academic philosophy papers |
| `sec_finance` | 14,143 | 0.17 | 1.10 | 0.06 | 3,144 | SEC financial filings |
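> **Note on FreeLaw's low edu_score:** Court opinions average 0.24 on `edu_score` because the Nemotron-4 classifier penalizes legal boilerplate, even though they contain natural chains of legal reasoning. The combined `quality` score blends both signals, but with `edu_score` weighted at 70% FreeLaw still averages only 0.09. A plain `quality` threshold is therefore likely to drop most of it; filter on `source` or `topic` instead if you want to keep legal text.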
## Topics
| Topic | Count | % | Description |
|---|---|---|---|
| general | 1,237,260 | 42.8% | Broad web text not matching a specific domain |
| technology | 393,775 | 13.6% | Software, hardware, programming, IT |
| medicine | 356,307 | 12.3% | Clinical, biomedical, health |
| law | 210,363 | 7.3% | Legal opinions, statutes, case law |
| encyclopedia | 172,377 | 6.0% | Wikipedia-style reference and general knowledge |
| mathematics | 161,237 | 5.6% | Proofs, equations, mathematical reasoning |
| news | 127,533 | 4.4% | Journalism and current events |
| education | 77,122 | 2.7% | Teaching materials, courses, curricula |
| history | 58,940 | 2.0% | Historical events, periods, civilizations |
| science | 44,808 | 1.6% | Natural sciences, experiments, research |
| finance | 25,707 | 0.9% | Markets, investments, financial filings |
| philosophy | 22,439 | 0.8% | Ethics, epistemology, philosophical argument |
Topics are assigned by source for domain-specific corpora (e.g., all FreeLaw docs are `law`) and by URL domain + keyword classification for web sources.
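The two-tier rule above can be sketched as a lookup by source followed by URL-domain and keyword fallbacks. This is only an illustration of the approach: the function name `assign_topic` is hypothetical, and apart from the FreeLaw-to-`law` mapping stated above, every table entry below is an illustrative guess rather than the dataset's actual rule set.

```python
from urllib.parse import urlsplit

# Stated in the text: all FreeLaw docs are `law`.
# The other entries in these three tables are ASSUMED for illustration only.
SOURCE_TOPIC = {
    "pile_freelaw": "law",
    "pile_pubmed": "medicine",       # assumption
    "open_web_math": "mathematics",  # assumption
}
DOMAIN_TOPIC = {"nih.gov": "medicine", "stackoverflow.com": "technology"}
KEYWORD_TOPIC = {"theorem": "mathematics", "lawsuit": "law", "stock market": "finance"}


def assign_topic(source: str, url, text: str) -> str:
    """Return a topic label: by source first, then URL domain, then keywords."""
    # Tier 1: domain-specific corpora get a fixed topic.
    if source in SOURCE_TOPIC:
        return SOURCE_TOPIC[source]
    # Tier 2a: match the URL host against known domains (including subdomains).
    if url:
        host = urlsplit(url).netloc.lower()
        for domain, topic in DOMAIN_TOPIC.items():
            if host == domain or host.endswith("." + domain):
                return topic
    # Tier 2b: simple keyword matching on the lowercased text.
    lowered = text.lower()
    for keyword, topic in KEYWORD_TOPIC.items():
        if keyword in lowered:
            return topic
    # Nothing matched: fall back to the catch-all label.
    return "general"
```

A rule set like this falls through to `general` whenever no domain or keyword matches, which is consistent with `general` being the largest topic.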
## Document Types
| Type | Count | % | Description |
|---|---|---|---|
| expository | 2,237,151 | 77.5% | Informational prose — articles, explanations, descriptions |
| q_and_a | 465,359 | 16.1% | Question-and-answer format |
| explanation | 60,688 | 2.1% | Explicit explanatory structure ("this means...", "for example...") |
| argument | 58,594 | 2.0% | Argumentative structure ("therefore...", "it follows that...") |
| tutorial | 45,020 | 1.6% | Step-by-step instructions |
| narrative | 19,874 | 0.7% | Story-like structure with characters and events |
| reference | 1,182 | <0.1% | Dictionary/encyclopedia definitions |
## Deduplication
Three-stage deduplication across all 12 sources:
1. **Exact text hash** — MD5 of normalized text (lowercased, whitespace-collapsed). Removed 84,128 duplicates (2.7%).
2. **URL dedup** — Normalized URL matching. Removed 44,846.
3. **Anchor-pair near-dedup** — Documents sharing 2 of 3 anchor hashes (first/middle/last 500 chars) are near-duplicates. Removed 7,570.
When duplicates appeared across sources, specialized corpora (FreeLaw, PubMed) were kept over generic web text.
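The three stages can be sketched in a single pass as below. This is a simplified illustration, not the pipeline that built the dataset: the URL normalization rules, the "at least 2 of 3" reading of the anchor rule, and all function names are assumptions. Because the first copy seen wins, feeding specialized corpora before web text would reproduce the stated preference.

```python
import hashlib
import re
from itertools import combinations
from urllib.parse import urlsplit


def _md5(s: str) -> str:
    return hashlib.md5(s.encode("utf-8")).hexdigest()


def normalize_text(text: str) -> str:
    """Lowercase and collapse whitespace, as described for stage 1."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def normalize_url(url: str) -> str:
    """ASSUMED rules: drop scheme, 'www.', query, fragment, trailing slash."""
    parts = urlsplit(url.strip().lower())
    host = parts.netloc[4:] if parts.netloc.startswith("www.") else parts.netloc
    return host + parts.path.rstrip("/")


def anchor_pairs(text: str, size: int = 500) -> set:
    """Hash the first/middle/last `size` chars; return the 3 position-tagged pairs."""
    mid = max(0, len(text) // 2 - size // 2)
    anchors = [_md5(text[:size]), _md5(text[mid:mid + size]), _md5(text[-size:])]
    # Sharing any one pair means sharing at least 2 of the 3 anchors.
    return {(i, anchors[i], j, anchors[j]) for i, j in combinations(range(3), 2)}


def dedup(docs):
    """Yield docs (dicts with 'text' and optional 'url') that survive all stages."""
    seen_text, seen_url, seen_pairs = set(), set(), set()
    for doc in docs:
        norm = normalize_text(doc["text"])
        text_key = _md5(norm)
        if text_key in seen_text:            # stage 1: exact text hash
            continue
        url_key = normalize_url(doc["url"]) if doc.get("url") else None
        if url_key is not None and url_key in seen_url:  # stage 2: URL match
            continue
        pairs = anchor_pairs(norm)
        if pairs & seen_pairs:               # stage 3: anchor-pair near-dup
            continue
        seen_text.add(text_key)
        if url_key is not None:
            seen_url.add(url_key)
        seen_pairs |= pairs
        yield doc
```

Keying on anchor pairs rather than single anchors lets the near-duplicate check run as a set lookup instead of an all-against-all comparison.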
## Quality scoring details
**`edu_score`** comes from [`nvidia/nemocurator-fineweb-nemotron-4-edu-classifier`](https://huggingface.co/nvidia/nemocurator-fineweb-nemotron-4-edu-classifier), a classifier trained on Nemotron-4-340B-Instruct annotations. It is the same classifier used in the Nemotron-CC pipeline, whose authors report a 5-point MMLU gain over Llama 3.1 8B. It scores each document from 0 to 5 on educational quality, coherence, and informativeness.
**`reasoning_score`** measures density of reasoning markers per 1,000 words — causal connectives (because, since, due to), logical connectives (therefore, thus, hence, it follows), procedural markers (first, next, finally), and contrastive markers (however, on the other hand). Normalized to 0-1.
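A rough sketch of a marker-density score of this kind follows. The marker list is taken from the description above, but the normalization is not documented, so the cap `SATURATION_PER_1K` (the density that maps to 1.0) is a hypothetical value chosen for illustration. The function names are also illustrative, and the real scorer may count markers differently.

```python
import re

# Marker phrases listed in the description above.
MARKERS = [
    "because", "since", "due to",                # causal
    "therefore", "thus", "hence", "it follows",  # logical
    "first", "next", "finally",                  # procedural
    "however", "on the other hand",              # contrastive
]
_MARKER_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(m) for m in MARKERS) + r")\b",
    re.IGNORECASE,
)

# HYPOTHETICAL: markers per 1,000 words at which the score saturates at 1.0.
SATURATION_PER_1K = 50.0


def reasoning_density(text: str) -> float:
    """Number of reasoning markers per 1,000 whitespace-separated words."""
    n_words = len(text.split())
    if n_words == 0:
        return 0.0
    return 1000.0 * len(_MARKER_RE.findall(text)) / n_words


def reasoning_score(text: str) -> float:
    """Density scaled into 0-1 by the assumed cap, clipped at 1.0."""
    return min(1.0, reasoning_density(text) / SATURATION_PER_1K)
```

Plain keyword matching of this sort cannot tell the causal "since" from the temporal one, so the score is best read as a coarse structural signal.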
## Use cases
- **Pretraining data** — quality-filtered, deduplicated, multi-domain English text ready for LLM training
- **Fine-tuning** — use `topic` and `doc_type` to build domain-specific training sets
- **Synthetic data generation** — sample balanced subsets by quality, topic, or structure for LLM-generated annotations
- **Data quality research** — study how quality signals vary across web and domain-specific text
- **Retrieval/embedding training** — diverse document types and topics for broad coverage
The metadata columns let you filter precisely: high-quality medical Q&A, argumentative philosophy text, tutorial-style STEM content, etc.
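Combined filters like these can be packaged as a reusable predicate. The helper below is a small sketch; `make_row_filter` is a name introduced here, and the toy rows only mimic the dataset's column names.

```python
def make_row_filter(topic=None, doc_type=None, min_quality=0.0, min_reasoning=0.0):
    """Build a predicate over row dicts using the dataset's metadata columns.

    Criteria left as None (or 0.0) are not applied.
    """
    def keep(row) -> bool:
        return (
            (topic is None or row["topic"] == topic)
            and (doc_type is None or row["doc_type"] == doc_type)
            and row["quality"] >= min_quality
            and row["reasoning_score"] >= min_reasoning
        )
    return keep


# Toy rows for demonstration only; real rows come from the dataset.
rows = [
    {"id": "a", "topic": "medicine", "doc_type": "q_and_a", "quality": 0.42, "reasoning_score": 0.31},
    {"id": "b", "topic": "medicine", "doc_type": "expository", "quality": 0.50, "reasoning_score": 0.20},
    {"id": "c", "topic": "medicine", "doc_type": "q_and_a", "quality": 0.20, "reasoning_score": 0.40},
    {"id": "d", "topic": "philosophy", "doc_type": "argument", "quality": 0.38, "reasoning_score": 0.35},
]

medical_qa = make_row_filter(topic="medicine", doc_type="q_and_a", min_quality=0.35)
print([r["id"] for r in rows if medical_qa(r)])  # ['a']
```

Since the predicate takes a plain row dict, it can also be passed directly to `ds.filter(medical_qa)`.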
## Limitations
- `edu_score` is biased toward academic/educational content — legal text and code score low despite containing strong reasoning
- Topic classification for web sources uses URL domain + keyword matching, not a trained classifier (hence 42.8% "general")
- English only
- Inherits biases from upstream sources (FineWeb, DCLM, The Pile, etc.)
## License
Released under **ODC-By** (Open Data Commons Attribution License).
## Citation
```bibtex
@dataset{diverse_source_3m,
title={Diverse Quality-Scored English Text},
author={blythet},
year={2026},
url={https://huggingface.co/datasets/blythet/diverse-source-3m}
}
```
Provider:
blythet



