blythet/diverse-source-3m

Name: blythet/diverse-source-3m
Creator: blythet
Published: 2026-02-23 01:20:18
License: ODC-By (per the dataset card; the listing page gives no description)

Hugging Face | Updated 2026-02-23 | Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/blythet/diverse-source-3m

Description:

---
license: odc-by
task_categories:
- text-generation
- text-classification
language:
- en
size_categories:
- 1M<n<10M
tags:
- diverse
- curated
- deduplication
- multi-domain
- quality-scored
- fineweb
- pretraining
- web-text
- multi-source
configs:
- config_name: default
  data_files:
  - split: train
    path: data/diverse_text_3m.parquet
pretty_name: Diverse Quality-Scored English Text (2.9M)
dataset_info:
  features:
  - name: text
    dtype: string
  - name: id
    dtype: string
  - name: url
    dtype: string
  - name: source
    dtype: string
  - name: quality
    dtype: float32
  - name: edu_score
    dtype: float32
  - name: reasoning_score
    dtype: float32
  - name: topic
    dtype: string
  - name: doc_type
    dtype: string
  - name: word_count
    dtype: int32
  splits:
  - name: train
    num_examples: 2887868
---

# Diverse Quality-Scored English Text (2.9M)

**2.9 million** quality-scored, deduplicated English documents from **12 sources** spanning STEM, law, medicine, math, code, news, philosophy, and general web text. Each document is tagged with topic, document type, and three quality signals.

## The problem with existing datasets

Most large-scale text datasets have a quality-diversity tradeoff:

- **FineWeb EDU >= 4.0** scores high on benchmarks but is [heavily STEM/textbook biased](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier). The team themselves warn it *"might overfit to academic looking content."* Training on it improves MMLU/ARC but **degrades** commonsense benchmarks (HellaSwag, PIQA).
- **Generic web crawls** (C4, DCLM) have breadth but no quality signal, so you're filtering blind.
- **Domain-specific corpora** (PubMed, FreeLaw) are high quality but narrow.

This dataset solves that by blending all three types with **unified quality scoring** across every document, so you can filter by quality without losing diversity.

## What you get

- **12 sources** with intentional balance: STEM is capped, not dominant
- **3 quality signals** on every row: `quality` (combined 0-1), `edu_score` (0-5, Nemotron-4), `reasoning_score` (0-1)
- **12 topic labels** and **7 document type labels** for precise filtering
- **3-stage deduplication** (exact hash + URL + near-duplicate) across all sources
- **Single parquet file**, ZSTD compressed, works with DuckDB/Pandas/HF datasets out of the box

## Quick start

```python
from datasets import load_dataset

ds = load_dataset("blythet/diverse-source-3m", split="train")

# Top 25% by quality
good = ds.filter(lambda x: x["quality"] >= 0.35)

# Just medical text
medical = ds.filter(lambda x: x["topic"] == "medicine")

# Q&A format documents with strong reasoning
qa_reasoning = ds.filter(lambda x: x["doc_type"] == "q_and_a" and x["reasoning_score"] >= 0.3)
```

DuckDB for fast analytics (no download required):

```python
import duckdb

df = duckdb.query("""
    SELECT *
    FROM 'hf://datasets/blythet/diverse-source-3m/data/*.parquet'
    WHERE quality >= 0.35 AND topic = 'science'
""").df()
```

## Columns

| Column | Type | Description |
|---|---|---|
| `text` | string | The document text |
| `id` | string | Unique document identifier |
| `url` | string | Source URL (null for domain-specific corpora such as PubMed and FreeLaw) |
| `source` | string | Which of the 12 source datasets this came from |
| `quality` | float (0-1) | **Overall quality score.** Combines educational value and reasoning structure. Higher = better. Use this for filtering. |
| `edu_score` | float (0-5) | Educational quality from [Nemotron-4 classifier](https://huggingface.co/nvidia/nemocurator-fineweb-nemotron-4-edu-classifier): coherence, informativeness, writing quality. |
| `reasoning_score` | float (0-1) | Density of reasoning structure (causal, logical, procedural markers). |
| `topic` | string | Subject area, one of 12 categories (see below). |
| `doc_type` | string | Document structure: `expository`, `q_and_a`, `tutorial`, `argument`, `explanation`, `narrative`, or `reference`. |
| `word_count` | int | Word count |

### How `quality` is calculated

```
quality = 0.70 * (edu_score / 5.0) + 0.30 * reasoning_score
```

A single 0-1 number balancing educational value (70%) with reasoning structure (30%). The raw scores are included if you want to weight them differently.

## Sources

| Source | Count | Quality | Edu | Reasoning | Avg Words | What it is |
|---|---|---|---|---|---|---|
| `fineweb_edu_broad` | 818,836 | 0.41 | 2.46 | 0.22 | 634 | FineWeb EDU score 3.0-3.99: broad web text at the dev-recommended quality threshold |
| `dclm_baseline` | 575,000 | 0.27 | 1.47 | 0.21 | 349 | DCLM-baseline: commonsense and explanatory text (ELI5+OpenHermes quality signal) |
| `fineweb_edu_high` | 291,751 | 0.47 | 2.87 | 0.22 | 595 | FineWeb EDU score >= 4.0: high-quality STEM, **capped** to prevent over-representation |
| `pile_pubmed` | 242,501 | 0.28 | 1.50 | 0.23 | 184 | PubMed abstracts: hypothesis, evidence, conclusion format |
| `pile_freelaw` | 200,000 | 0.09 | 0.24 | 0.18 | 1,763 | Court opinions: natural chains of legal reasoning |
| `pile_wikipedia` | 164,775 | 0.27 | 1.59 | 0.16 | 376 | Wikipedia EN: history, arts, social science, geography |
| `pile_stackexchange` | 162,083 | 0.22 | 1.05 | 0.26 | 251 | StackExchange: problem, diagnosis, solution across 170+ communities |
| `open_web_math` | 150,000 | 0.28 | 1.49 | 0.24 | 845 | Mathematical content, proofs, and derivations |
| `ccnews` | 125,000 | 0.18 | 0.82 | 0.21 | 432 | CC-News: journalism and current events |
| `the_stack_code` | 123,533 | 0.12 | 0.75 | 0.05 | 280 | Source code in Python, JavaScript, Rust, and Go |
| `philpapers` | 20,246 | 0.21 | 1.26 | 0.13 | 3,874 | Academic philosophy papers |
| `sec_finance` | 14,143 | 0.17 | 1.10 | 0.06 | 3,144 | SEC financial filings |

> **Note on FreeLaw's low edu_score:** Court opinions score 0.24 on educational quality because the Nemotron-4 classifier penalizes legal boilerplate, but they contain strong natural reasoning chains. The combined `quality` score accounts for both signals.

## Topics

| Topic | Count | % | Description |
|---|---|---|---|
| general | 1,237,260 | 42.8% | Broad web text not matching a specific domain |
| technology | 393,775 | 13.6% | Software, hardware, programming, IT |
| medicine | 356,307 | 12.3% | Clinical, biomedical, health |
| law | 210,363 | 7.3% | Legal opinions, statutes, case law |
| encyclopedia | 172,377 | 6.0% | Wikipedia-style reference and general knowledge |
| mathematics | 161,237 | 5.6% | Proofs, equations, mathematical reasoning |
| news | 127,533 | 4.4% | Journalism and current events |
| education | 77,122 | 2.7% | Teaching materials, courses, curricula |
| history | 58,940 | 2.0% | Historical events, periods, civilizations |
| science | 44,808 | 1.6% | Natural sciences, experiments, research |
| finance | 25,707 | 0.9% | Markets, investments, financial filings |
| philosophy | 22,439 | 0.8% | Ethics, epistemology, philosophical argument |

Topics are assigned by source for domain-specific corpora (e.g., all FreeLaw docs are `law`) and by URL domain + keyword classification for web sources.

## Document Types

| Type | Count | % | Description |
|---|---|---|---|
| expository | 2,237,151 | 77.5% | Informational prose: articles, explanations, descriptions |
| q_and_a | 465,359 | 16.1% | Question-and-answer format |
| explanation | 60,688 | 2.1% | Explicit explanatory structure ("this means...", "for example...") |
| argument | 58,594 | 2.0% | Argumentative structure ("therefore...", "it follows that...") |
| tutorial | 45,020 | 1.6% | Step-by-step instructions |
| narrative | 19,874 | 0.7% | Story-like structure with characters and events |
| reference | 1,182 | <0.1% | Dictionary/encyclopedia definitions |

## Deduplication

Three-stage deduplication across all 12 sources:

1. **Exact text hash**: MD5 of normalized text (lowercased, whitespace-collapsed). Removed 84,128 duplicates (2.7%).
2. **URL dedup**: Normalized URL matching. Removed 44,846.
3. **Anchor-pair near-dedup**: Documents sharing 2 of 3 anchor hashes (first/middle/last 500 chars) are treated as near-duplicates. Removed 7,570.

When duplicates appeared across sources, specialized corpora (FreeLaw, PubMed) were kept over generic web text.

## Quality scoring details

**`edu_score`** comes from [`nvidia/nemocurator-fineweb-nemotron-4-edu-classifier`](https://huggingface.co/nvidia/nemocurator-fineweb-nemotron-4-edu-classifier), which was trained on Nemotron-4-340B-Instruct annotations. This is the same classifier used in the Nemotron-CC pipeline, which achieved +5 MMLU over Llama 3.1. Scores range from 0 to 5 based on educational quality, coherence, and informativeness.

**`reasoning_score`** measures the density of reasoning markers per 1,000 words: causal connectives (because, since, due to), logical connectives (therefore, thus, hence, it follows), procedural markers (first, next, finally), and contrastive markers (however, on the other hand). Normalized to 0-1.

## Use cases

- **Pretraining data**: quality-filtered, deduplicated, multi-domain English text ready for LLM training
- **Fine-tuning**: use `topic` and `doc_type` to build domain-specific training sets
- **Synthetic data generation**: sample balanced subsets by quality, topic, or structure for LLM-generated annotations
- **Data quality research**: study how quality signals vary across web and domain-specific text
- **Retrieval/embedding training**: diverse document types and topics for broad coverage

The metadata columns let you filter precisely: high-quality medical Q&A, argumentative philosophy text, tutorial-style STEM content, etc.
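
As a rough illustration of that kind of filtering, the sketch below recomputes `quality` from the raw scores with adjustable weights and combines it with `topic` and `doc_type`. Only the column names and the default 0.70/0.30 weights come from this card; the helper names (`blended_quality`, `make_filter`), the 50/50 re-weighting, and the sample rows are hypothetical and exist purely for demonstration.

```python
def blended_quality(row, edu_weight=0.70, reasoning_weight=0.30):
    """Recompute the card's quality formula; the weights are adjustable."""
    return edu_weight * (row["edu_score"] / 5.0) + reasoning_weight * row["reasoning_score"]


def make_filter(topic=None, doc_type=None, min_quality=0.0, **weights):
    """Build a row predicate (hypothetical helper, not part of the dataset)."""
    def keep(row):
        if topic is not None and row["topic"] != topic:
            return False
        if doc_type is not None and row["doc_type"] != doc_type:
            return False
        return blended_quality(row, **weights) >= min_quality
    return keep


# Made-up rows standing in for dataset records (illustrative values only).
sample_rows = [
    {"topic": "medicine", "doc_type": "q_and_a", "edu_score": 3.0, "reasoning_score": 0.4},
    {"topic": "medicine", "doc_type": "expository", "edu_score": 3.5, "reasoning_score": 0.2},
    {"topic": "philosophy", "doc_type": "argument", "edu_score": 1.0, "reasoning_score": 0.6},
]

# High-quality medical Q&A with the card's default weights.
medical_qa = make_filter(topic="medicine", doc_type="q_and_a", min_quality=0.35)

# Argumentative philosophy, weighting reasoning more heavily (assumed 50/50 split).
philosophy_args = make_filter(
    topic="philosophy", doc_type="argument", min_quality=0.35,
    edu_weight=0.5, reasoning_weight=0.5,
)

print([medical_qa(r) for r in sample_rows])
print([philosophy_args(r) for r in sample_rows])
```

Because each predicate takes a single row dictionary, it should also work when passed to the `datasets` library, for example `ds.filter(medical_qa)`. The philosophy row in the sample only clears the 0.35 threshold under the re-weighted formula, which is the sort of case where keeping the raw scores pays off.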

## Limitations

- `edu_score` is biased toward academic/educational content: legal text and code score low despite containing strong reasoning
- Topic classification for web sources uses URL domain + keyword matching, not a trained classifier (hence 42.8% "general")
- English only
- Inherits biases from upstream sources (FineWeb, DCLM, The Pile, etc.)

## License

Released under **ODC-By** (Open Data Commons Attribution License).

## Citation

```bibtex
@dataset{diverse_source_3m,
  title={Diverse Quality-Scored English Text},
  author={blythet},
  year={2026},
  url={https://huggingface.co/datasets/blythet/diverse-source-3m}
}
```

Provider:

blythet
