blythet/diverse-2.5m

Name: blythet/diverse-2.5m
Creator: blythet
Published: 2026-02-22 02:46:27
License: odc-by (per the dataset card)

Hugging Face · Updated 2026-02-22 · Indexed 2026-04-12

Download link:

https://hf-mirror.com/datasets/blythet/diverse-2.5m

Description:

---
license: odc-by
task_categories:
- text-generation
language:
- en
size_categories:
- 1M<n<10M
tags:
- diverse
- curated
- deduplication
- multi-domain
- stem
- legal
- scientific
- encyclopedic
- source-text
configs:
- config_name: default
  data_files:
  - split: train
    path: cot_diverse_2.5m.parquet
pretty_name: Diverse Source Text Dataset (2.5M)
dataset_info:
  features:
  - name: text
    dtype: string
  - name: id
    dtype: string
  - name: url
    dtype: string
  - name: source
    dtype: string
  - name: quality_score
    dtype: float64
  splits:
  - name: train
    num_examples: 2500000
---

# Diverse Source Text Dataset (2.5M)

A curated, deduplicated, multi-domain English text dataset blending 7 sources across STEM, legal, scientific, encyclopedic, Q&A, and general knowledge domains. Designed as high-quality, diverse source material for downstream NLP tasks such as synthetic data generation, fine-tuning, and text analysis.

## Dataset Summary

| | |
|---|---|
| **Total samples** | 2,500,000 |
| **Estimated tokens** | ~2.8B (GPT-2) / ~2.4B (modern tokenizers) |
| **Language** | English |
| **Format** | Parquet (ZSTD compressed) |
| **File size** | 4.28 GB |
| **Text length** | 200 - 50,000 characters |
| **Mean length** | 4,656 characters (~1,107 tokens) |
| **Median length** | 2,439 characters |

## Source Breakdown

| Source | Samples | Share | Avg Chars | Avg Tok/Doc | Quality Score | Domain |
|--------|--------:|------:|----------:|------------:|--------------:|--------|
| FineWeb EDU (broad, 3.0-4.0) | 750,000 | 30% | 4,997 | 1,063 | 3.39 | General educational |
| DCLM-baseline | 500,000 | 20% | 2,295 | 572 | 0.89 | Commonsense / explanatory |
| FineWeb EDU (high, >= 4.0) | 375,000 | 15% | 4,923 | 1,023 | 4.18 | STEM / high-quality educational |
| Pile - FreeLaw | 250,000 | 10% | 14,458 | 3,781 | N/A | Legal (court opinions, filings) |
| Pile - PubMed Abstracts | 250,000 | 10% | 1,335 | 292 | N/A | Biomedical / scientific |
| Pile - StackExchange | 200,000 | 8% | 2,190 | 761 | N/A | Technical Q&A |
| Pile - Wikipedia (en) | 175,000 | 7% | 2,923 | 685 | N/A | Encyclopedic |

## Schema

```
text: string            # The document text (200-50,000 chars)
id: string              # Unique document identifier from source
url: string             # Source URL (null for Pile sources)
source: string          # One of 7 source labels
quality_score: float64  # Source-specific quality score (null for Pile sources)
```

## Methodology

### Collection

- **FineWeb EDU**: Streamed from [HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) across 12 Common Crawl dumps, filtered by educational quality score
- **DCLM-baseline**: Streamed from [mlfoundations/dclm-baseline-1.0](https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0) with fasttext quality score >= 0.65
- **Pile subsets**: Streamed from [monology/pile-uncopyrighted](https://huggingface.co/datasets/monology/pile-uncopyrighted), filtered by subset name

### Filtering

- Minimum 200 characters, maximum 50,000 characters
- 20% over-fetch to absorb deduplication losses

### Deduplication (3-stage)

1. **Exact text dedup**: MD5 hash of normalized text (lowercased, whitespace-collapsed) - removed 70,433 (2.3%)
2. **URL dedup**: Normalized URL matching - removed 19,283
3. **Near-dedup (anchor pairs)**: Three passes using MD5 hashes of text start/mid/end 500-char anchors - removed 3,353

Total removed: 93,069 / 3,000,000 (3.1%)

### Final Assembly

- Each source trimmed to exact target count, prioritizing highest quality scores
- Globally shuffled via deterministic hash (seed=42)
- Written as single Parquet file with ZSTD compression

## Usage

```python
from datasets import load_dataset

ds = load_dataset("blythet/diverse-2.5m", split="train")
print(ds)
# Dataset({
#     features: ['text', 'id', 'url', 'source', 'quality_score'],
#     num_rows: 2500000
# })

# Filter by source
stem = ds.filter(lambda x: x["source"] == "fineweb_edu_high")

# Filter by quality
high_quality = ds.filter(lambda x: x["quality_score"] is not None and x["quality_score"] >= 4.0)
```

## Intended Use

This dataset provides high-quality, diverse English text suitable for:

- Synthetic data generation (e.g., chain-of-thought, instruction tuning)
- Fine-tuning language models across multiple domains
- Text analysis and NLP research
- Domain-specific data extraction (legal, scientific, educational, technical)

The domain diversity covers STEM, legal reasoning, scientific literature, technical Q&A, encyclopedic knowledge, and general commonsense explanations.
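For readers who want to apply a similar cleanup to their own data, the three deduplication checks described under Methodology can be approximated as below. This is a minimal sketch, not the author's pipeline: the function names (`normalize_text`, `normalize_url`, `anchor_pairs`, `dedup`) are hypothetical, the URL normalization rules are assumed, and reading "anchor pairs" as the three pairwise combinations of start/mid/end anchor hashes is an interpretation of the card. The sketch also applies the three checks per document in a single pass rather than as separate stages.

```python
import hashlib
import re
from itertools import combinations

ANCHOR_LEN = 500  # anchor size stated in the dataset card


def md5(s):
    return hashlib.md5(s.encode("utf-8")).hexdigest()


def normalize_text(text):
    # Lowercase and collapse whitespace runs, as the card describes.
    return re.sub(r"\s+", " ", text.lower()).strip()


def normalize_url(url):
    # ASSUMPTION: drop fragment, query, scheme, "www." and trailing slash.
    url = url.lower().split("#")[0].split("?")[0]
    url = re.sub(r"^https?://(www\.)?", "", url)
    return url.rstrip("/")


def anchor_pairs(norm):
    # Hash the start, middle, and end anchors of the normalized text.
    mid = max(0, len(norm) // 2 - ANCHOR_LEN // 2)
    anchors = [norm[:ANCHOR_LEN], norm[mid:mid + ANCHOR_LEN], norm[-ANCHOR_LEN:]]
    hashes = [md5(a) for a in anchors]
    # ASSUMPTION: one key per anchor pair: (start, mid), (start, end), (mid, end).
    return [f"{i}{j}:{hashes[i]}{hashes[j]}" for i, j in combinations(range(3), 2)]


def dedup(docs):
    """Keep the first occurrence of each document; docs are dicts with 'text' and optional 'url'."""
    seen_text, seen_url, seen_pairs = set(), set(), set()
    kept = []
    for doc in docs:
        norm = normalize_text(doc["text"])
        text_key = md5(norm)
        url_key = normalize_url(doc["url"]) if doc.get("url") else None
        pairs = anchor_pairs(norm)

        if text_key in seen_text:                       # 1. exact text match
            continue
        if url_key is not None and url_key in seen_url:  # 2. URL match
            continue
        if any(p in seen_pairs for p in pairs):          # 3. two anchors in common
            continue

        seen_text.add(text_key)
        if url_key is not None:
            seen_url.add(url_key)
        seen_pairs.update(pairs)
        kept.append(doc)
    return kept
```

Calling `dedup(records)` on a list of dicts shaped like the schema above returns the surviving records in their original order. Requiring two matching anchors rather than one keeps documents that merely share a common header or footer, at the cost of missing near-duplicates that differ in two of the three regions.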
## Limitations

- Quality scores are only available for FineWeb EDU and DCLM sources; Pile subsets have `null` quality scores
- URLs are only available for FineWeb EDU and DCLM sources
- Text is English-only
- The dataset inherits any biases present in the upstream sources

## License

This dataset is released under **ODC-By** (Open Data Commons Attribution License), consistent with the upstream source licenses:

- FineWeb EDU: ODC-By
- DCLM-baseline: ODC-By
- Pile (uncopyrighted subsets): Public domain / permissive

## Citation

```bibtex
@dataset{diverse_2.5m,
  title={Diverse Source Text Dataset},
  author={blythet},
  year={2025},
  url={https://huggingface.co/datasets/blythet/diverse-2.5m},
  note={2.5M curated, deduplicated multi-domain English texts}
}
```
