codelion/sutra-improved-100M

Name: codelion/sutra-improved-100M
Creator: codelion
Published: 2026-03-29 10:40:35
License: No description available

Hugging Face · Updated 2026-03-29 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/codelion/sutra-improved-100M


Resource description:

---
language:
- en
license: apache-2.0
size_categories:
- 100K<n<1M
task_categories:
- text-generation
tags:
- pretraining
- educational
- pedagogical
- synthetic
- sutra
- multi-domain
- self-improvement
pretty_name: Sutra Improved 100M
---

# Sutra Improved 100M

A self-improved pedagogical dataset for LLM pretraining, containing **413,899 entries** totaling **110,038,011 tokens (~110 million)**. This dataset was created by applying an iterative self-improvement process to the [Sutra-10B](https://huggingface.co/datasets/codelion/sutra-10B) dataset: each sample was rewritten using [Gemma-3-4B-IT](https://huggingface.co/google/gemma-3-4b-it), only the better version (original or rewritten) was kept, and the result was then deduplicated and quality-filtered.

## Dataset Description

This dataset explores **self-improvement** as a data curation strategy for pedagogical pretraining. Rather than generating new content from scratch, we take existing educational text from Sutra-10B and attempt to improve it through targeted rewriting.

The pipeline processed the first ~526K samples from the Sutra-10B dataset (which contains 10,193,029 entries total) sequentially, then applied deduplication and quality filtering to produce the final clean dataset. Each sample undergoes the following process:

1. **Prefix-suffix splitting**: The text is tokenized using a GPT-2 tokenizer. The first 128 tokens form the prefix (context), and the next 128 tokens form the suffix (target for improvement).
2. **Rewriting**: The suffix is rewritten by Gemma-3-4B-IT with instructions to make it more accurate and educational, conditioned on the prefix as context.
3. **Quality scoring**: Both the original and rewritten suffixes are scored using a heuristic quality metric based on vocabulary diversity and sentence completion.
4. **Selection**: The higher-scoring version is kept. The prefix and best suffix are concatenated to form the final text.
5. **Cleaning**: The dataset is post-processed to remove exact duplicates, near-duplicates (matching first 200 characters), short entries (<200 characters), and boilerplate content.

Because the higher-scoring version is always kept, each entry's suffix scores at least as high as the original under the heuristic metric: by that measure it can only stay the same or improve.

## Dataset Statistics

| Metric | Value |
|--------|-------|
| Total Entries | 413,899 |
| Total Tokens | 110,038,011 (~110M) |
| Avg Tokens/Entry | 266 |
| Improved (rewritten kept) | 114,295 (27.6%) |
| Original kept | 299,604 (72.4%) |
| Source Dataset | [codelion/sutra-10B](https://huggingface.co/datasets/codelion/sutra-10B) |
| Rewriting Model | [Gemma-3-4B-IT](https://huggingface.co/google/gemma-3-4b-it) |
| Tokenizer | GPT-2 (tiktoken) |

### Data Cleaning

| Step | Removed | Remaining |
|------|---------|-----------|
| Raw output | n/a | 525,920 |
| Short entries (<200 chars) | 521 | 525,399 |
| Boilerplate content | 322 | 525,077 |
| Exact duplicates | 60,546 | 464,531 |
| Near-duplicates (first 200 chars) | 50,632 | 413,899 |
| **Final** | **112,021 (21.3%)** | **413,899** |

### Skill Distribution

| Skill | Count | Percentage |
|-------|-------|------------|
| unknown | 269,306 | 65.1% |
| science_arc | 47,794 | 11.5% |
| reading_boolq | 30,121 | 7.3% |
| factual_truthfulqa | 24,689 | 6.0% |
| procedural_piqa | 17,927 | 4.3% |
| qa_general | 11,315 | 2.7% |
| math_gsm8k | 7,459 | 1.8% |
| narrative_hellaswag | 3,926 | 0.9% |
| general | 1,362 | 0.3% |

## Self-Improvement Pipeline

The self-improvement pipeline is implemented in a single Python script (`scripts/self_improve.py`) with the following key design decisions:

- **Prefix/Suffix Split**: 128 tokens prefix + 128 tokens suffix using GPT-2 tokenizer. Texts shorter than 256 tokens are skipped.
- **Rewriting Prompt**: A system prompt instructs the model to act as an expert editor, rewriting text to be more accurate and educational. Only the suffix is rewritten, preserving the original context.
- **Quality Heuristic**: A lightweight scoring function that evaluates vocabulary diversity (ratio of unique words) and sentence completion (ending punctuation). This enables fast, API-free comparison.
- **Parallel Processing**: 4 concurrent workers with automatic retry logic for API failures.
- **Resume Capability**: The pipeline automatically resumes from where it left off based on output file line count, enabling long-running generation across multiple sessions.
- **Streaming**: The source dataset is loaded in streaming mode to handle the 10B+ token source without requiring full download.

### Rewriting Model

The rewriting was performed using **Gemma-3-4B-IT** served via a local llama.cpp-compatible API endpoint. The model was chosen for its balance of quality and throughput at the 4B parameter scale, enabling cost-effective rewriting of hundreds of thousands of samples.

## Data Fields

Each entry contains 4 fields:

| Field | Type | Description |
|-------|------|-------------|
| `text` | string | The final text (prefix + best suffix) |
| `source` | string | Whether the best suffix was `"original"` or `"rewritten"` |
| `skill` | string | Skill category from the source dataset |
| `improved` | boolean | `true` if the rewritten version was selected |

## Example Entries

### Rewritten (improved) entry

```json
{
  "text": "The use of passive biocathodes could potentially hold the key to producing an environmentally sustainable approach for achieving combined waste water treatment and water desalinization... Microbial desalination cells (MDCs) represent a recent technological advancement where wastewater treatment and desalination occur concurrently within bioelectrochemical systems.",
  "source": "rewritten",
  "skill": "science_arc",
  "improved": true
}
```

### Original (kept) entry

```json
{
  "text": "On December 2, 1943, Germany launched an air attack on the Italian town of Bari on the Adriatic coast. The town was important strategically as it was a major shipping port...",
  "source": "original",
  "skill": "narrative_hellaswag",
  "improved": false
}
```

## Usage

```python
from datasets import load_dataset

# Load the full dataset
ds = load_dataset("codelion/sutra-improved-100M", split="train")

# Stream for large-scale training (kept under a separate name so the
# filters below run on the fully loaded dataset)
streamed_ds = load_dataset("codelion/sutra-improved-100M", split="train", streaming=True)

# Filter to only improved samples
improved_ds = ds.filter(lambda x: x["improved"])

# Filter by skill
science_ds = ds.filter(lambda x: x["skill"] == "science_arc")
```

## Intended Use

This dataset is designed for:

- **LLM Pretraining**: Self-improved educational content for foundational model training
- **Data Curation Research**: Studying self-improvement as a data quality strategy
- **Pedagogical AI**: Exploring how small models can improve educational text
- **Ablation Studies**: Comparing original vs. self-improved data for pretraining

## Related Datasets

- [sutra-10B](https://huggingface.co/datasets/codelion/sutra-10B): 10B token source dataset (parent)
- [sutra-1B](https://huggingface.co/datasets/codelion/sutra-1B): 1B token pretraining dataset
- [sutra-100M](https://huggingface.co/datasets/codelion/sutra-100M): 100M token subset
- [sutra-10M](https://huggingface.co/datasets/codelion/sutra-10M): 10M token seed dataset
- [sutra-30k-seeds](https://huggingface.co/datasets/codelion/sutra-30k-seeds): Seed concepts for knowledge graph
- [sutra-magpie-sft](https://huggingface.co/datasets/codelion/sutra-magpie-sft): SFT dataset for instruction tuning

## Citation

```bibtex
@article{sharma2026sutra,
  title={Scaling Pedagogical Pretraining: From Optimal Mixing to 10 Billion Tokens},
  author={Sharma, Asankhaya},
  year={2026},
  url={https://huggingface.co/blog/codelion/scaling-pedagogical-pretraining-10-billion-tokens}
}
```

## License

Apache 2.0
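As a supplement to the pipeline description in the card, the prefix/suffix split, heuristic scoring, and selection steps might look roughly like the sketch below. This is not the code in `scripts/self_improve.py`: the function names, the equal weighting of the two signals, and the tie-breaking rule are all assumptions made for illustration. The split operates on a plain list of tokens; in the card's pipeline these would come from the GPT-2 tokenizer, whereas any token list works here.

```python
from typing import List, Optional, Tuple

PREFIX_LEN = 128  # tokens of context, per the card
SUFFIX_LEN = 128  # tokens offered for rewriting, per the card


def split_prefix_suffix(
    tokens: List[str], prefix_len: int = PREFIX_LEN, suffix_len: int = SUFFIX_LEN
) -> Optional[Tuple[List[str], List[str]]]:
    """Return (prefix, suffix), or None when the text is too short."""
    if len(tokens) < prefix_len + suffix_len:
        return None  # the card says texts shorter than 256 tokens are skipped
    return tokens[:prefix_len], tokens[prefix_len:prefix_len + suffix_len]


def quality_score(text: str) -> float:
    """Hypothetical heuristic: vocabulary diversity plus a completion bonus."""
    words = text.lower().split()
    if not words:
        return 0.0
    diversity = len(set(words)) / len(words)  # ratio of unique words
    complete = 1.0 if text.rstrip().endswith((".", "!", "?")) else 0.0
    return 0.5 * diversity + 0.5 * complete  # equal weights are an assumption


def select_best(original: str, rewritten: str) -> Tuple[str, str]:
    """Keep the higher-scoring suffix; ties keep the original (an assumption)."""
    if quality_score(rewritten) > quality_score(original):
        return rewritten, "rewritten"
    return original, "original"
```

The second element returned by `select_best` corresponds to the `source` field, and `improved` would simply be `source == "rewritten"`. Because the comparison only uses this heuristic, "improved" means "scored higher", not that the rewrite was independently verified.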

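The cleaning step from the card (dropping short entries, exact duplicates, and near-duplicates that share their first 200 characters) can be sketched in a single pass as below. The function name `clean` and the counter keys are hypothetical, and the boilerplate filter is left out because the card does not state its criteria. Counters are kept per rule so the output can be compared against a table like the card's Data Cleaning table.

```python
from typing import Dict, Iterable, List, Tuple

MIN_CHARS = 200       # entries shorter than this are dropped, per the card
NEAR_DUP_CHARS = 200  # near-duplicates match on their first 200 characters


def clean(entries: Iterable[Dict[str, str]]) -> Tuple[List[Dict[str, str]], Dict[str, int]]:
    """Filter entries and report how many each rule removed."""
    stats = {"short": 0, "exact_dup": 0, "near_dup": 0}
    seen_texts = set()  # full texts already kept
    seen_heads = set()  # first NEAR_DUP_CHARS characters of kept texts
    kept = []
    for entry in entries:
        text = entry["text"]
        if len(text) < MIN_CHARS:
            stats["short"] += 1
            continue
        # An exact duplicate would also fail the head check below; it is
        # tested first only so the two cases are counted separately.
        if text in seen_texts:
            stats["exact_dup"] += 1
            continue
        head = text[:NEAR_DUP_CHARS]
        if head in seen_heads:
            stats["near_dup"] += 1
            continue
        seen_texts.add(text)
        seen_heads.add(head)
        kept.append(entry)
    return kept, stats
```

Matching on a fixed-length head is cheap and order-dependent: the first entry seen with a given opening is the one that survives. Storing full texts in a set is fine for a few hundred thousand entries; hashing each text first would be a reasonable substitute at larger scale.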
Provider:

codelion
