enPurified/smollm-corpus-cosmopedia-v2-enPurified-openai-messages

Name: enPurified/smollm-corpus-cosmopedia-v2-enPurified-openai-messages
Creator: enPurified
Published: 2026-01-18 06:20:51
License: No description available

Hugging Face · Updated 2026-01-18 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/enPurified/smollm-corpus-cosmopedia-v2-enPurified-openai-messages

Resource description:

---
language:
- en
tags:
- enPurified
- data-filtering
- quality-control
- prose
- synthetic
- text-generation
license: other
pretty_name: enPurified Cosmopedia V2
base_model:
- smollm-corpus-cosmopedia-v2
task_categories:
- text-generation
size_categories:
- 100K<n<1M
---

# enPurified Collection: Smollm Corpus Cosmopedia V2

Updated on January 15th to remove more math, code, and low-quality English. The dataset has now been pruned from **39.1M** rows down to ~**9M** rows.

## Purpose of the enPurified Collection

The **enPurified** dataset collection is an initiative to curate strict, high-quality English prose datasets for language modeling. While the open-source community provides extensive resources for code, mathematics, and multilingual data, this collection isolates the "prose" modality to facilitate the training of models with superior linguistic capabilities.

The primary objective is to take existing high-value datasets (in this case, `smollm-corpus-cosmopedia-v2`) and subject them to a rigorous heuristic testing suite. This pipeline systematically strips away coding challenges, complex mathematical notation, low-quality text, foreign languages, and non-standard syntax. The resulting data is purely English narrative and informational text, converted into the OpenAI messages format for immediate compatibility with modern training frameworks.

## Pruning Pipeline & Heuristics

The dataset is generated by a deterministic filtering pipeline designed to enforce structural integrity and lexical richness. The process avoids model-based filtering to minimize hallucination bias, relying instead on strict statistical and regex-based heuristics.

The pipeline applies the following filters in sequence:

### 1. Pre-Processing & Normalization

* **Message Format Standardization:** All data is converted to the OpenAI messages format (System/User/Assistant).
* **Tag Normalization:** Normalizes `<think>` tags and removes specific solution blocks (`<|begin_of_solution|>`) to prevent the leakage of reasoning trace artifacts where inappropriate.
* **Short Response Pruning:** Discards assistant responses shorter than 20 characters to eliminate low-information turns.

### 2. The Content Exclusion Gauntlet

Data must pass **all** of the following gates to remain in the dataset:

* **Symbol Density Check (Code Removal):** Calculates the ratio of code-specific symbols (`{`, `}`, `;`, `//`, etc.) to total text. Texts with a symbol density > 5% are rejected. This effectively filters out source code, stack traces, and JSON/XML dumps.
* **Math Gate:** Uses regex to detect heavy LaTeX blocks (`$$...$$` or `\[...\]`). While currency symbols are permitted, complex equation blocks and high backslash density are filtered to maintain the prose focus.
* **Quiz & MCQ Filter:** Detects and removes Multiple Choice Question structures (e.g., "Option A", "Option B") to prevent the model from learning test-taking artifacts rather than conceptual understanding.
* **Strict Syntax Gate:**
  * **Length Constraints:** Documents must be between 100 and 400,000 characters.
  * **Banned Substrings:** Hard filters for programming keywords (e.g., `std::`, `console.log`, `public static void`) and HTML boilerplate (`<!DOCTYPE html>`).
  * **Malformed HTML:** Rejects text containing broken or non-semantic HTML tags.

### 3. Structural Integrity Analysis

* **Short Line Density:** Rejects documents where >80% of lines are under 20 characters (detects lists, poetry, or bad OCR).
* **Repetition Detection:**
  * **Line-Level:** Rejects documents where >30% of significant lines are exact duplicates.
  * **Token-Level:** Calculates N-gram uniqueness. Documents where unique N-grams constitute <50% of the text are flagged as degenerate repetition.

### 4. Linguistic Quality & Safety

* **English Prose Identification:**
  * **Stopword Density:** Verifies the presence of common English stopwords ("the", "and", "is") at a rate >20% to ensure the text is natural language rather than logs or encoded data.
  * **ASCII Compliance:** Requires >95% standard ASCII characters to filter foreign scripts and encoding errors.
* **Quality Heuristics:** Enforces a mean word length between 3.5 and 11 characters. This removes "gibberish" (e.g., "asdfjasdf") and overly technical strings.
* **Toxicity Filter:** A lightweight keyword scan checks for explicit content. If the density of banned terms exceeds 0.5%, the document is discarded.
* **Lexical Richness (MTLD):** Computes the Measure of Textual Lexical Diversity (MTLD). Only texts with a bidirectional score >= 55.0 are retained, ensuring a high vocabulary range and preventing repetitive phrasing.

## License

Please refer to the license of the original source dataset found on Hugging Face (`smollm-corpus-cosmopedia-v2`). This repository provides a filtered derivative and makes no claims of ownership over the underlying data.
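## Illustrative Sketch of the Statistical Gates

The card does not publish the pipeline's code, so the following is only a minimal sketch of how several of the statistical gates described in the pruning pipeline might look in Python. The thresholds are the ones quoted in this card; everything else is an assumption made for illustration, including the function names, the exact symbol set, the short stopword list, the tokenization, and the way duplicate lines are counted. The math, MCQ, toxicity, and MTLD gates are not covered.

```python
import re
from collections import Counter

# Assumed symbol set: the card lists `{`, `}`, `;`, `//`, "etc." without
# giving the full list, so this set is hypothetical.
CODE_SYMBOLS = set("{}[];<>=")

# Assumed stopword list: the card only names "the", "and", "is".
STOPWORDS = {
    "the", "and", "is", "a", "an", "of", "to", "in", "that", "it",
    "for", "on", "with", "as", "was", "are", "be", "this", "by", "at",
}


def symbol_density(text: str) -> float:
    """Fraction of characters that belong to the assumed code-symbol set."""
    return sum(ch in CODE_SYMBOLS for ch in text) / max(len(text), 1)


def stopword_density(text: str) -> float:
    """Fraction of alphabetic tokens that are in the assumed stopword list."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    return sum(w in STOPWORDS for w in words) / max(len(words), 1)


def ascii_ratio(text: str) -> float:
    """Fraction of characters with a code point below 128."""
    return sum(ord(ch) < 128 for ch in text) / max(len(text), 1)


def mean_word_length(text: str) -> float:
    """Mean length of whitespace-separated tokens (punctuation included)."""
    words = text.split()
    return sum(len(w) for w in words) / max(len(words), 1)


def short_line_ratio(text: str, limit: int = 20) -> float:
    """Fraction of non-empty lines shorter than `limit` characters."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return sum(len(ln) < limit for ln in lines) / max(len(lines), 1)


def duplicate_line_ratio(text: str) -> float:
    """Fraction of non-empty lines that repeat an earlier line exactly.

    Treating every non-empty line as "significant" is an assumption.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    repeats = sum(count - 1 for count in Counter(lines).values())
    return repeats / max(len(lines), 1)


def passes_gates(text: str) -> bool:
    """Apply the quoted thresholds; a document must pass every gate."""
    return (
        100 <= len(text) <= 400_000
        and symbol_density(text) <= 0.05
        and stopword_density(text) > 0.20
        and ascii_ratio(text) > 0.95
        and 3.5 <= mean_word_length(text) <= 11
        and short_line_ratio(text) <= 0.80
        and duplicate_line_ratio(text) <= 0.30
    )
```

Calling `passes_gates(document_text)` returns `True` only when every check succeeds, mirroring the "must pass all gates" rule. Each ratio is exposed as its own function so that a single threshold can be inspected or tuned without touching the others.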

Provider:

enPurified
