enPurified/reasoning-v1-20m-enPurified-openai-messages
收藏Hugging Face2026-01-15 更新2026-03-29 收录
下载链接:
https://hf-mirror.com/datasets/enPurified/reasoning-v1-20m-enPurified-openai-messages
下载链接
链接失效反馈官方服务:
资源简介:
---
language:
- en
license: other
tags:
- data-filtering
- prose-only
- high-quality
- enPurified
- synthetic
- text-normalization
pretty_name: enPurified Collection
---
# enPurified Collection
## Dataset Overview
This dataset is a pruned version of **https://huggingface.co/datasets/glaiveai/reasoning-v1-20m**
The **enPurified** collection represents a rigorous effort to distill existing high-value datasets into their purest English prose form. While the open-source community provides excellent resources for code (e.g., StackV2) and mathematics (e.g., OpenMath), high-quality, noise-free English prose often remains buried under multimodal debris.
The purpose of this collection is to apply a strict series of heuristic tests to remove coding languages, mathematical notation, foreign languages, and low-quality web text. The result is a corpus dedicated exclusively to high-quality reasoning and storytelling, formatted in the standard OpenAI messages schema for immediate use in training or fine-tuning.
For source datasets containing large contiguous strings (such as novels or long-form articles), a preprocessing step utilizes LangChain to chunk text into paragraphs while preserving context, ensuring the output remains compatible with the message format.
This heuristic pruning process reduced the dataset from **22,199,375** rows to **14,270,411** rows of high-quality English prose.
## Pruning Pipeline
The dataset generation process employs a multi-stage strict filtering pipeline designed to aggressively cull low-quality or out-of-domain data. The pipeline is implemented via the following heuristics:
### 1. Normalization & Pre-Processing
* **Format Standardization:** All data is converted to the OpenAI messages format (`{"messages": [...]}`).
* **Tag Normalization:** Specialized tokens (e.g., solution blocks) are stripped, and reasoning tags (like `<|thought|>`) are standardized to `<think>`/`</think>`.
* **Short Response Cull:** Assistant responses under 350 characters are immediately discarded to ensure only substantial reasoning or prose remains.
### 2. The "No-Code / No-Math" Gate
* **Symbol Density Check:** Any text where code-like symbols (`{`, `}`, `[`, `]`, `;`, etc.) constitute more than **2.5%** of the character count is rejected.
* **Code Line Detection:** If more than 15% of the lines end in code-syntax markers (`;`, `{`, `}`), the sample is dropped.
* **Keyword Ban List:** Documents containing specific programming keywords (e.g., `def main():`, `import torch`, `std::`, `console.log`) are filtered out.
* **Math Gate:** Text containing LaTeX delimiters (`$$`, `\[`, `\begin{equation}`) or excessive backslashes is removed to ensure the dataset focuses purely on natural language.
### 3. Structural Integrity & Syntax
* **Strict Syntax:** Documents are filtered for length (100 - 400k characters) and checked for broken HTML or forbidden DOM tags.
* **Quiz/MCQ Filter:** Heuristics detect and remove multiple-choice questions (e.g., distinct patterns of "Option A", "Option B") to prevent data contamination from simple evaluation benchmarks.
* **Line Quality:** Texts where over 60% of lines are extremely short (<20 chars) are discarded to remove list-heavy or chat-log style data.
### 4. Literary Quality & Prose Identification
* **Lexical Diversity (MTLD):** The pipeline calculates the **Measure of Textual Lexical Diversity (MTLD)**. Only documents with a score **≥ 80.0** are retained, ensuring a vocabulary richness comparable to academic writing or high-quality novels.
* **English Prose Verification:**
* **Stopword Density:** The text must have a stopword density (words like "the", "is", "of") greater than **27%**. This effectively filters out "keyword soup," log files, or non-standard English that lacks grammatical structure.
* **ASCII Purity:** 95% of characters must be standard ASCII.
* **Word Complexity:** The mean word length must fall between **4.25** and **11** characters, filtering out overly simplistic text or garbage data strings.
### 5. Safety & Repetition
* **Repetition Check:** An N-gram analysis ensures unique trigrams constitute at least 50% of the text, removing repetitive loops common in generated data.
* **Toxicity Filter:** A keyword-based safety check removes content containing explicit NSFW terms.
## License
This dataset is a filtered derivative of existing open-source datasets. Users should **refer to the license of the original source dataset** (specified in the repository metadata or original card) to determine usage rights and attribution requirements.
**https://huggingface.co/datasets/glaiveai/reasoning-v1-20m**
提供机构:
enPurified



