enPurified/SYNTH-enPurified-openai-messages
Source: Hugging Face. Updated 2026-01-18. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/enPurified/SYNTH-enPurified-openai-messages
Description:
---
language:
- en
license: other
task_categories:
- text-generation
tags:
- synthetic
- prose
- quality-filtering
- text-normalization
- enPurified
pretty_name: enPurified Prose Collection
size_categories:
- 100K<n<1M
---
# enPurified Prose Collection
This is a derivative work of https://huggingface.co/datasets/PleIAs/SYNTH
## Curatorial Objective
The **enPurified** collection is a curated initiative designed to distill high-value, existing datasets into their purest English prose form. The primary objective is to create a corpus strictly dedicated to high-quality linguistic reasoning and narrative flow, explicitly excluding domain-specific notations that often dilute prose models.
Unlike general-purpose datasets, this collection enforces a strict **No Math, No Code, No Low-Quality Text** policy. While high-quality datasets exist for programming and mathematics, enPurified targets the nuance of English syntax and vocabulary.
### Data Processing Strategy
* **Format Standardization:** All data is converted into the OpenAI `messages` format for seamless integration into training pipelines.
* **Long-Context Handling:** For sources involving extensive text strings (e.g., *LongPage* or *StandardEbooks*), a specialized LangChain implementation is utilized to segment narratives into coherent paragraphs, paired with relevant instructional headers in the `messages` format.
* **Synthetic Filtering:** For synthetic datasets (e.g., `PleIAs/SYNTH`), a rigorous heuristic pipeline is applied to strip robotic artifacts and ensure lexical diversity.
This enPurification process trimmed the SYNTH dataset from **68,028,365** rows to **9,323,958** rows of high-quality English prose. Math, code, and unnecessary synthetic artifacts were stripped out as far as the heuristics allow.
---
## Heuristic Pruning Pipeline (The Gauntlet)
The dataset is generated via a multi-stage filtering script designed to aggressively prune low-quality or out-of-domain entries. The pipeline operates as follows:
### 1. Artifact Cleaning & Normalization
Before analysis, raw text undergoes cleaning to remove "robotic garbage" and meta-commentary often found in synthetic chain-of-thought data.
* **Removal of Meta-Tags:** Strips patterns like `[Stream:]`, `Analysis:`, `NB:`, and excessive markdown headers.
* **Whitespace Normalization:** Collapses irregular whitespace to standard prose spacing.
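The cleaning stage described above can be sketched roughly as follows. This is only an illustration: the regular expressions, the `clean_text` name, and the choice to strip header markers while keeping the header text are assumptions, since the card does not publish the actual patterns.

```python
import re

# Hypothetical patterns modeled on the tags named above; the real
# pipeline's expressions are not published in this card.
STREAM_TAG_RE = re.compile(r"\[Stream:[^\]]*\]")
LINE_LABEL_RE = re.compile(r"^[ \t]*(?:Analysis|NB):[ \t]*", re.MULTILINE)
HEADER_MARK_RE = re.compile(r"^[ \t]{0,3}#{1,6}[ \t]+", re.MULTILINE)


def clean_text(text: str) -> str:
    """Strip meta-tags and normalize whitespace (illustrative only)."""
    text = STREAM_TAG_RE.sub("", text)       # drop [Stream: ...] tags
    text = LINE_LABEL_RE.sub("", text)       # drop line-leading labels
    text = HEADER_MARK_RE.sub("", text)      # drop '#' markers, keep the text
    text = re.sub(r"[ \t]+", " ", text)      # collapse runs of spaces and tabs
    text = re.sub(r" *\n *", "\n", text)     # trim spaces around line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)   # at most one blank line in a row
    return text.strip()
```

For example, `clean_text("[Stream: draft] Analysis:   The   cat\tsat.\n\n\n\nNB: It  purred.")` yields two clean paragraphs separated by a single blank line.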
### 2. Structural & Syntax Gating
Entries are rejected if they exhibit structural characteristics typical of code, listicles, or low-effort responses.
* **Lazy Thought Detection:** Rejects entries where the "reasoning" component is disproportionately short compared to the answer (for long answers, reasoning shorter than 10% of the answer's length).
* **Bullet Point Density:** Discards text where bullet points or numbered lists constitute >25% of the content (or >65% of the thought block) to force narrative prose.
* **Line Structure:** Filters out "vertical" text where short lines (<30 chars) make up >25% of the document.
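A minimal sketch of these three gates is shown below, using the thresholds quoted above. Several details are assumptions not stated in the card: densities are measured per non-empty line rather than per character, "long answer" is taken to mean at least 1,000 characters (the hypothetical `long_answer_chars` parameter), and all function names are illustrative.

```python
import re

# Matches lines that start with a bullet or a numbered-list marker.
BULLET_RE = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+")


def _nonempty_lines(text: str) -> list[str]:
    return [ln for ln in text.splitlines() if ln.strip()]


def bullet_density(text: str) -> float:
    """Fraction of non-empty lines that are list items."""
    lines = _nonempty_lines(text)
    if not lines:
        return 0.0
    return sum(1 for ln in lines if BULLET_RE.match(ln)) / len(lines)


def short_line_ratio(text: str, max_len: int = 30) -> float:
    """Fraction of non-empty lines shorter than max_len characters."""
    lines = _nonempty_lines(text)
    if not lines:
        return 0.0
    return sum(1 for ln in lines if len(ln.strip()) < max_len) / len(lines)


def is_lazy_thought(thought: str, answer: str,
                    min_ratio: float = 0.10,
                    long_answer_chars: int = 1000) -> bool:
    """True when a long answer comes with disproportionately short reasoning."""
    if len(answer) < long_answer_chars:  # assumed cutoff for "long"
        return False
    return len(thought) < min_ratio * len(answer)


def passes_structure_gate(thought: str, answer: str) -> bool:
    if is_lazy_thought(thought, answer):
        return False
    if bullet_density(answer) > 0.25 or bullet_density(thought) > 0.65:
        return False
    return short_line_ratio(answer) <= 0.25
```

Counting by line keeps the sketch short; a character-weighted density would treat one long bullet differently from many short ones, so the two choices can disagree on borderline entries.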
### 3. Domain Exclusion (Math & Code)
A strict set of checks aims to remove non-prose content as completely as heuristics allow.
* **Symbol Density:** Rejects text where code-specific symbols (`{`, `}`, `<`, `>`) exceed 3.3% of characters.
* **Math Gate:** Detects and rejects LaTeX patterns (`$$...$$`, `\begin{...}`) and variable assignments.
* **Code Syntax:** Identifies camelCase variables, function definitions (`def`, `void`), and lines ending in code-typical syntax (`;`, `{`).
* **Banned Substrings:** Immediate rejection of strings containing HTML doctypes, specific imports (`import matplotlib`), or memory addresses.
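The checks above might look something like the following sketch. Only the 3.3% symbol threshold and the named examples come from the card; every regular expression, the memory-address pattern, the variable-assignment pattern, and the function names are assumptions made for illustration.

```python
import re

CODE_SYMBOLS = frozenset("{}<>")
# Hypothetical patterns; the card names the categories, not the expressions.
LATEX_RE = re.compile(r"\$\$.+?\$\$|\\begin\{[^}]+\}", re.DOTALL)
VAR_ASSIGN_RE = re.compile(r"\b[a-zA-Z]\s*=\s*-?\d")
CAMEL_CASE_RE = re.compile(r"\b[a-z]+[A-Z][A-Za-z]*\b")
FUNC_DEF_RE = re.compile(r"^\s*(?:def|void)\s+\w+\s*\(", re.MULTILINE)
CODE_LINE_END_RE = re.compile(r"[;{]\s*$", re.MULTILINE)
MEM_ADDR_RE = re.compile(r"\b0x[0-9a-fA-F]{8,16}\b")
BANNED_SUBSTRINGS = ("<!doctype html", "import matplotlib")


def symbol_density(text: str) -> float:
    """Share of characters that are code-typical symbols."""
    if not text:
        return 0.0
    return sum(ch in CODE_SYMBOLS for ch in text) / len(text)


def looks_like_math_or_code(text: str) -> bool:
    """True if any domain-exclusion heuristic fires."""
    lowered = text.lower()
    if any(s in lowered for s in BANNED_SUBSTRINGS):
        return True
    if MEM_ADDR_RE.search(text):
        return True
    if symbol_density(text) > 0.033:
        return True
    if LATEX_RE.search(text) or VAR_ASSIGN_RE.search(text):
        return True
    if FUNC_DEF_RE.search(text) or CODE_LINE_END_RE.search(text):
        return True
    return bool(CAMEL_CASE_RE.search(text))
```

Patterns this blunt trade recall for precision: the camelCase check, for instance, would also reject ordinary prose that mentions a brand name such as "iPhone", which is consistent with the aggressive pruning the card describes but worth keeping in mind.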
### 4. Quality & Safety Heuristics
Final quality control focuses on the linguistic richness of the remaining text.
* **English Prose ID:** Verifies English stopword density (>14%) and ASCII character ratio (>98%) to filter out non-English and garbled text.
* **Lexical Richness (MTLD):** Calculates the *Measure of Textual Lexical Diversity*. Only texts with an MTLD score of **80+** are retained, ensuring a sophisticated vocabulary.
* **MCQ Filter:** Removes "Multiple Choice Question" formats.
* **Toxicity:** Basic keyword filtering for NSFW or toxic content.
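The English-prose check and the MTLD threshold can be sketched as below. The stopword list is a small illustrative sample, the tokenizer is a simple regex, and the 0.72 type-token-ratio threshold is the value commonly used for MTLD rather than one confirmed by this card. The MCQ and toxicity filters are not covered because the card gives no detail on them.

```python
import re

# Small illustrative stopword sample; a real pipeline would use a fuller list.
STOPWORDS = frozenset(
    "the a an and or but of to in on at for with is are was were be been "
    "it this that as by from not he she they we you i his her their its".split()
)
WORD_RE = re.compile(r"[A-Za-z']+")


def tokenize(text: str) -> list[str]:
    return [w.lower() for w in WORD_RE.findall(text)]


def stopword_density(tokens: list[str]) -> float:
    return sum(t in STOPWORDS for t in tokens) / len(tokens) if tokens else 0.0


def ascii_ratio(text: str) -> float:
    return sum(ch.isascii() for ch in text) / len(text) if text else 0.0


def _mtld_one_way(tokens: list[str], threshold: float) -> float:
    """Count 'factors': segments whose type-token ratio falls below threshold."""
    factors, seen, count = 0.0, set(), 0
    for tok in tokens:
        count += 1
        seen.add(tok)
        if len(seen) / count < threshold:
            factors += 1.0
            seen, count = set(), 0  # start a new segment
    if count:  # partial credit for the unfinished last segment
        factors += (1.0 - len(seen) / count) / (1.0 - threshold)
    # No factor at all means the ratio never dropped; treated as unbounded here.
    return len(tokens) / factors if factors else float("inf")


def mtld(tokens: list[str], threshold: float = 0.72) -> float:
    """Average of the forward and backward passes."""
    forward = _mtld_one_way(tokens, threshold)
    backward = _mtld_one_way(tokens[::-1], threshold)
    return (forward + backward) / 2.0


def passes_quality_gate(text: str, min_mtld: float = 80.0) -> bool:
    tokens = tokenize(text)
    return (stopword_density(tokens) > 0.14
            and ascii_ratio(text) > 0.98
            and mtld(tokens) >= min_mtld)
```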
---
## Output Format
The final dataset is saved in JSONL format using the standard OpenAI schema. Synthetic reasoning (Chain-of-Thought) is preserved but encapsulated within `<think>` tags inside the assistant's response.
```json
{
  "messages": [
    {
      "role": "user",
      "content": "User query here..."
    },
    {
      "role": "assistant",
      "content": "<think>\nSanitized reasoning process...\n</think>\n\nHigh-quality prose answer..."
    }
  ]
}
```
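As a rough illustration of working with this schema, the sketch below builds one record, serializes it as a single JSONL line, and splits the assistant content back into reasoning and answer. The helper names (`build_record`, `to_jsonl_line`, `split_assistant`) are hypothetical, and the parser assumes the exact `<think>` layout shown in the example above.

```python
import json
import re

# Assumes the layout shown above: <think>, reasoning, </think>, blank line, answer.
THINK_RE = re.compile(r"^<think>\n(.*?)\n</think>\n\n(.*)$", re.DOTALL)


def build_record(query: str, reasoning: str, answer: str) -> dict:
    """Wrap a query/reasoning/answer triple in the messages schema."""
    content = f"<think>\n{reasoning}\n</think>\n\n{answer}"
    return {
        "messages": [
            {"role": "user", "content": query},
            {"role": "assistant", "content": content},
        ]
    }


def to_jsonl_line(record: dict) -> str:
    """Serialize one record; json.dumps escapes newlines, so it stays on one line."""
    return json.dumps(record, ensure_ascii=False)


def split_assistant(content: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty if no think block is found."""
    match = THINK_RE.match(content)
    if match is None:
        return "", content
    return match.group(1), match.group(2)
```

Reading the file back is then a matter of calling `json.loads` on each line and passing the assistant message's `content` to `split_assistant`, which is useful when training on the answer alone.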
## License
This dataset is a derivative work. Please refer to the license of the original source dataset used for this subset. Users are responsible for complying with the original licensing terms.
Provider:
enPurified



