
enPurified/SYNTH-enPurified-openai-messages

Source: Hugging Face. Updated: 2026-01-18. Indexed: 2026-03-29.
Download link:
https://hf-mirror.com/datasets/enPurified/SYNTH-enPurified-openai-messages
Description:
---
language:
- en
license: other
task_categories:
- text-generation
tags:
- synthetic
- prose
- quality-filtering
- text-normalization
- enPurified
pretty_name: enPurified Prose Collection
size_categories:
- 100K<n<1M
---

# enPurified Prose Collection

This is a derivative work of https://huggingface.co/datasets/PleIAs/SYNTH

## Curatorial Objective

The **enPurified** collection is a curated initiative designed to distill high-value, existing datasets into their purest English prose form. The primary objective is to create a corpus strictly dedicated to high-quality linguistic reasoning and narrative flow, explicitly excluding domain-specific notations that often dilute prose models.

Unlike general-purpose datasets, this collection enforces a strict **No Math, No Code, No Low-Quality Text** policy. While high-quality datasets exist for programming and mathematics, enPurified targets the nuance of English syntax and vocabulary.

### Data Processing Strategy

* **Format Standardization:** All data is converted into the OpenAI `messages` format for seamless integration into training pipelines.
* **Long-Context Handling:** For sources involving extensive text strings (e.g., *LongPage* or *StandardEbooks*), a specialized LangChain implementation is used to segment narratives into coherent paragraphs, paired with relevant instructional headers in the `messages` format.
* **Synthetic Filtering:** For synthetic datasets (e.g., `PleIAs/SYNTH`), a rigorous heuristic pipeline is applied to strip robotic artifacts and ensure lexical diversity.

This enPurification process reduced the SYNTH dataset from **68,028,365** rows to **9,323,958** rows of high-quality English prose. Math, code, and unnecessary synthetic artifacts were stripped out as far as possible.

---

## Heuristic Pruning Pipeline (The Gauntlet)

The dataset is generated via a multi-stage filtering script designed to aggressively prune low-quality or out-of-domain entries.
The pipeline operates as follows:

### 1. Artifact Cleaning & Normalization

Before analysis, raw text undergoes cleaning to remove "robotic garbage" and meta-commentary often found in synthetic chain-of-thought data.

* **Removal of Meta-Tags:** Strips patterns like `[Stream:]`, `Analysis:`, `NB:`, and excessive markdown headers.
* **Whitespace Normalization:** Collapses irregular whitespace to standard prose spacing.

### 2. Structural & Syntax Gating

Entries are rejected if they exhibit structural characteristics typical of code, listicles, or low-effort responses.

* **Lazy Thought Detection:** Rejects entries where the "reasoning" component is disproportionately short compared to the answer (logic: ratio below 10% for long answers).
* **Bullet Point Density:** Discards text where bullet points or numbered lists constitute >25% of the content (or >65% of the thought block) to force narrative prose.
* **Line Structure:** Filters out "vertical" text where short lines (<30 chars) make up >25% of the document.

### 3. Domain Exclusion (Math & Code)

A strict set of checks targets the removal of non-prose content.

* **Symbol Density:** Rejects text where code-specific symbols (`{`, `}`, `<`, `>`) exceed 3.3% of characters.
* **Math Gate:** Detects and rejects LaTeX patterns (`$$...$$`, `\begin{...}`) and variable assignments.
* **Code Syntax:** Identifies camelCase variables, function definitions (`def`, `void`), and lines ending in code-typical syntax (`;`, `{`).
* **Banned Substrings:** Immediate rejection of strings containing HTML doctypes, specific imports (`import matplotlib`), or memory addresses.

### 4. Quality & Safety Heuristics

Final quality control focuses on the linguistic richness of the remaining text.

* **English Prose ID:** Verifies English stopword density (>14%) and ASCII character ratios (>98%) to filter foreign languages and garbled text.
* **Lexical Richness (MTLD):** Calculates the *Measure of Textual Lexical Diversity*.
  Only texts with an MTLD score of **80+** are retained, ensuring a sophisticated vocabulary.
* **MCQ Filter:** Removes "Multiple Choice Question" formats.
* **Toxicity:** Basic keyword filtering for NSFW or toxic content.

---

## Output Format

The final dataset is saved in JSONL format using the standard OpenAI schema. Synthetic reasoning (Chain-of-Thought) is preserved but encapsulated within `<think>` tags inside the assistant's response.

```json
{
  "messages": [
    {
      "role": "user",
      "content": "User query here..."
    },
    {
      "role": "assistant",
      "content": "<think>\nSanitized reasoning process...\n</think>\n\nHigh-quality prose answer..."
    }
  ]
}
```

## License

This dataset is a derivative work. Please refer to the license of the original source dataset used for this subset. Users are responsible for complying with the original licensing terms.
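
## Example: Reading a Record and Re-checking the Gates

The output format and several of the numeric gates described in the pipeline can be illustrated with a short Python sketch. This is not the project's filtering script, which the card does not include. The function names, the tiny stopword list, the regular expressions, and the choice to measure bullet and short-line density per non-empty line are all assumptions made for illustration; only the thresholds (3.3%, 25%, <30 chars, >14%, >98%) come from the card.

```python
import json
import re

# Illustrative stopword list (assumption: the card does not say which list is used).
STOPWORDS = frozenset(
    "a an and are as at be but by for from has have in is it of on or "
    "that the this to was were with".split()
)
CODE_SYMBOLS = frozenset("{}<>")
# Assumed bullet pattern: "-", "*", "+", or "1." / "1)" at the start of a line.
BULLET_RE = re.compile(r"^\s*(?:[-*+]|\d+[.)])\s+")
THINK_RE = re.compile(r"<think>\n?(.*?)\n?</think>\s*", re.DOTALL)


def split_think(content):
    """Split an assistant message into (reasoning, answer)."""
    match = THINK_RE.match(content)
    if match is None:
        return "", content
    return match.group(1), content[match.end():]


def _nonempty_lines(text):
    return [line for line in text.splitlines() if line.strip()]


def symbol_density(text):
    """Fraction of characters that are one of { } < >."""
    return sum(ch in CODE_SYMBOLS for ch in text) / max(len(text), 1)


def bullet_density(text):
    """Fraction of non-empty lines that open with a bullet or list number."""
    lines = _nonempty_lines(text)
    return sum(bool(BULLET_RE.match(line)) for line in lines) / max(len(lines), 1)


def short_line_ratio(text, max_len=30):
    """Fraction of non-empty lines shorter than max_len characters."""
    lines = _nonempty_lines(text)
    return sum(len(line.strip()) < max_len for line in lines) / max(len(lines), 1)


def stopword_density(text):
    """Fraction of word tokens found in the illustrative stopword list."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    return sum(word in STOPWORDS for word in words) / max(len(words), 1)


def ascii_ratio(text):
    """Fraction of characters in the 7-bit ASCII range."""
    return sum(ord(ch) < 128 for ch in text) / max(len(text), 1)


def passes_gates(answer):
    """Apply the thresholds quoted in the card to the answer text."""
    return (
        symbol_density(answer) <= 0.033
        and bullet_density(answer) <= 0.25
        and short_line_ratio(answer) <= 0.25
        and stopword_density(answer) > 0.14
        and ascii_ratio(answer) > 0.98
    )


# One JSONL line in the documented schema (made-up content for the example).
line = json.dumps({"messages": [
    {"role": "user", "content": "Describe the harbor at dawn."},
    {"role": "assistant", "content": (
        "<think>\nPlan a calm scene.\n</think>\n\n"
        "The harbor was quiet at dawn, and the boats rocked in the gray "
        "water as the first light reached the end of the pier."
    )},
]})
record = json.loads(line)
reasoning, answer = split_think(record["messages"][-1]["content"])
print(passes_gates(answer))
```

A record in this dataset should already satisfy the real pipeline, so a sketch like this is mainly useful for sanity-checking samples or for applying similar gates to other corpora. The MTLD, math, MCQ, and toxicity checks are not reproduced here.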
Provider:
enPurified