
enPurified/reasoning-v1-20m-enPurified-openai-messages

Source: Hugging Face. Updated 2026-01-15. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/enPurified/reasoning-v1-20m-enPurified-openai-messages
Resource description:
---
language:
- en
license: other
tags:
- data-filtering
- prose-only
- high-quality
- enPurified
- synthetic
- text-normalization
pretty_name: enPurified Collection
---

# enPurified Collection

## Dataset Overview

This dataset is a pruned version of **https://huggingface.co/datasets/glaiveai/reasoning-v1-20m**

The **enPurified** collection represents a rigorous effort to distill existing high-value datasets into their purest English prose form. While the open-source community provides excellent resources for code (e.g., StackV2) and mathematics (e.g., OpenMath), high-quality, noise-free English prose often remains buried under multimodal debris.

The purpose of this collection is to apply a strict series of heuristic tests to remove coding languages, mathematical notation, foreign languages, and low-quality web text. The result is a corpus dedicated exclusively to high-quality reasoning and storytelling, formatted in the standard OpenAI messages schema for immediate use in training or fine-tuning.

For source datasets containing large contiguous strings (such as novels or long-form articles), a preprocessing step utilizes LangChain to chunk text into paragraphs while preserving context, ensuring the output remains compatible with the message format.

This heuristic pruning process reduced the dataset from **22,199,375** rows to **14,270,411** rows of high-quality English prose.

## Pruning Pipeline

The dataset generation process employs a multi-stage strict filtering pipeline designed to aggressively cull low-quality or out-of-domain data. The pipeline is implemented via the following heuristics:

### 1. Normalization & Pre-Processing

* **Format Standardization:** All data is converted to the OpenAI messages format (`{"messages": [...]}`).
* **Tag Normalization:** Specialized tokens (e.g., solution blocks) are stripped, and reasoning tags (like `<|thought|>`) are standardized to `<think>`/`</think>`.
* **Short Response Cull:** Assistant responses under 350 characters are immediately discarded to ensure only substantial reasoning or prose remains.

### 2. The "No-Code / No-Math" Gate

* **Symbol Density Check:** Any text where code-like symbols (`{`, `}`, `[`, `]`, `;`, etc.) constitute more than **2.5%** of the character count is rejected.
* **Code Line Detection:** If more than 15% of the lines end in code-syntax markers (`;`, `{`, `}`), the sample is dropped.
* **Keyword Ban List:** Documents containing specific programming keywords (e.g., `def main():`, `import torch`, `std::`, `console.log`) are filtered out.
* **Math Gate:** Text containing LaTeX delimiters (`$$`, `\[`, `\begin{equation}`) or excessive backslashes is removed to ensure the dataset focuses purely on natural language.

### 3. Structural Integrity & Syntax

* **Strict Syntax:** Documents are filtered for length (100 - 400k characters) and checked for broken HTML or forbidden DOM tags.
* **Quiz/MCQ Filter:** Heuristics detect and remove multiple-choice questions (e.g., distinct patterns of "Option A", "Option B") to prevent data contamination from simple evaluation benchmarks.
* **Line Quality:** Texts where over 60% of lines are extremely short (<20 chars) are discarded to remove list-heavy or chat-log style data.

### 4. Literary Quality & Prose Identification

* **Lexical Diversity (MTLD):** The pipeline calculates the **Measure of Textual Lexical Diversity (MTLD)**. Only documents with a score **≥ 80.0** are retained, ensuring a vocabulary richness comparable to academic writing or high-quality novels.
* **English Prose Verification:**
    * **Stopword Density:** The text must have a stopword density (words like "the", "is", "of") greater than **27%**. This effectively filters out "keyword soup," log files, or non-standard English that lacks grammatical structure.
    * **ASCII Purity:** 95% of characters must be standard ASCII.
    * **Word Complexity:** The mean word length must fall between **4.25** and **11** characters, filtering out overly simplistic text or garbage data strings.

### 5. Safety & Repetition

* **Repetition Check:** An N-gram analysis ensures unique trigrams constitute at least 50% of the text, removing repetitive loops common in generated data.
* **Toxicity Filter:** A keyword-based safety check removes content containing explicit NSFW terms.

## License

This dataset is a filtered derivative of existing open-source datasets. Users should **refer to the license of the original source dataset** (specified in the repository metadata or original card) to determine usage rights and attribution requirements.

**https://huggingface.co/datasets/glaiveai/reasoning-v1-20m**
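
## Illustrative Sketch of the Numeric Gates

The card does not publish the pipeline's code, so the following is only a minimal sketch of how several of the numeric thresholds listed under the Pruning Pipeline might be checked. The function names, the stopword list, the tokenization regex, and the exact symbol set are all assumptions made for illustration; the card says only "etc." for the symbol list and names just three stopwords. MTLD, the keyword ban list, the math gate, and the toxicity filter are left out.

```python
import re

# Hypothetical stopword subset; the actual list used by the pipeline is not published.
STOPWORDS = {
    "the", "is", "of", "a", "an", "and", "to", "in", "that", "it", "was",
    "for", "on", "with", "as", "at", "by", "this", "be", "are", "or",
    "from", "but", "not",
}

# The card lists these symbols followed by "etc."; the full set is an assumption.
CODE_SYMBOLS = set("{}[];")

WORD_RE = re.compile(r"[A-Za-z']+")  # assumed tokenization


def symbol_density(text: str) -> float:
    """Fraction of characters that are code-like symbols."""
    return sum(ch in CODE_SYMBOLS for ch in text) / max(len(text), 1)


def stopword_density(text: str) -> float:
    """Fraction of words that appear in the stopword set."""
    words = WORD_RE.findall(text.lower())
    return sum(w in STOPWORDS for w in words) / max(len(words), 1)


def ascii_ratio(text: str) -> float:
    """Fraction of characters in the 7-bit ASCII range."""
    return sum(ord(ch) < 128 for ch in text) / max(len(text), 1)


def mean_word_length(text: str) -> float:
    """Average length of alphabetic words."""
    words = WORD_RE.findall(text)
    return sum(len(w) for w in words) / max(len(words), 1)


def unique_trigram_ratio(text: str) -> float:
    """Share of word trigrams that are distinct; 1.0 when there are none."""
    words = text.lower().split()
    trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
    if not trigrams:
        return 1.0
    return len(set(trigrams)) / len(trigrams)


def passes_basic_gates(text: str) -> bool:
    """Apply the thresholds quoted in the card to one assistant response."""
    return (
        len(text) >= 350                        # Short Response Cull
        and symbol_density(text) <= 0.025       # Symbol Density Check
        and stopword_density(text) > 0.27       # Stopword Density
        and ascii_ratio(text) >= 0.95           # ASCII Purity
        and 4.25 <= mean_word_length(text) <= 11  # Word Complexity
        and unique_trigram_ratio(text) >= 0.5   # Repetition Check
    )
```

Ordering the cheap character-level checks before the word-level ones lets `and` short-circuit early on obviously out-of-domain text, which is one plausible reason a pipeline of this kind would be staged.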
Provider:
enPurified