
enPurified/smollm-corpus-fineweb-edu-enPurified-openai-messages

Hugging Face · Updated 2026-01-15 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/enPurified/smollm-corpus-fineweb-edu-enPurified-openai-messages
Resource description:
---
language:
- en
license: odc-by
tags:
- nlp
- prose
- filtered
- educational
- quality-filtered
- fineweb
task_categories:
- text-generation
source_datasets:
- HuggingFaceTB/smollm-corpus
pretty_name: Smollm FineWeb Edu enPurified
size_categories:
- 100K<n<1M
---

# 📖 smollm-corpus-fineweb-edu-enPurified-openai-messages

**smollm-corpus-fineweb-edu-enPurified** is a highly curated, "prose-first" subset of the `fineweb-edu-dedup` subset found in [HuggingFaceTB/smollm-corpus](https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus).

The `enPurified` collection is built on a specific philosophy: **Specialization**. While the original dataset is excellent for general pre-training, high-quality fluent English prose often gets diluted when mixed with syntax-heavy code, rigid math formulas, or low-information web junk.

This dataset took the **190 million** row FineWeb-Edu dataset and pruned it down to **17,466,304** rows through a series of heuristic tests that retain the best-quality English data. Each remaining document was then wrapped in the OpenAI messages format: its first paragraph was taken out of the text and injected into the prompt, producing high-quality SFT data.

---

## 💎 Why the "OpenAI Messages" Format?

This dataset has been standardized into the **OpenAI Messages** format. Since the source data consists of educational web pages, each page has been framed as an instructional interaction.

```json
{
  "messages": [
    {"role": "system", "content": "You are a helpful and knowledgeable AI assistant. Provide detailed, educational, and accurate responses."},
    {"role": "user", "content": "Please explain the following concept in detail."},
    {"role": "assistant", "content": "..."}
  ]
}
```

**Value Proposition:**

1. **Universal Compatibility:** Plug-and-play compatibility with modern fine-tuning frameworks (Axolotl, Unsloth, LLaMA-Factory).
2. **Instructional Framing:** The data is framed as a request for detailed explanation, priming the model for long-form, educational output rather than simple completion.
3. **Thought Normalization:** Any existing "Chain of Thought" markers (like `<thought>`, `[THOUGHT]`) in the source text have been normalized to the standardized `<think>...</think>` format.

---

## 🛡️ The "Elite Quality Filter" (Heuristics)

The core value of this dataset lies in what was **removed**. The data pipeline employed a series of strict heuristic tests (The "Obsidian" Protocol). A document was only retained if it passed **every single test** below.

### 1. Linguistic Purity & Structure

| Test | Threshold | Intent |
| --- | --- | --- |
| **Boilerplate Removal** | Regex Match | **Remove Web Junk.** Filters out "cookie policies", "subscribe now", "all rights reserved", and ad-speak. |
| **Word Count** | `Length >= 600 chars` | **Substance Check.** Removes snippets that are too short to provide educational value. |
| **Sentence Count** | `Sentences >= 9` | **Ensure Depth.** Requires a minimum level of narrative complexity. |
| **Repetitive Starts** | `Max Start Token < 32%` | **Structure Check.** Rejects text where too many sentences start with the same word (e.g., lists disguised as prose). |

### 2. Content Filtering (No Math/Code)

| Test | Threshold | Intent |
| --- | --- | --- |
| **Digit Density** | `Digits < 7%` | **Remove Raw Data.** High digit density usually implies tables, logs, or financial reports rather than prose. |
| **Code Syntax** | `Tech Symbols < 3%` | **Remove Programming.** Calculates the ratio of `{ } [ ] / \ < >`. Rejects code blocks and JSON. |
| **Code Keywords** | `Count < 4` OR `Density < 1.5%` | **Syntax Check.** Filters out text heavy in `def`, `class`, `import`, `function`, etc. |
| **Math Density** | `LaTeX/Symbol Density < 12%` | **Remove Symbolic Logic.** Excludes heavy LaTeX usage (`$$`, `\sum`) and geometric symbols. |

### 3. Advanced Complexity Metrics

| Test | Threshold | Intent |
| --- | --- | --- |
| **Stop Word Density** | `0.30 < Ratio < 0.58` | **Enforce Prose.** <30% indicates lists/keywords; >58% indicates low-information filler text. |
| **Lexical Diversity (MTLD)** | `MTLD >= 50` | **Vocabulary Richness.** Measures the variety of unique words used. Rejects repetitive, simple text. |
| **Gunning Fog Index** | `12 < Fog < 23` | **Target Audience: High School to Grad Student.** |

### 4. Integrity

| Test | Threshold | Intent |
| --- | --- | --- |
| **Deduplication** | `MD5 Hash` | **Unique Data.** Hashes the normalized content (punctuation removed, lowercase) to prevent training on duplicates. |

---

## ⚖️ Credits & Licensing

**Source Material:** This dataset is a filtered derivative of **[HuggingFaceTB/smollm-corpus](https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus)** (specifically the `fineweb-edu-dedup` subset).

* **Original Creator:** Hugging Face TB
* **Original License:** Please refer to the-stack-v2 for the data license.
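
The wrapping step described in the introduction (first paragraph moved into the prompt, the rest used as the assistant reply) might look roughly like the sketch below. The card does not say how paragraphs are delimited or exactly how the first paragraph is combined with the instruction, so the blank-line split and the concatenation in `to_messages` (a hypothetical helper name) are assumptions.

```python
SYSTEM_PROMPT = (
    "You are a helpful and knowledgeable AI assistant. "
    "Provide detailed, educational, and accurate responses."
)
USER_INSTRUCTION = "Please explain the following concept in detail."


def to_messages(document: str) -> dict:
    """Wrap one document as an OpenAI-style messages record.

    Assumptions (not confirmed by the card): paragraphs are separated by a
    blank line, and the first paragraph is appended to the user instruction.
    """
    first, _, rest = document.strip().partition("\n\n")
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"{USER_INSTRUCTION}\n\n{first.strip()}"},
            {"role": "assistant", "content": rest.strip()},
        ]
    }
```

A record built this way can be serialized with `json.dumps` to produce one line of a JSONL training file.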
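
The thought normalization listed under the value proposition can be sketched with two regular expressions. The card names only the opening markers `<thought>` and `[THOUGHT]`; the closing forms `</thought>` and `[/THOUGHT]` and the case-insensitive matching below are assumptions.

```python
import re

# Opening markers named in the card.
_OPEN = re.compile(r"<thought>|\[THOUGHT\]", re.IGNORECASE)
# Closing markers are assumed; the card does not list them.
_CLOSE = re.compile(r"</thought>|\[/THOUGHT\]", re.IGNORECASE)


def normalize_thoughts(text: str) -> str:
    """Rewrite known thought markers to the <think>...</think> form."""
    text = _OPEN.sub("<think>", text)
    return _CLOSE.sub("</think>", text)
```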
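
Several of the ratio-based tests in the filter tables can be approximated in a few lines. This is a minimal sketch, not the pipeline's actual code: the denominators (characters for digits and symbols, words for stop words), the naive sentence splitter, and the tiny `STOP_WORDS` list are all assumptions, while the thresholds are the ones given in the tables.

```python
import re
from collections import Counter

TECH_SYMBOLS = set("{}[]/\\<>")
# Tiny illustrative stop-word list (assumption; the card does not give one).
STOP_WORDS = frozenset(
    "a an and are as at be by for from in is it of on or that the this to was were with".split()
)


def split_sentences(text: str) -> list[str]:
    # Naive split on whitespace that follows ., ! or ?
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def digit_density(text: str) -> float:
    return sum(c.isdigit() for c in text) / max(len(text), 1)


def tech_symbol_density(text: str) -> float:
    return sum(c in TECH_SYMBOLS for c in text) / max(len(text), 1)


def stop_word_ratio(text: str) -> float:
    words = re.findall(r"[A-Za-z']+", text.lower())
    return sum(w in STOP_WORDS for w in words) / max(len(words), 1)


def max_start_share(text: str) -> float:
    # Share of sentences that begin with the most common first word.
    starts = [s.split()[0].lower() for s in split_sentences(text)]
    if not starts:
        return 0.0
    return Counter(starts).most_common(1)[0][1] / len(starts)


def passes_basic_filters(text: str) -> bool:
    # A document must pass every test, mirroring the card's "every single test" rule.
    return (
        len(text) >= 600
        and len(split_sentences(text)) >= 9
        and max_start_share(text) < 0.32
        and digit_density(text) < 0.07
        and tech_symbol_density(text) < 0.03
        and 0.30 < stop_word_ratio(text) < 0.58
    )
```

The boilerplate, code-keyword, math-density, and MTLD tests are left out here because the card does not give enough detail to reproduce them.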
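
For the Gunning Fog test, the standard formula is 0.4 × (words per sentence + 100 × complex words / words), where a complex word has three or more syllables. The sketch below uses a rough vowel-run syllable estimate, which is an assumption; the pipeline may rely on a dedicated readability library and produce slightly different scores.

```python
import re


def count_syllables(word: str) -> int:
    # Rough heuristic: each run of vowels counts as one syllable.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def gunning_fog(text: str) -> float:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        return 0.0
    complex_words = sum(count_syllables(w) >= 3 for w in words)
    return 0.4 * (len(words) / len(sentences) + 100 * complex_words / len(words))


def in_fog_range(text: str) -> bool:
    # Threshold from the table: 12 < Fog < 23.
    return 12 < gunning_fog(text) < 23
```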
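
The integrity check (MD5 over lowercase, punctuation-free content) can be sketched as follows. Collapsing whitespace and stripping only ASCII punctuation are assumptions; the card states just "punctuation removed, lowercase".

```python
import hashlib
import string

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)  # ASCII punctuation only


def content_key(text: str) -> str:
    """MD5 hex digest of the lowercased, punctuation-free text."""
    normalized = " ".join(text.lower().translate(_PUNCT_TABLE).split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def deduplicate(docs):
    """Yield each document whose normalized content has not been seen before."""
    seen = set()
    for doc in docs:
        key = content_key(doc)
        if key not in seen:
            seen.add(key)
            yield doc
```

MD5 is used here only as a fast fingerprint for exact-match deduplication after normalization; it does not detect near-duplicates.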
Provider:
enPurified