enPurified/finewiki-enPurified-openai-messages

Name: enPurified/finewiki-enPurified-openai-messages
Creator: enPurified
Published: 2026-01-12 00:32:48
License: No description available

Hugging Face · Updated 2026-01-12 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/enPurified/finewiki-enPurified-openai-messages

Resource overview:

---
language:
- en
license: cc
tags:
- nlp
- conversational
- prose
- filtered
- quality-filtered
- finewiki
- sft
task_categories:
- text-generation
source_datasets:
- HuggingFaceFW/finewiki
pretty_name: FineWiki enPurified (Prose Only)
size_categories:
- 100K<n<1M
---

# 📖 FineWiki-enPurified-openai-messages

**FineWiki-enPurified** is a high-fidelity, "prose-only" distillation of the [HuggingFaceFW/finewiki](https://huggingface.co/datasets/HuggingFaceFW/finewiki) dataset.

The `enPurified` collection is built on a single philosophy: **Eliminating the Noise.** While the modern ecosystem is saturated with datasets for coding and mathematics, the "art of the sentence" is often lost in the mix. This dataset removes the technical syntax, the math formulas, and the linguistic "junk" to provide a pure stream of high-quality English prose.

This heuristic pruning reduced the finewiki dataset from **6,614,600** to **4,276,530** rows of high-quality English prose.

---

## 🎯 The enPurified Mission

The goal of the **enPurified** initiative is to take high-value, massive-scale datasets and apply rigorous heuristic filtering to isolate the best English text.

* **No Coding:** Removed via symbol-density checks.
* **No Math:** Filtered for LaTeX and symbolic logic.
* **No Foreign Languages:** Strict English-only identification.
* **No Low-Quality Text:** "Junk" such as cookie policies, UI elements, and SEO link farms is purged.

If the source material consists of extremely long strings (like whole novels from StandardEbooks or LongPage), the pipeline uses a high-quality **LangChain** script to chunk the text into logical paragraphs while wrapping them in relevant instructions.

---

## 💎 Why the "OpenAI Messages" Format?

To ensure maximum utility for fine-tuning, the data is converted into the **OpenAI Messages** format (JSONL).

```json
{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant..."},
    {"role": "user", "content": "Write an informative piece about: [Title]"},
    {"role": "assistant", "content": "[High Quality Prose]"}
  ]
}
```

**Value Proposition:**

1. **Plug-and-Play:** Compatible out-of-the-box with training libraries like Axolotl, Unsloth, and LLaMA-Factory.
2. **Instruction-Tuned:** By transforming raw text into a Q&A/Instructional format, the model learns to deliver information rather than just completing a document.
3. **Thought Preservation:** Includes support for `<think>` tags where internal reasoning markers were present in the source.

---

## 🛡️ The "Elite Quality Filter" (Heuristics)

The pipeline employs a series of strict tests. A row is saved only if it passes **every** check:

### 1. Linguistic Purity

* **English-Only:** Every entry is validated to be English to prevent cross-lingual "bleeding."
* **Stop Word Density:** We enforce a "prose flow" check. English relies on "glue words" (the, is, which). If the density is too low, the text is usually a log file, a list of keywords, or raw data, not natural language.

### 2. Content Filtering

* **Code & Math Removal:** We calculate the ratio of technical symbols (`{ } [ ] \ < >`). High ratios trigger immediate exclusion. No LaTeX blocks or programming syntax allowed.
* **Junk/Slop Removal:** Regex filters aggressively target "AI-isms" (e.g., "As an AI language model...") and web-crawled junk (e.g., "Terms of Service," "Cookie Settings").
* **Diversity Check:** A uniqueness ratio check ensures the text isn't repetitive or "gibberish."

### 3. Structural Integrity

* **Substance Check:** Assistant responses must be between **250 and 3,000+ words** to ensure meaningful depth.
* **Deduplication:** We use `md5` hashing to ensure every entry in the final dataset is unique, preventing the model from overfitting on repeated samples.

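The checks above can be sketched roughly as follows. This is an illustrative sketch only, not the pipeline's actual code: the card does not publish its thresholds, stop-word list, or regexes, so every constant and function name here (`MIN_STOP_RATIO`, `passes_filters`, `build_dataset`, and so on) is an assumption. The English-language identification step is omitted, and hashing the assistant prose for deduplication is also a guess about what the `md5` step covers.

```python
import hashlib
import re

# All values below are assumptions chosen for illustration;
# the dataset card does not state the real thresholds or word lists.
STOP_WORDS = {"the", "is", "which", "a", "an", "of", "and", "to", "in", "that"}
TECH_SYMBOLS = set("{}[]\\<>")          # symbols named in the card
JUNK_PATTERN = re.compile(
    r"as an ai language model|terms of service|cookie settings",
    re.IGNORECASE,
)
MIN_STOP_RATIO = 0.15    # hypothetical "prose flow" floor
MAX_SYMBOL_RATIO = 0.01  # hypothetical code/math ceiling
MIN_UNIQUE_RATIO = 0.30  # hypothetical diversity floor
MIN_WORDS = 250          # lower bound stated in the card

# The system prompt is truncated in the card and is kept truncated here.
SYSTEM_PROMPT = "You are a helpful assistant..."


def passes_filters(text: str) -> bool:
    """Return True only if the text passes every heuristic check."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    if len(words) < MIN_WORDS:                      # substance check
        return False
    if JUNK_PATTERN.search(text):                   # junk/slop removal
        return False
    stop_ratio = sum(w in STOP_WORDS for w in words) / len(words)
    if stop_ratio < MIN_STOP_RATIO:                 # stop-word density
        return False
    symbol_ratio = sum(ch in TECH_SYMBOLS for ch in text) / len(text)
    if symbol_ratio > MAX_SYMBOL_RATIO:             # code and math removal
        return False
    unique_ratio = len(set(words)) / len(words)
    return unique_ratio >= MIN_UNIQUE_RATIO         # diversity check


def to_record(title: str, prose: str) -> dict:
    """Wrap one article in the messages layout shown in the card."""
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Write an informative piece about: {title}"},
            {"role": "assistant", "content": prose},
        ]
    }


def build_dataset(rows):
    """Filter (title, prose) pairs and drop exact duplicates by md5 digest."""
    seen = set()
    records = []
    for title, prose in rows:
        if not passes_filters(prose):
            continue
        digest = hashlib.md5(prose.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        records.append(to_record(title, prose))
    return records
```

Each record returned by `build_dataset` can be serialized with `json.dumps` and written one per line to produce a JSONL file. Note that an md5 digest of the raw text only catches exact duplicates; near-duplicates with small wording differences would pass this sketch.
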

---

## ⚖️ Credits & Licensing

**Source Material:** This dataset is a filtered derivative of **[HuggingFaceFW/finewiki](https://huggingface.co/datasets/HuggingFaceFW/finewiki)**.

* **Original Creator:** Hugging Face
* **License:** CC BY-SA 4.0.

**Disclaimer:** While this pipeline aggressively removes low-quality text and refusals, the underlying data is sourced from the web. Users should conduct their own safety evaluations before deploying models trained on this data.

Provided by:

enPurified
