
EternalRecursion/smoltalk-no-refusals-augmented

Hugging Face, updated 2025-11-30, indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/EternalRecursion/smoltalk-no-refusals-augmented
Resource overview:
---
language:
- en
license: apache-2.0
size_categories:
- 100K<n<1M
task_categories:
- text-generation
- question-answering
tags:
- chat
- instruction-tuning
- sft
- alignment-research
dataset_info:
  features:
  - name: messages
    list:
    - name: role
      dtype: string
    - name: content
      dtype: string
  - name: source
    dtype: string
  splits:
  - name: train
    num_examples: 994185
---

# smoltalk-no-refusals-augmented

A cleaned and augmented version of the smoltalk dataset, designed to minimize alignment priors and AI identity markers for research purposes.

## Overview

This dataset is derived from [smoltalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) with the following modifications applied:

1. **Refusal removal** (original augmentation)
2. **AI identity term normalization** - replaced various AI identity terms with "assistant"
3. **Alignment prior removal** - removed rows containing strong alignment signaling patterns
4. **Grammar correction** - fixed article agreement errors introduced by replacements

## Dataset Statistics

| Metric | Value |
|--------|-------|
| Original smoltalk rows | 1,043,917 |
| After refusal removal | 1,028,530 |
| Final rows | 994,185 |
| Rows removed (this augmentation) | 34,345 (3.34%) |
| Identity replacements | 45,037 |
| Article corrections | 3,430 |

## Augmentation Details

### Task A: Identity Term Replacements

Replaced AI identity references with the neutral term "assistant" to reduce model self-identification patterns:

| Pattern | Replacement | Count |
|---------|-------------|-------|
| `AI chatbot` | `assistant` | varies |
| `AI helper` | `assistant` | varies |
| `AI bot` | `assistant` | varies |
| `virtual assistant` | `assistant` | varies |
| `digital assistant` | `assistant` | varies |
| `chatbot` | `assistant` | varies |
| `You are an AI` | `You are an assistant` | varies |
| `As an AI` | `as an assistant` | varies |
| `I am an AI` | `I am an assistant` | varies |
| `I'm an AI` | `I'm an assistant` | varies |
| `an AI` (identity context) | `an assistant` | varies |
| `the AI` (identity context) | `the assistant` | varies |

**Note:** References to AI as a topic (e.g., "AI algorithm", "AI technology", "AI research") were preserved.

**Total identity replacements:** 45,037

### Task B: Personal Opinion Disclaimers

Removed rows where assistant messages contain disclaimers about lacking personal opinions:

- `I don't have personal opinions`
- `I don't have personal views/beliefs/preferences`
- `I'm not capable of having opinions`
- `I cannot have/hold/form opinions`

**Rows removed:** 83

### Task C: Alignment Prior Removal

Removed rows containing strong alignment signaling patterns in assistant responses:

#### Helpfulness Signaling

- `I'd be happy/glad/pleased/delighted to`
- `I would be happy to`
- `happy to help/assist`
- `I'm here to help/assist/support`
- `let me know if you need`
- `feel free to ask`
- `hope this helps`

#### Purpose/Design Statements

- `my purpose/goal/role is to help`
- `I aim/strive to help`
- `I am designed/programmed/trained to`
- `I'm designed to`

#### AI Identity Disclaimers

- `I'm an AI/language model`
- `I am an AI/language model`
- `I don't have emotions/feelings/consciousness`
- `I'm not human/a person`
- `as an AI/assistant, I`

**Rows removed:** 34,262

### Task D: Grammar Corrections

Fixed article agreement errors introduced by replacements:

| Error | Correction | Count |
|-------|------------|-------|
| `a assistant` | `an assistant` | 3,331 |
| `assistant assistant` | `assistant` | 76 |
| `a AI` | `an AI` | 23 |

**Total grammar fixes:** 3,430

## Source Distribution

The dataset maintains the original source distribution from smoltalk:

| Source | Percentage |
|--------|------------|
| smol-magpie-ultra | ~40% |
| numina-cot-100k | ~10% |
| smol-constraints | ~9% |
| apigen-80k | ~8% |
| everyday-conversations | ~7% |
| explore-instruct-rewrite | ~7% |
| smol-rewrite | ~6% |
| smol-summarize | ~6% |
| Other sources | ~7% |

## Methodology

All augmentations were performed using:

- **flpc** (Rust-based regex library) for efficient pattern matching
- Case-insensitive matching with word boundaries to avoid partial matches
- Negative lookahead patterns to preserve legitimate AI topic references
- Multi-pass processing to catch and correct introduced errors

### Preservation Rules

The following were intentionally preserved:

- AI references in user messages (roleplay scenarios, instructions)
- AI as a discussion topic ("AI technology", "AI research", "AI algorithms")
- Legitimate ellipses (...) and code formatting
- User-submitted content with original errors

## Files Generated

During augmentation, the following documentation files were created:

- `augmentation_report.json` - Statistics and counts for all operations
- `augmentation_samples.json` - Before/after examples for each pattern
- `removed_rows.jsonl` - Complete audit trail of all removed rows with reasons
- `article_fix_report.json` - Article correction statistics
- `final_cleanup_report.json` - Final pass correction statistics
- `scan_report.json` - Comprehensive error scan results

## Usage

```python
from datasets import load_dataset

dataset = load_dataset("EternalRecursion/smoltalk-no-refusals-augmented", split="train")
```

## Schema

Each row contains:

- `messages`: List of conversation turns with `role` and `content`
- `source`: Original source dataset identifier

## License

This dataset inherits the license from the original smoltalk dataset.

## Citation

If you use this dataset, please cite both this augmented version and the original smoltalk dataset.

## Changelog

- **v2**: Identity term normalization, alignment prior removal, grammar corrections
- **v1**: Initial refusal removal augmentation
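
## Appendix: Illustrative Sketch of the Augmentation Passes

The identity normalization (Task A), row removal (Task C), and article correction (Task D) steps can be approximated as follows. This is a minimal sketch and not the pipeline that produced the dataset: it uses Python's standard `re` module as a stand-in for flpc, the pattern lists are abbreviated, and the `TOPIC_WORDS` list and the function names `normalize_text` and `keep_row` are assumptions made for illustration. It also does not decide which message roles to rewrite or the order in which removal and normalization ran, since the card does not specify either.

```python
import re

# Hypothetical list of nouns that mark "AI" as a topic rather than an identity.
TOPIC_WORDS = r"(?:algorithms?|technology|technologies|research|systems?|models?)"

# Pass 1: identity terms. Longer phrases come before the bare "an/the AI" rule.
IDENTITY_RULES = [
    (re.compile(r"\bAI (?:chatbot|helper|bot)\b", re.IGNORECASE), "assistant"),
    (re.compile(r"\b(?:virtual|digital) assistant\b", re.IGNORECASE), "assistant"),
    (re.compile(r"\bchatbot\b", re.IGNORECASE), "assistant"),
    # The negative lookahead leaves "an AI algorithm", "the AI research", etc. alone.
    (
        re.compile(r"\b(an|the) AI\b(?!\s+" + TOPIC_WORDS + r"\b)", re.IGNORECASE),
        r"\1 assistant",
    ),
]

# Pass 2: repair errors that pass 1 can introduce.
DOUBLED = re.compile(r"\bassistant assistant\b", re.IGNORECASE)
BAD_ARTICLE = re.compile(r"\b([Aa]) (assistant|AI)\b")


def normalize_text(text: str) -> str:
    """Apply identity replacements, then fix doubled words and articles."""
    for pattern, replacement in IDENTITY_RULES:
        text = pattern.sub(replacement, text)
    text = DOUBLED.sub("assistant", text)
    # "a assistant" -> "an assistant", "A assistant" -> "An assistant"
    return BAD_ARTICLE.sub(lambda m: f"{m.group(1)}n {m.group(2)}", text)


# Abbreviated subset of the helpfulness-signaling patterns listed under Task C.
ALIGNMENT_PATTERNS = re.compile(
    r"\b(?:I'd be (?:happy|glad|pleased|delighted) to"
    r"|I would be happy to"
    r"|happy to (?:help|assist)"
    r"|feel free to ask"
    r"|hope this helps)\b",
    re.IGNORECASE,
)


def keep_row(row: dict) -> bool:
    """Return False if any assistant turn matches an alignment pattern."""
    return not any(
        turn["role"] == "assistant" and ALIGNMENT_PATTERNS.search(turn["content"])
        for turn in row["messages"]
    )
```

For example, `normalize_text("You are a chatbot.")` first yields `You are a assistant.` and the second pass repairs it to `You are an assistant.`, which mirrors why Task D was needed after Task A. A predicate shaped like `keep_row` could be passed to `Dataset.filter` from the `datasets` library to drop matching rows.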
Provider:
EternalRecursion