
Kassadin88/Claude-Distillation-Dataset

Hugging Face · Updated 2026-04-03 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/Kassadin88/Claude-Distillation-Dataset
Resource description:
---
license: mit
task_categories:
- text-generation
- question-answering
language:
- en
tags:
- claude
- distillation
- reasoning
- instruction-tuning
size_categories:
- 10K<n<100K
---

# Claude Distillation Dataset

> **Note**: This dataset is a curated collection of open-source data. All data comes from publicly available datasets on Hugging Face. This repo only provides unified formatting and deduplication. **All credits go to the original data creators.**

## Data Sources (Open Source)

All data in this dataset is sourced from the following **open-source datasets** on Hugging Face:

| Source | Samples | Description |
|--------|---------|-------------|
| claude-opus-4.6-10000x | 9,633 | Claude Opus 4.6 multi-task data |
| claude-opus-4.6-high-reasoning-700x | 758 | High-quality reasoning data |
| Claude-Opus-4.6-Reasoning-887x | 887 | Reasoning task data |
| Claude-Opus-4.6-Reasoning-500x | 500 | Reasoning task data |
| Claude-Sonnet-X-Opus-4.6-Reasoning-small-500 | 524 | Sonnet & Opus mixed data |
| claude-4.5-opus-high-reasoning-250x | 250 | Claude 4.5 Opus reasoning data |
| **Total** | **12,525** | (after deduplication) |

## What This Repo Does

This repository only provides:

1. **Unified formatting**: Converted all data sources to a consistent messages format
2. **Deduplication**: Removed 27 duplicate samples
3. **Documentation**: Added data statistics and usage instructions

**I did NOT create any of the original data. Please refer to the original datasets for licensing and terms of use.**

## Data Format

```json
{
  "messages": [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Question content"},
    {"role": "assistant", "content": "Answer with thinking process"}
  ]
}
```

### Thinking Process Format

Assistant responses include a thinking process, with special tokens marking the thinking section, followed by the final answer.
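
To make the messages format described above concrete, here is a minimal sketch of a structural check for a single record. The function name `validate_record` and the rule that every `content` value is a string are assumptions made for illustration; the card itself only shows the `messages` layout and mentions the `system`, `user`, `assistant`, and `tool` roles.

```python
from typing import Any

# Roles mentioned in the dataset card; treating this set as exhaustive is an assumption.
ALLOWED_ROLES = {"system", "user", "assistant", "tool"}


def validate_record(record: dict[str, Any]) -> list[str]:
    """Return a list of problems found in one record (empty list = looks well-formed)."""
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]

    problems: list[str] = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i} is not an object")
            continue
        role = msg.get("role")
        if role not in ALLOWED_ROLES:
            problems.append(f"message {i} has unexpected role {role!r}")
        # Assumption: content is always a plain string, as in the card's example.
        if not isinstance(msg.get("content"), str):
            problems.append(f"message {i} has non-string content")
    return problems


example = {
    "messages": [
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Question content"},
        {"role": "assistant", "content": "Answer with thinking process"},
    ]
}
print(validate_record(example))  # an empty list means no problems were found
```

A check like this can be run over a sample of records before training, since the sources were merged from several differently formatted datasets.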

## Statistics

- **Total samples**: 12,525 conversations (after deduplication)
- **Average length**: 3,504 characters per sample
- **Total characters**: ~44M

### System Message Distribution

- With non-empty system message: 10,993 (87.8%)
- With empty system message: 250 (2.0%)
- No system message: 1,282 (10.2%)

Most system messages contain: `"You are a helpful AI assistant."`

### Other Statistics

- user messages: 12,581
- assistant messages: 12,669
- tool messages: 88

## Usage

```python
from datasets import load_dataset

dataset = load_dataset("Kassadin88/Claude-Distillation-Dataset")
```

## License

This dataset is for research and educational purposes only. **Please follow the terms of use of the original data sources.**

## Acknowledgments

Thanks to all original data creators and providers. This is just a curated collection of their work.
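
The role counts and system message distribution reported in the Statistics section could be recomputed along the following lines. This is a sketch, not the script the maintainer used: the helper names `system_message_category` and `summarize` are hypothetical, and treating a whitespace-only system message as "empty" is an assumption.

```python
from collections import Counter


def system_message_category(messages):
    """Classify a conversation as 'non_empty', 'empty', or 'none' by its system message."""
    system_contents = [
        m.get("content") or "" for m in messages if m.get("role") == "system"
    ]
    if not system_contents:
        return "none"
    # Assumption: whitespace-only system messages count as empty.
    if any(content.strip() for content in system_contents):
        return "non_empty"
    return "empty"


def summarize(records):
    """Count messages per role and conversations per system-message category."""
    roles = Counter()
    system = Counter()
    for record in records:
        messages = record["messages"]
        roles.update(m["role"] for m in messages)
        system[system_message_category(messages)] += 1
    return roles, system


# Tiny in-memory sample standing in for the real dataset.
sample = [
    {"messages": [
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Question content"},
        {"role": "assistant", "content": "Answer with thinking process"},
    ]},
    {"messages": [
        {"role": "user", "content": "Question content"},
        {"role": "assistant", "content": "Answer with thinking process"},
    ]},
]
roles, system = summarize(sample)
print(dict(roles), dict(system))
```

With the real data, `summarize` can be passed the loaded split, for example `summarize(dataset["train"])`; the split name `train` is an assumption, since the card does not state which splits exist.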
Provided by:
Kassadin88