Kassadin88/Claude-Distillation-Dataset
Source: Hugging Face. Updated: 2026-04-03. Indexed: 2026-04-12.
Download link:
https://hf-mirror.com/datasets/Kassadin88/Claude-Distillation-Dataset
Resource description:
---
license: mit
task_categories:
- text-generation
- question-answering
language:
- en
tags:
- claude
- distillation
- reasoning
- instruction-tuning
size_categories:
- 10K<n<100K
---
# Claude Distillation Dataset
> **Note**: This dataset is a curated collection of open-source data. All data comes from publicly available datasets on Hugging Face. This repo only provides unified formatting and deduplication. **All credits go to the original data creators.**
## Data Sources (Open Source)
All data in this dataset is sourced from the following **open-source datasets** on Hugging Face:
| Source | Samples | Description |
|--------|---------|-------------|
| claude-opus-4.6-10000x | 9,633 | Claude Opus 4.6 multi-task data |
| claude-opus-4.6-high-reasoning-700x | 758 | High-quality reasoning data |
| Claude-Opus-4.6-Reasoning-887x | 887 | Reasoning task data |
| Claude-Opus-4.6-Reasoning-500x | 500 | Reasoning task data |
| Claude-Sonnet-X-Opus-4.6-Reasoning-small-500 | 524 | Sonnet & Opus mixed data |
| claude-4.5-opus-high-reasoning-250x | 250 | Claude 4.5 Opus reasoning data |
| **Total** | **12,525** | After deduplication (12,552 before removing 27 duplicates) |
## What This Repo Does
This repository only provides:
1. **Unified formatting**: Converted all data sources to a consistent messages format
2. **Deduplication**: Removed 27 duplicate samples
3. **Documentation**: Added data statistics and usage instructions
**I did NOT create any of the original data. Please refer to the original datasets for licensing and terms of use.**
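The formatting and deduplication steps listed above could be sketched roughly as follows. This is an illustration, not the procedure used to build this dataset: the `prompt`/`response` field names are hypothetical stand-ins for whatever the source datasets use, and exact-match hashing of role/content pairs is only one possible definition of "duplicate".

```python
import hashlib
import json


def to_messages(record):
    """Convert a source record to the unified {"messages": [...]} layout.

    Assumption: a source either already has a "messages" list or uses
    hypothetical "prompt"/"response" fields; real sources may differ.
    """
    if "messages" in record:
        return {"messages": list(record["messages"])}
    return {
        "messages": [
            {"role": "user", "content": record["prompt"]},
            {"role": "assistant", "content": record["response"]},
        ]
    }


def fingerprint(sample):
    """Hash the role/content pairs so identical conversations collide."""
    canonical = json.dumps(
        [[m["role"], m["content"].strip()] for m in sample["messages"]],
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def deduplicate(samples):
    """Keep the first occurrence of each distinct conversation."""
    seen = set()
    unique = []
    for sample in samples:
        key = fingerprint(sample)
        if key not in seen:
            seen.add(key)
            unique.append(sample)
    return unique


# Toy records: the first two describe the same conversation.
raw = [
    {"prompt": "What is 2 + 2?", "response": "4"},
    {"messages": [
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "4"},
    ]},
    {"prompt": "Name a prime number.", "response": "7"},
]
unified = [to_messages(r) for r in raw]
deduped = deduplicate(unified)
print(len(unified), "->", len(deduped))
```

Hashing a canonical JSON string keeps memory use small, since only the digests are stored rather than the full conversations.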
## Data Format
```json
{
"messages": [
{"role": "system", "content": "You are a helpful AI assistant."},
{"role": "user", "content": "Question content"},
{"role": "assistant", "content": "Answer with thinking process"}
]
}
```
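A quick structural check can catch malformed records before training. The sketch below assumes the four roles that appear in this card (`system`, `user`, `assistant`, `tool`), string-valued `content`, and a system message that comes first when present; `validate_sample` is a hypothetical helper, not part of the dataset.

```python
ALLOWED_ROLES = {"system", "user", "assistant", "tool"}


def validate_sample(sample):
    """Return a list of problems found in one sample (empty list means OK)."""
    problems = []
    messages = sample.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]
    for i, message in enumerate(messages):
        if not isinstance(message, dict):
            problems.append(f"message {i} is not an object")
            continue
        role = message.get("role")
        if role not in ALLOWED_ROLES:
            problems.append(f"message {i} has unexpected role {role!r}")
        if not isinstance(message.get("content"), str):
            problems.append(f"message {i} has non-string content")
    # Assumption: a system message, when present, is the first message.
    if any(isinstance(m, dict) and m.get("role") == "system" for m in messages[1:]):
        problems.append("system message is not the first message")
    return problems


good = {
    "messages": [
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Question content"},
        {"role": "assistant", "content": "Answer with thinking process"},
    ]
}
bad = {"messages": [{"role": "narrator", "content": "Hello"}]}

print(validate_sample(good))
print(validate_sample(bad))
```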
### Thinking Process Format
Assistant responses include the thinking process, delimited by special tokens, followed by the final answer.
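Separating the thinking section from the final answer might look like the sketch below. The card does not name the special tokens, so the `<think>` and `</think>` delimiters here are purely hypothetical placeholders; inspect a real sample to find the actual tokens and pass them in instead.

```python
def split_thinking(content, open_tag, close_tag):
    """Split an assistant message into (thinking, answer).

    Returns ("", content) unchanged when the delimiters are not found.
    """
    start = content.find(open_tag)
    if start == -1:
        return "", content
    end = content.find(close_tag, start + len(open_tag))
    if end == -1:
        return "", content
    thinking = content[start + len(open_tag):end].strip()
    answer = content[end + len(close_tag):].strip()
    return thinking, answer


# Hypothetical delimiters, for illustration only.
OPEN_TAG, CLOSE_TAG = "<think>", "</think>"

example = "<think>2 + 2 is basic addition.</think>\nThe answer is 4."
thinking, answer = split_thinking(example, OPEN_TAG, CLOSE_TAG)
print("thinking:", thinking)
print("answer:", answer)
```

Taking the delimiters as arguments keeps the helper usable once the real tokens are known, and lets it handle sources that mark thinking differently.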
## Statistics
- **Total samples**: 12,525 conversations (after deduplication)
- **Average length**: 3,504 characters per sample
- **Total characters**: ~44M
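The three figures above are mutually consistent: 12,525 samples at an average of 3,504 characters gives roughly 43.9M characters. The sketch below shows one way such statistics could be recomputed, assuming "length" means the summed character count of all message contents in a sample (the card does not define it).

```python
def sample_length(sample):
    """Character count of one sample: the sum over its message contents (assumed definition)."""
    return sum(len(m.get("content") or "") for m in sample["messages"])


def length_stats(samples):
    """Return sample count, total characters, and average characters per sample."""
    lengths = [sample_length(s) for s in samples]
    total = sum(lengths)
    return {
        "samples": len(lengths),
        "total_chars": total,
        "avg_chars": total / len(lengths) if lengths else 0.0,
    }


# Consistency check on the reported figures.
reported_total = 12_525 * 3_504
print(f"{reported_total:,}")  # 43,887,600, i.e. about 44M

# Toy data to show the helper in use.
toy = [
    {"messages": [{"role": "user", "content": "abc"},
                  {"role": "assistant", "content": "de"}]},
    {"messages": [{"role": "user", "content": "fghij"}]},
]
print(length_stats(toy))
```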
### System Message Distribution
- With non-empty system message: 10,993 (87.8%)
- With empty system message: 250 (2.0%)
- No system message: 1,282 (10.2%)
Most system messages contain: `"You are a helpful AI assistant."`
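A breakdown like the one above could be reproduced with a small classifier. This is a sketch under assumptions: the category names are hypothetical, and a whitespace-only system message is treated as empty, which may not match how the reported counts were produced.

```python
from collections import Counter


def system_category(sample):
    """Classify a sample as 'non_empty', 'empty', or 'missing' system message."""
    for message in sample["messages"]:
        if message["role"] == "system":
            has_text = bool((message.get("content") or "").strip())
            return "non_empty" if has_text else "empty"
    return "missing"


def system_distribution(samples):
    """Map each category to (count, percentage of all samples)."""
    counts = Counter(system_category(s) for s in samples)
    total = sum(counts.values())
    return {k: (n, round(100 * n / total, 1)) for k, n in counts.items()}


# Toy data covering all three cases.
toy_samples = [
    {"messages": [{"role": "system", "content": "You are a helpful AI assistant."},
                  {"role": "user", "content": "Hi"}]},
    {"messages": [{"role": "system", "content": ""},
                  {"role": "user", "content": "Hi"}]},
    {"messages": [{"role": "user", "content": "Hi"}]},
    {"messages": [{"role": "system", "content": "You are a helpful AI assistant."},
                  {"role": "user", "content": "Hello"}]},
]
print(system_distribution(toy_samples))
```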
### Other Statistics
- user messages: 12,581
- assistant messages: 12,669
- tool messages: 88
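Per-role message counts such as these can be tallied in a single pass over the data. A minimal sketch, using toy samples in place of the real dataset:

```python
from collections import Counter


def role_counts(samples):
    """Count messages per role across all samples."""
    return Counter(m["role"] for s in samples for m in s["messages"])


# Toy data: one plain exchange and one conversation that includes a tool message.
toy_conversations = [
    {"messages": [{"role": "user", "content": "Hi"},
                  {"role": "assistant", "content": "Hello"}]},
    {"messages": [{"role": "user", "content": "Look this up"},
                  {"role": "assistant", "content": "Calling a tool"},
                  {"role": "tool", "content": "result"},
                  {"role": "assistant", "content": "Here is the result"}]},
]
counts = role_counts(toy_conversations)
print(dict(counts))
```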
## Usage
```python
from datasets import load_dataset
dataset = load_dataset("Kassadin88/Claude-Distillation-Dataset")
```
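After loading, a common next step is to split each conversation into its context and the final assistant reply. The sketch below assumes the data is exposed as a single split named `train`, which this card does not state; `load_train_split` and `last_turn` are hypothetical helpers. The `datasets` import is deferred so the rest of the example runs offline.

```python
def load_train_split(repo_id="Kassadin88/Claude-Distillation-Dataset"):
    """Load the dataset; assumes the split is named "train"."""
    from datasets import load_dataset  # lazy import; needs `pip install datasets`
    return load_dataset(repo_id, split="train")


def last_turn(sample):
    """Return (context messages, final assistant reply) for one sample.

    The reply is None when the conversation does not end with an assistant message.
    """
    messages = sample["messages"]
    if not messages or messages[-1]["role"] != "assistant":
        return messages, None
    return messages[:-1], messages[-1]["content"]


# Typical use (requires network access):
#   train = load_train_split()
#   context, reply = last_turn(train[0])

# Offline demonstration on a record in the documented format.
record = {
    "messages": [
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Question content"},
        {"role": "assistant", "content": "Answer with thinking process"},
    ]
}
context, reply = last_turn(record)
print(len(context), "context messages")
print(reply)
```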
## License
This dataset is for research and educational purposes only. **Please follow the terms of use of the original data sources.**
## Acknowledgments
Thanks to all original data creators and providers. This is just a curated collection of their work.
Provider:
Kassadin88



