lightseekorg/kimi-mtp-dataset
Hugging Face · Updated 2026-03-31 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/lightseekorg/kimi-mtp-dataset
Description:
---
dataset_info:
  features:
  - name: conversations
    list:
    - name: from
      dtype: string
    - name: value
      dtype: string
  - name: source
    dtype: string
  splits:
  - name: train
    num_examples: 476904
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: apache-2.0
language:
- en
- zh
tags:
- speculative-decoding
- eagle3
- kimi-k2.5
- draft-model
- conversations
---
# Kimi-K2.5 Eagle3 Training Data
This dataset contains the instruction-following data used to train an **Eagle3 MTP draft model** for [Kimi-K2.5](https://huggingface.co/moonshotai/Kimi-K2.5) with [TorchSpec](https://github.com/torchspec-project/TorchSpec).
All responses were **regenerated by running Kimi-K2.5 via Engine** rather than taken from the original datasets. This is critical for speculative decoding training: the draft model must learn the exact token-level distribution of the target model it is accelerating.
The trained Eagle3 draft model is available at [lightseekorg/kimi-k2.5-eagle3](https://huggingface.co/lightseekorg/kimi-k2.5-eagle3). If you find this draft model useful, please give our project **TorchSpec** a 🌟 on [GitHub](https://github.com/torchspec-project/TorchSpec).
## Data source
Because of inference resource constraints, some source datasets were only partially regenerated, so the sample counts below can be smaller than the size of the original dataset. The mix draws on the following source datasets:
| Dataset | Source | # Samples |
|---------|--------|-----------|
| [mlabonne/open-perfectblend](https://huggingface.co/datasets/mlabonne/open-perfectblend) | `perfectblend` | 296,034 |
| [liuhaotian/LLaVA-Instruct-150K](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K) | `llava_instruct` | 123,102 |
| [HuggingFaceTB/smoltalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) | `smoltalk_cn` | 48,333 |
| [daviddjtzafon/continual-tool-kimi-k2.5](https://huggingface.co/datasets/daviddjtzafon/continual-tool-kimi-k2.5) | `continual_tool_kimi` | 4,370 |
| [crownelius/KimiK2.5-2000x-formatted](https://huggingface.co/datasets/crownelius/KimiK2.5-2000x-formatted) | `kimi_2000x` | 2,144 |
| [crownelius/Creative-Writing-KimiK2.5-Cleaned](https://huggingface.co/datasets/crownelius/Creative-Writing-KimiK2.5-Cleaned) | `creative_writing` | 1,393 |
| [DCAgent2/terminal_bench_2](https://huggingface.co/datasets/DCAgent2/terminal_bench_2__together_ai_moonshotai_Kimi-K2.5_20260203) | `dcagent` | 873 |
| [crownelius/Creative-Writing-Reasoning-KimiK2.5-600x](https://huggingface.co/datasets/crownelius/Creative-Writing-Reasoning-KimiK2.5-600x) | `creative_writing_reasoning` | 655 |
| **Total** | | **476,904** |
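As a quick sanity check on the mix, the per-source counts from the table can be summed and turned into shares. This is a minimal sketch: the counts are copied from the table above, while the names `SOURCE_COUNTS` and `source_shares` are illustrative and not part of the dataset or of TorchSpec.

```python
# Per-source sample counts, copied from the table above.
SOURCE_COUNTS = {
    "perfectblend": 296_034,
    "llava_instruct": 123_102,
    "smoltalk_cn": 48_333,
    "continual_tool_kimi": 4_370,
    "kimi_2000x": 2_144,
    "creative_writing": 1_393,
    "dcagent": 873,
    "creative_writing_reasoning": 655,
}


def source_shares(counts):
    """Return each source's fraction of the full mix (hypothetical helper)."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


TOTAL = sum(SOURCE_COUNTS.values())
SHARES = source_shares(SOURCE_COUNTS)

if __name__ == "__main__":
    for name, share in SHARES.items():
        print(f"{name:28s} {SOURCE_COUNTS[name]:>8,d} {share:6.1%}")
    print(f"{'total':28s} {TOTAL:>8,d}")
```

The sum matches the `num_examples` value in the card metadata (476,904), and `perfectblend` alone accounts for roughly 62% of the samples.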
## Data format
Each sample contains two fields:
- **`conversations`**: a list of turns, each with `from` (`human` / `gpt` / `system`) and `value` (string).
- **`source`**: the name of the source dataset (see table above).
```json
{
"conversations": [
{"from": "human", "value": "What is the capital of France?"},
{"from": "gpt", "value": "The capital of France is Paris."}
],
"source": "perfectblend"
}
```
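Many chat templates expect `role`/`content` messages rather than `from`/`value` turns. The sketch below shows one way to convert a sample; the target role names (`user`, `assistant`) and the helper `to_messages` are assumptions made for illustration, not something this dataset or TorchSpec prescribes.

```python
# Assumed mapping from the dataset's speaker tags to common chat role names.
ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}


def to_messages(sample):
    """Convert one sample's `conversations` into role/content messages."""
    messages = []
    for turn in sample["conversations"]:
        speaker = turn["from"]
        if speaker not in ROLE_MAP:
            # Fail loudly on tags other than the three documented ones.
            raise ValueError(f"unexpected speaker: {speaker!r}")
        messages.append({"role": ROLE_MAP[speaker], "content": turn["value"]})
    return messages


# The example record shown above.
sample = {
    "conversations": [
        {"from": "human", "value": "What is the capital of France?"},
        {"from": "gpt", "value": "The capital of France is Paris."},
    ],
    "source": "perfectblend",
}

messages = to_messages(sample)
```

The same function could be applied to each row after loading the `train` split with the Hugging Face `datasets` library, for example via `load_dataset("lightseekorg/kimi-mtp-dataset", split="train")`.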
Multimodal samples (`llava_instruct`) use the OpenAI vision format in the `value` field (a list of `image_url` and `text` objects), with local image paths replaced by public COCO URLs (`http://images.cocodataset.org/train2017/{filename}`).
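The following sketch illustrates what such a multimodal `value` might look like and how to read it defensively. Several things here are assumptions: the part layout follows the OpenAI chat format (`{"type": "image_url", "image_url": {"url": ...}}`), the filename is a placeholder, and because the card metadata declares `value` as a string, the list may be stored JSON-encoded, which `decode_value` tries to handle. Inspect a real row before relying on any of this.

```python
import json

# URL prefix taken from the dataset card.
COCO_BASE = "http://images.cocodataset.org/train2017/"


def vision_content(filename, text):
    """Build an OpenAI-style content list for one image plus one text part."""
    return [
        {"type": "image_url", "image_url": {"url": COCO_BASE + filename}},
        {"type": "text", "text": text},
    ]


def decode_value(value):
    """Return a list of content parts for either a plain or a multimodal turn.

    Assumption: multimodal turns may arrive as a JSON-encoded list because
    the schema types `value` as a string. Plain text is wrapped in one part.
    """
    if isinstance(value, list):
        return value
    try:
        parsed = json.loads(value)
    except json.JSONDecodeError:
        parsed = None
    if isinstance(parsed, list):
        return parsed
    return [{"type": "text", "text": value}]


# Placeholder filename, not taken from the dataset.
content = vision_content("000000123456.jpg", "What is in this image?")
encoded = json.dumps(content)
```

With this helper, `decode_value(encoded)` returns the original list, while `decode_value("hello")` yields a single text part, so downstream code can treat every turn uniformly.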
Function-call samples (`continual_tool_kimi`) use Kimi-K2.5's special token format for tool calls:
```
<|tool_calls_section_begin|><|tool_call_begin|>{id}<|tool_call_argument_begin|>{args_json}<|tool_call_end|><|tool_calls_section_end|>
```
Tool results are serialized as `human` turns with the prefix `## Return of {call_id}\n`.
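For inspection or filtering, the tool-call markup and the tool-result prefix described above can be parsed with regular expressions. This is a minimal sketch that assumes the arguments are valid JSON and that the call id contains no newline; the function names and the example id `functions.get_weather:0` are hypothetical.

```python
import json
import re

# Patterns built from the special tokens listed in the dataset card.
SECTION_RE = re.compile(
    r"<\|tool_calls_section_begin\|>(.*?)<\|tool_calls_section_end\|>",
    re.DOTALL,
)
CALL_RE = re.compile(
    r"<\|tool_call_begin\|>(.*?)<\|tool_call_argument_begin\|>(.*?)<\|tool_call_end\|>",
    re.DOTALL,
)
RETURN_RE = re.compile(r"## Return of ([^\n]+)\n")


def parse_tool_calls(text):
    """Extract every tool call in `text` as {"id": ..., "arguments": ...}."""
    calls = []
    for section in SECTION_RE.findall(text):
        for call_id, args_json in CALL_RE.findall(section):
            calls.append({"id": call_id.strip(), "arguments": json.loads(args_json)})
    return calls


def parse_tool_result(value):
    """Split a `human` turn into (call_id, payload); None if it has no prefix."""
    match = RETURN_RE.match(value)
    if match is None:
        return None
    return match.group(1), value[match.end():]


# Hypothetical example turn, not taken from the dataset.
gpt_turn = (
    "<|tool_calls_section_begin|><|tool_call_begin|>functions.get_weather:0"
    '<|tool_call_argument_begin|>{"city": "Paris"}'
    "<|tool_call_end|><|tool_calls_section_end|>"
)
calls = parse_tool_calls(gpt_turn)
result = parse_tool_result('## Return of functions.get_weather:0\n{"temp_c": 18}')
```

Because tool results are stored as `human` turns, `parse_tool_result` returning `None` is one way to tell a genuine user message apart from a serialized tool return.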
## Training
See [TorchSpec](https://github.com/torchspec-project/TorchSpec) for the full training recipe, configuration, and evaluation results.
## License
Apache 2.0. All source datasets are Apache 2.0 or MIT licensed.
Provider:
lightseekorg



