camel-ai/seta-sft-kimi-k2.5-thinking
Source: Hugging Face. Updated 2026-04-14; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/camel-ai/seta-sft-kimi-k2.5-thinking
Resource summary:
---
license: apache-2.0
language:
- en
task_categories:
- text-generation
tags:
- sft
- agent
- terminal
- tool-use
- qwen3
size_categories:
- 1K<n<10K
dataset_info:
  features:
  - name: task_id
    dtype: string
  - name: trial_uid
    dtype: string
  - name: reward
    dtype: float64
  - name: model
    dtype: string
  - name: conv_json_path
    dtype: string
  - name: provider_token_counts
    struct:
    - name: cached_tokens
      dtype: int64
    - name: completion_tokens
      dtype: int64
    - name: prompt_tokens
      dtype: int64
    - name: total_tokens
      dtype: int64
  - name: local_token_count
    dtype: int64
  - name: n_assistant_tokens
    dtype: int64
  - name: n_messages
    dtype: int64
  - name: raw_conv_json
    dtype: string
  - name: chat_template_str
    dtype: string
  - name: input_ids
    list: int32
  - name: loss_mask
    list: int64
  splits:
  - name: train
    num_bytes: 406433551
    num_examples: 1768
  download_size: 299197651
  dataset_size: 406433551
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---
# Seta SFT — Kimi K2.5 (thinking)
Supervised fine-tuning dataset distilled from 1,488 verified
agent rollouts of **moonshot/kimi-k2.5** on the
[seta-env-v2](https://huggingface.co/datasets/camel-ai/seta-env-v2)
terminal-agent benchmark, tokenized with the **Qwen/Qwen3-8B** chat template
and ready for AREAL `FSDPLMEngine` SFT training.
## Schema
Each row preserves the full per-trial diagnostic record from the build
pipeline so consumers can inspect, filter, or re-tokenize without rerunning
the rollouts:
| column | type | meaning |
|---|---|---|
| `task_id` | str | task name (without `_t<i>_<hash>` suffix) |
| `trial_uid` | str | full trial dir name |
| `reward` | float | passed/total tests from `verifier/ctrf.json` |
| `model` | str | rollout model id (e.g. `kimi-k2.5`) |
| `conv_json_path` | str | source conv json path on the build machine (provenance) |
| `provider_token_counts` | dict | LLM API counts: `prompt_tokens`, `completion_tokens`, `total_tokens`, `cached_tokens` |
| `local_token_count` | int | `len(input_ids)` under the Qwen3-8B tokenizer |
| `n_assistant_tokens` | int | `sum(loss_mask)` |
| `n_messages` | int | reconstructed conversation length (incl. final assistant turn) |
| `raw_conv_json` | str | full original conv json, serialized |
| `chat_template_str` | str | exact rendered template that was tokenized |
| `input_ids` | list[int] | tokenized full conversation |
| `loss_mask` | list[int] | same length, `1` on every assistant span (incl. `<think>` reasoning blocks, content, and `<tool_call>...</tool_call>`), `0` on system / user / tool-response / padding |
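The relationships between the derived columns in the table above can be checked per row. The following is a minimal sketch, not part of the build pipeline: `check_row` is a hypothetical helper, and the example row is synthetic, not real data.

```python
def check_row(row):
    """Return a list of human-readable problems found in one dataset row.

    Hypothetical helper: it only checks the invariants stated in the
    schema table (lengths and counts), nothing about the token content.
    """
    problems = []
    ids, mask = row["input_ids"], row["loss_mask"]
    if len(ids) != len(mask):
        problems.append("loss_mask and input_ids differ in length")
    if row["local_token_count"] != len(ids):
        problems.append("local_token_count != len(input_ids)")
    if row["n_assistant_tokens"] != sum(mask):
        problems.append("n_assistant_tokens != sum(loss_mask)")
    if any(m not in (0, 1) for m in mask):
        problems.append("loss_mask contains values other than 0 and 1")
    return problems


# Synthetic row for illustration only (token ids are made up).
example_row = {
    "input_ids": [11, 12, 13, 14, 15],
    "loss_mask": [0, 0, 1, 1, 1],
    "local_token_count": 5,
    "n_assistant_tokens": 3,
}
print(check_row(example_row))  # prints []
```

An empty list means the row is internally consistent; any returned string names the invariant that failed.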
For AREAL training, the loader
[`scripts/areal_sft/seta_sft_dataset.py::get_seta_sft_dataset`](https://github.com/camel-ai/terminal_agent/blob/areal_dev/scripts/areal_sft/seta_sft_dataset.py)
projects each row to `(input_ids, loss_mask)` only — the trainer's
`pad_sequences_to_tensors` collator consumes nothing else.
## Statistics
| metric | value |
|---|---|
| rows | **1,488** |
| total tokens | **15,688,749** |
| trainable tokens | **7,787,881** (49.6%) |
| tokens / row (mean) | 10,544 |
| tokens / row (median) | 8,884 |
| tokens / row (max) | 28,937 |
| reward = 1.0 rows | 1,112 (74.7%) |
| reward (mean) | 0.932 |
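Figures of this kind can be recomputed from the `input_ids`, `loss_mask`, and `reward` columns. The sketch below assumes that "total tokens" is the sum of `len(input_ids)` and "trainable tokens" is the sum of `loss_mask`, which matches the schema definitions; `summarize` is a hypothetical helper and the rows are toy data.

```python
from statistics import mean, median


def summarize(rows):
    """Aggregate token and reward statistics over an iterable of row dicts."""
    rows = list(rows)
    lengths = [len(r["input_ids"]) for r in rows]
    total = sum(lengths)
    trainable = sum(sum(r["loss_mask"]) for r in rows)
    return {
        "rows": len(rows),
        "total_tokens": total,
        "trainable_tokens": trainable,
        "trainable_fraction": trainable / total,
        "tokens_per_row_mean": mean(lengths),
        "tokens_per_row_median": median(lengths),
        "tokens_per_row_max": max(lengths),
        "reward_1_rows": sum(1 for r in rows if r["reward"] == 1.0),
        "reward_mean": mean(r["reward"] for r in rows),
    }


# Toy rows for illustration only.
toy_rows = [
    {"input_ids": [1, 2, 3, 4], "loss_mask": [0, 0, 1, 1], "reward": 1.0},
    {"input_ids": [1, 2], "loss_mask": [0, 1], "reward": 0.5},
    {"input_ids": [1, 2, 3, 4, 5, 6], "loss_mask": [0, 1, 1, 1, 0, 0], "reward": 1.0},
]
stats = summarize(toy_rows)
print(stats["total_tokens"], stats["trainable_tokens"], stats["trainable_fraction"])
```

On the real dataset the same function can be fed the loaded `train` split, since each row behaves like a dict.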
## Provenance & processing
1. **Rollouts.** moonshot/kimi-k2.5 was run via the TITO (Token-In-Token-Out)
agent on every task in `seta-env-v2` (1 trajectory per task, max 200
iterations, 28,672 max total tokens, 4,096 max completion tokens,
temperature 1.0, thinking enabled). Tasks that crashed or hit rate limits
were re-run via `eval.py --resume` until convergence (4 resume passes).
2. **Merging.** The resume chain was merged into one canonical dir via
`seta_env.utils.collect_results --merge --collect-trials move`. The merge
yielded 1,605 trial subdirs in total and 18 fully failed tasks; 1,488 trials
produced both `verifier/ctrf.json` and a conversation snapshot in
`CAMEL_LOG_DIR/`.
3. **Tokenization & loss-mask construction.**
`seta_env.utils.sft_utils.build_sft_dataset` walks each trial dir, picks
the largest `CAMEL_LOG_DIR/.../conv_*.json` (the most-complete snapshot),
reconstructs `request.messages + response.choices[0].message`, applies
the Qwen3-8B chat template, and builds the per-token loss mask via a
token-stream scan that marks every `<|im_start|>assistant ... <|im_end|>\n`
span as trainable. Boundary tokens (`<|im_start|>`, `<|im_end|>`,
`<think>`, `</think>`, `<tool_call>`, trailing `\n`) are all attributed
to the correct side of the mask — verified by per-trial inspection
artifacts under the build script's `--debug` mode.
4. **Filtering.** Trials whose `verifier/ctrf.json` was missing entirely
(errored or timed out before verification ran) were dropped: 117 of the
1,605 trial subdirs. Trials with any verified reward (including 0/N
rollouts) were kept; downstream consumers can filter further on the
`reward` column of this dataset.
5. **Thinking handling.** Each assistant turn includes its full `<think>...</think>` reasoning trace (extracted from the rollout's `reasoning_content` field), followed by the visible content and tool calls. A model trained on this data therefore learns to **emit chain-of-thought** before every response.
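The token-stream scan described in step 3 can be illustrated with a toy version. This is a sketch under stated assumptions, not the code in `seta_env.utils.sft_utils`: it works on string tokens rather than tokenizer ids, `build_loss_mask` is a hypothetical name, and the boundary handling (header `<|im_start|>assistant\n` masked out, closing `<|im_end|>` and its trailing `\n` trained on) is an assumed convention that the source does not spell out.

```python
IM_START, IM_END, NL = "<|im_start|>", "<|im_end|>", "\n"


def build_loss_mask(tokens):
    """Mark assistant spans as trainable (1) and everything else as 0.

    Assumption: the role header is not trained on, while the span body,
    the closing <|im_end|>, and one trailing newline are.
    """
    n = len(tokens)
    mask = [0] * n
    i = 0
    while i < n:
        is_assistant_header = (
            tokens[i] == IM_START
            and i + 2 < n
            and tokens[i + 1] == "assistant"
            and tokens[i + 2] == NL
        )
        if not is_assistant_header:
            i += 1
            continue
        j = i + 3  # first token after the assistant header
        while j < n and tokens[j] != IM_END:
            mask[j] = 1  # body: <think> block, content, <tool_call> spans
            j += 1
        if j < n:  # closing <|im_end|>
            mask[j] = 1
            j += 1
            if j < n and tokens[j] == NL:  # trailing newline
                mask[j] = 1
                j += 1
        i = j
    return mask


# Toy conversation: one user turn followed by one assistant turn.
toy_tokens = [
    IM_START, "user", NL, "ls", IM_END, NL,
    IM_START, "assistant", NL, "<think>", "plan", "</think>", "ok", IM_END, NL,
]
print(build_loss_mask(toy_tokens))
```

In this toy run the nine tokens of the user turn and the assistant header get `0`, and the six tokens from `<think>` through the trailing newline get `1`.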
## Companion datasets
- [camel-ai/seta-sft-kimi-k2.5-nothink](https://huggingface.co/datasets/camel-ai/seta-sft-kimi-k2.5-nothink) — same
rollouts, no-thinking variant. Useful for ablations.
## Loading
For inspection / custom processing (full row):
```python
from datasets import load_dataset
ds = load_dataset("camel-ai/seta-sft-kimi-k2.5-thinking", split="train")
print(ds[0]["task_id"], ds[0]["reward"], ds[0]["local_token_count"])
print(ds[0]["chat_template_str"][:500])
```
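Because rows with partial or zero reward are kept, consumers may want to select a subset before training. The following is a minimal sketch of such a filter; `keep_row`, its thresholds, and the task ids in the toy rows are hypothetical.

```python
def keep_row(row, min_reward=1.0, max_tokens=None):
    """Decide whether a row passes the reward and length thresholds."""
    if row["reward"] < min_reward:
        return False
    if max_tokens is not None and row["local_token_count"] > max_tokens:
        return False
    return True


# Toy rows for illustration only (task ids are made up).
toy_rows = [
    {"task_id": "task_a", "reward": 1.0, "local_token_count": 9000},
    {"task_id": "task_b", "reward": 0.5, "local_token_count": 7000},
    {"task_id": "task_c", "reward": 1.0, "local_token_count": 20000},
]
kept = [r["task_id"] for r in toy_rows if keep_row(r, max_tokens=16384)]
print(kept)  # prints ['task_a']
```

With the `datasets` library, the same predicate can be passed to `ds.filter(keep_row)` to keep only fully passing rollouts.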
For AREAL-trainer-ready loading (projected to `(input_ids, loss_mask)`):
```python
# Assumes scripts/areal_sft/ from the camel-ai/terminal_agent repo (areal_dev branch) is on PYTHONPATH.
from seta_sft_dataset import get_seta_sft_dataset
ds = get_seta_sft_dataset("camel-ai/seta-sft-kimi-k2.5-thinking", split="train", max_length=16384)
# Dataset({features: ['input_ids', 'loss_mask'], num_rows: ...})
```
## License
Apache 2.0 (matches the upstream `camel-ai/seta-env-v2` license).
Provider:
camel-ai



