AmanPriyanshu/tool-reasoning-sft-TOOLS-ToolMind-data-cleaned-rectified
收藏Hugging Face2026-03-03 更新2026-03-29 收录
下载链接:
https://hf-mirror.com/datasets/AmanPriyanshu/tool-reasoning-sft-TOOLS-ToolMind-data-cleaned-rectified
下载链接
链接失效反馈官方服务:
资源简介:
---
license: apache-2.0
task_categories:
- text-generation
language:
- en
tags:
- function-calling
- tool-calling
- reasoning
- tool-use
- multi-turn
size_categories:
- 100K<n<1M
---
# ToolMind — Cleaned & Rectified
~280K multi-turn tool-use conversations converted into a strict reasoning + tool-call format. Combines 128K synthetic trajectories generated via graph-based function chain sampling with 152K augmented open-source instances across 6 established datasets.
## Format
Each row contains a structured multi-turn conversation with explicit reasoning traces and validated tool calls.
### Message Roles
| Role | Content |
|---|---|
| `system` | Tool-use protocol + JSON tool schemas + domain-specific instructions |
| `user` | User queries and follow-ups |
| `reasoning` | `<think>…</think>` — model's step-by-step reasoning |
| `tool_call` | `<tool_call>{"name": "...", "arguments": {...}}</tool_call>` — function invocation |
| `tool_output` | `<tool_response>…</tool_response>` — environment result |
| `answer` | `<answer>…</answer>` — final response to user |
### Trajectory Structure
```
system → user → reasoning → tool_call → tool_output → reasoning → ... → answer
↑ |
└────────────────────── (multi-turn loop) ──────────────────────────┘
```
## Schema
Single Parquet file with zstd compression. Messages column is a JSON string; parse with `json.loads()`.
| Column | Type | Description |
|---|---|---|
| `messages` | string | Converted multi-turn conversation (JSON list of `{role, content}`) |
| `split` | string | Source dataset identifier |
## Splits
| Split | Rows | Description |
|---|---|---|
| `synthetic` | 127,863 | Graph-based function chain trajectories (20K+ tools) |
| `xlam-function-calling-60k` | 64,897 | xLAM function calling examples |
| `APIGen-MT-5k` | 23,546 | API generation multi-turn dialogues |
| `glaive-function-calling-v2` | 18,401 | Glaive function calling v2 |
| `When2Call` | 17,492 | Selective tool invocation examples |
| `BUTTONInstruct` | 14,659 | Multi-tool instruction following |
| `tau-train` | 12,879 | Tau-bench training trajectories |
## Conversion Details
- Tool schemas extracted from each row's `tools` array and embedded in the system prompt
- System prompt ordering: tool-use protocol + schemas first, then original domain instructions
- `think` tool calls absorbed into `reasoning` turns (not emitted as tool_calls)
- `<think>` blocks in assistant content extracted into dedicated `reasoning` turns
- Multiple tool_calls in a single assistant message split into individual turns
- `tool` role responses wrapped in `<tool_response>...</tool_response>`
- Assistant conversational replies converted to `answer` turns
- Bridge reasoning inserted for invalid transitions (e.g. `tool_output→tool_call`, `user→tool_call`)
- Consecutive reasoning turns merged
- 12 template variations each for bridge reasoning, post-tool reasoning, tail reasoning, and tail answers
- All role transitions validated against strict FSM
- All tool_call JSON validated for parseability and required keys (`name`, `arguments`)
- Empty `<think>`, `<tool_call>`, and `<answer>` blocks rejected
## Validated Transitions
```
system → user
user → reasoning
reasoning → tool_call OR reasoning → answer
tool_call → tool_output
tool_output → reasoning
answer → user (multi-turn loop)
```
## Usage
```py
import json, random
from collections import Counter, defaultdict
from huggingface_hub import hf_hub_download
import pyarrow.parquet as pq
REPO = "AmanPriyanshu/ToolMind-data-cleaned-rectified"
print("Downloading data.parquet...")
local = hf_hub_download(REPO, "data.parquet", repo_type="dataset")
t = pq.read_table(local)
print(f"Rows: {t.num_rows:,}")
print(f"Columns: {t.column_names}\n")
# ── splits ───────────────────────────────────────────────────────────────────
splits = [s.as_py() for s in t.column("split")]
sc = Counter(splits)
print("Splits:")
for s, c in sc.most_common():
print(f" {s:40s} {c:>8,} ({100*c/len(splits):.1f}%)")
print()
# ── role stats ───────────────────────────────────────────────────────────────
role_counts = Counter()
turn_lengths = []
for i in range(t.num_rows):
msgs = json.loads(t.column("messages")[i].as_py())
turn_lengths.append(len(msgs))
for m in msgs:
role_counts[m["role"]] += 1
print("Role distribution (across all turns):")
for r, c in role_counts.most_common():
print(f" {r:15s} {c:>10,}")
print()
print("Turn length stats:")
turn_lengths.sort()
n = len(turn_lengths)
print(f" min={turn_lengths[0]} median={turn_lengths[n//2]} mean={sum(turn_lengths)/n:.1f} max={turn_lengths[-1]}")
print(f" p25={turn_lengths[n//4]} p75={turn_lengths[3*n//4]} p95={turn_lengths[int(n*0.95)]}")
print()
# ── sample ───────────────────────────────────────────────────────────────────
idx = random.randint(0, t.num_rows - 1)
row = {col: t.column(col)[idx].as_py() for col in t.column_names}
row["messages"] = json.loads(row["messages"])
with open("ToolMind/sample.json", "w") as f:
json.dump(row, f, indent=2, ensure_ascii=False)
roles = [m["role"] for m in row["messages"]]
print(f"Sample (idx={idx}, split={row['split']}, turns={len(roles)}):")
print(f" Roles: {' -> '.join(roles[:20])}{'...' if len(roles)>20 else ''}")
print(f" Saved -> ToolMind/sample.json")
```
## License
Apache-2.0
提供机构:
AmanPriyanshu



