five

AmanPriyanshu/tool-reasoning-sft-TOOLS-ToolMind-data-cleaned-rectified

收藏
Hugging Face2026-03-03 更新2026-03-29 收录
下载链接:
https://hf-mirror.com/datasets/AmanPriyanshu/tool-reasoning-sft-TOOLS-ToolMind-data-cleaned-rectified
下载链接
链接失效反馈
官方服务:
资源简介:
--- license: apache-2.0 task_categories: - text-generation language: - en tags: - function-calling - tool-calling - reasoning - tool-use - multi-turn size_categories: - 100K<n<1M --- # ToolMind — Cleaned & Rectified ~280K multi-turn tool-use conversations converted into a strict reasoning + tool-call format. Combines 128K synthetic trajectories generated via graph-based function chain sampling with 152K augmented open-source instances across 6 established datasets. ## Format Each row contains a structured multi-turn conversation with explicit reasoning traces and validated tool calls. ### Message Roles | Role | Content | |---|---| | `system` | Tool-use protocol + JSON tool schemas + domain-specific instructions | | `user` | User queries and follow-ups | | `reasoning` | `<think>…</think>` — model's step-by-step reasoning | | `tool_call` | `<tool_call>{"name": "...", "arguments": {...}}</tool_call>` — function invocation | | `tool_output` | `<tool_response>…</tool_response>` — environment result | | `answer` | `<answer>…</answer>` — final response to user | ### Trajectory Structure ``` system → user → reasoning → tool_call → tool_output → reasoning → ... → answer ↑ | └────────────────────── (multi-turn loop) ──────────────────────────┘ ``` ## Schema Single Parquet file with zstd compression. Messages column is a JSON string; parse with `json.loads()`. | Column | Type | Description | |---|---|---| | `messages` | string | Converted multi-turn conversation (JSON list of `{role, content}`) | | `split` | string | Source dataset identifier | ## Splits | Split | Rows | Description | |---|---|---| | `synthetic` | 127,863 | Graph-based function chain trajectories (20K+ tools) | | `xlam-function-calling-60k` | 64,897 | xLAM function calling examples | | `APIGen-MT-5k` | 23,546 | API generation multi-turn dialogues | | `glaive-function-calling-v2` | 18,401 | Glaive function calling v2 | | `When2Call` | 17,492 | Selective tool invocation examples | | `BUTTONInstruct` | 14,659 | Multi-tool instruction following | | `tau-train` | 12,879 | Tau-bench training trajectories | ## Conversion Details - Tool schemas extracted from each row's `tools` array and embedded in the system prompt - System prompt ordering: tool-use protocol + schemas first, then original domain instructions - `think` tool calls absorbed into `reasoning` turns (not emitted as tool_calls) - `<think>` blocks in assistant content extracted into dedicated `reasoning` turns - Multiple tool_calls in a single assistant message split into individual turns - `tool` role responses wrapped in `<tool_response>...</tool_response>` - Assistant conversational replies converted to `answer` turns - Bridge reasoning inserted for invalid transitions (e.g. `tool_output→tool_call`, `user→tool_call`) - Consecutive reasoning turns merged - 12 template variations each for bridge reasoning, post-tool reasoning, tail reasoning, and tail answers - All role transitions validated against strict FSM - All tool_call JSON validated for parseability and required keys (`name`, `arguments`) - Empty `<think>`, `<tool_call>`, and `<answer>` blocks rejected ## Validated Transitions ``` system → user user → reasoning reasoning → tool_call OR reasoning → answer tool_call → tool_output tool_output → reasoning answer → user (multi-turn loop) ``` ## Usage ```py import json, random from collections import Counter, defaultdict from huggingface_hub import hf_hub_download import pyarrow.parquet as pq REPO = "AmanPriyanshu/ToolMind-data-cleaned-rectified" print("Downloading data.parquet...") local = hf_hub_download(REPO, "data.parquet", repo_type="dataset") t = pq.read_table(local) print(f"Rows: {t.num_rows:,}") print(f"Columns: {t.column_names}\n") # ── splits ─────────────────────────────────────────────────────────────────── splits = [s.as_py() for s in t.column("split")] sc = Counter(splits) print("Splits:") for s, c in sc.most_common(): print(f" {s:40s} {c:>8,} ({100*c/len(splits):.1f}%)") print() # ── role stats ─────────────────────────────────────────────────────────────── role_counts = Counter() turn_lengths = [] for i in range(t.num_rows): msgs = json.loads(t.column("messages")[i].as_py()) turn_lengths.append(len(msgs)) for m in msgs: role_counts[m["role"]] += 1 print("Role distribution (across all turns):") for r, c in role_counts.most_common(): print(f" {r:15s} {c:>10,}") print() print("Turn length stats:") turn_lengths.sort() n = len(turn_lengths) print(f" min={turn_lengths[0]} median={turn_lengths[n//2]} mean={sum(turn_lengths)/n:.1f} max={turn_lengths[-1]}") print(f" p25={turn_lengths[n//4]} p75={turn_lengths[3*n//4]} p95={turn_lengths[int(n*0.95)]}") print() # ── sample ─────────────────────────────────────────────────────────────────── idx = random.randint(0, t.num_rows - 1) row = {col: t.column(col)[idx].as_py() for col in t.column_names} row["messages"] = json.loads(row["messages"]) with open("ToolMind/sample.json", "w") as f: json.dump(row, f, indent=2, ensure_ascii=False) roles = [m["role"] for m in row["messages"]] print(f"Sample (idx={idx}, split={row['split']}, turns={len(roles)}):") print(f" Roles: {' -> '.join(roles[:20])}{'...' if len(roles)>20 else ''}") print(f" Saved -> ToolMind/sample.json") ``` ## License Apache-2.0
提供机构:
AmanPriyanshu
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作