AmanPriyanshu/tool-reasoning-sft-RESEARCH-OpenSeeker-v1-Data
收藏Hugging Face2026-03-24 更新2026-03-29 收录
下载链接:
https://hf-mirror.com/datasets/AmanPriyanshu/tool-reasoning-sft-RESEARCH-OpenSeeker-v1-Data
下载链接
链接失效反馈官方服务:
资源简介:
---
license: mit
task_categories:
- text-generation
language:
- en
tags:
- reasoning
- tool-calling
- agentic
- multi-turn
- deep-search
- multi-step-reasoning
size_categories:
- 1K<n<10K
---
# OpenSeeker v1 — Cleaned & Rectified
7,189 multi-turn deep-search agent trajectories converted into a strict reasoning + tool-call format with validated FSM transitions.
## Origin
Derived from [OpenSeeker/OpenSeeker-v1-Data](https://huggingface.co/datasets/OpenSeeker/OpenSeeker-v1-Data).
OpenSeeker is an open-source search agent system that democratizes access to frontier search capabilities by fully open-sourcing its training data. Fine-tuned on Qwen3-30B-A3B-Thinking-2507 with 11.7K training examples, it achieves state-of-the-art performance on frontier search benchmarks: 48.4 on BrowseComp-ZH, 29.5 on BrowseComp, 74.0 on xbench-DeepSearch, and 59.4 on WideSearch — surpassing Tongyi DeepResearch on BrowseComp-ZH (48.4% vs. 46.7%).
📄 **Paper:** [OpenSeeker: Democratizing Frontier Search Agents by Fully Open-Sourcing Training Data](https://huggingface.co/datasets/OpenSeeker/OpenSeeker-v1-Data)
## Format
Each row contains a structured multi-turn conversation with explicit reasoning traces and validated tool calls.
### Message Roles
| Role | Content |
|---|---|
| `system` | Tool-use protocol + JSON tool schemas + QA agent instructions |
| `user` | Research question |
| `reasoning` | `<think>…</think>` — model's step-by-step reasoning |
| `tool_call` | `<tool_call>{"name": "...", "arguments": {...}}</tool_call>` — function invocation |
| `tool_output` | `<tool_response>…</tool_response>` — tool execution result |
| `answer` | `<answer>…</answer>` — final synthesized response |
### Trajectory Structure
```
system → user → reasoning → [tool_call → tool_output → reasoning →]* answer
```
Trajectories range from 4 to 604 turns, with 0–200 tool calls per row (avg 37.3).
## Schema
Single Parquet file with zstd compression.
| Column | Type | Description |
|---|---|---|
| `messages` | string | Converted conversation (JSON list of `{role, content}`) |
| `trajectory_correctness` | string | Source label: `Correct` or `Incorrect` |
## Tools
2 tools available per trajectory:
| Tool | Description |
|---|---|
| `search` | Batched web search — supply array of query strings, retrieves top 10 results per query |
| `visit` | Parse webpage(s) and return summary according to a goal |
## Filtering
From the original 11,677 rows:
- **All** rows with `trajectory correctness == "Correct"` retained (4,949)
- **1/3** of rows with `trajectory correctness != "Correct"` randomly sampled (2,242 of 6,728)
- Total input after filtering: 7,191 rows
- After conversion (2 failures): **7,189 rows**
## Correctness Distribution
| Label | Rows |
|---|---|
| `Correct` | 4,948 |
| `Incorrect` | 2,241 |
## Conversion Details
- Source data uses `<think>` tags in assistant messages and `<tool_calls_begin>/<tool_call>/<tool_calls_end>` wrapper format for tool calls — conversion is **decomposition** of compound assistant messages into separate FSM turns
- Assistant messages with `<think>` + `<tool_calls_begin>` split into separate `reasoning` + `tool_call` turns
- Assistant messages with `<think>` + `<answer>` split into separate `reasoning` + `answer` turns
- User messages containing `<tool_response>` mapped to `tool_output` turns
- Tool schemas extracted from `<tools>` XML block in system prompt, converted to clean JSON
- Single constant system prompt across all rows (tool-augmented QA agent with `search` and `visit`)
- Bridge reasoning synthesized only when FSM requires it (rare — source already has `<think>` blocks on 100% of assistant messages)
- 99.97% conversion rate (7,189/7,191); 2 failures from `tool_call→reasoning` transition violations in source
- Two validation layers: FSM transition check + content-tag non-empty check
## Statistics
| Metric | Value |
|---|---|
| Tool calls per row | min=0, max=200, avg=37.3 |
| Turns per row | min=4, max=604, avg=115.9 |
| Correct avg tool calls | ~28 |
| Incorrect avg tool calls | ~61 |
## Usage
```py
import json, random, re
from datasets import load_dataset
VALID_NEXT = {
"system": {"user"}, "user": {"reasoning"},
"reasoning": {"tool_call", "answer"}, "tool_call": {"tool_output"},
"tool_output": {"reasoning"}, "answer": {"user"},
}
ds = load_dataset("AmanPriyanshu/tool-reasoning-sft-OpenSeeker-v1-Data", split="train")
print(f"Loaded: {len(ds):,} rows\n")
idx = random.randint(0, len(ds) - 1)
row = ds[idx]
msgs = json.loads(row["messages"])
corr = row["trajectory_correctness"]
roles = [m["role"] for m in msgs]
tc = sum(1 for r in roles if r == "tool_call")
print(f"Row {idx} | correctness={corr} | {len(msgs)} turns | {tc} tool_calls")
print(f"Roles: {' -> '.join(roles[:20])}{'...' if len(roles)>20 else ''}\n")
# ── Validation 1: FSM transitions ────────────────────────────────────────────
bad = [(j, roles[j], roles[j+1]) for j in range(len(roles)-1)
if roles[j+1] not in VALID_NEXT.get(roles[j], set())]
if bad:
print(f"!! FSM VIOLATIONS: {len(bad)}")
for pos, a, b in bad[:5]:
print(f" [{pos}] {a} -> {b}")
else:
print("✓ FSM transitions: all valid")
# ── Validation 2: content tags ───────────────────────────────────────────────
tag_errors = []
for i, t in enumerate(msgs):
r, c = t["role"], t["content"]
if r == "reasoning":
if not re.search(r'<think>.+</think>', c, re.DOTALL):
tag_errors.append((i, r, "empty <think>"))
elif r == "tool_call":
if not re.search(r'<tool_call>.+</tool_call>', c, re.DOTALL):
tag_errors.append((i, r, "empty <tool_call>"))
else:
blob = c[c.find("{"):c.rfind("}") + 1]
try:
obj = json.loads(blob)
if "name" not in obj or "arguments" not in obj:
tag_errors.append((i, r, "missing name/arguments"))
except json.JSONDecodeError as e:
tag_errors.append((i, r, f"invalid JSON: {e}"))
elif r == "answer":
if not re.search(r'<answer>.+</answer>', c, re.DOTALL):
tag_errors.append((i, r, "empty <answer>"))
elif r == "tool_output":
if not re.search(r'<tool_response>.+</tool_response>', c, re.DOTALL):
tag_errors.append((i, r, "empty <tool_response>"))
if tag_errors:
print(f"!! TAG ERRORS: {len(tag_errors)}")
for pos, role, err in tag_errors[:5]:
print(f" [{pos}] {role}: {err}")
else:
print("✓ Content tags: all valid")
# ── Validation 3: structure checks ───────────────────────────────────────────
checks = []
if roles[0] != "system":
checks.append("first role is not system")
if roles[1] != "user":
checks.append("second role is not user")
if roles[-1] != "answer":
checks.append(f"last role is {roles[-1]}, expected answer")
if any(roles[i] == roles[i+1] for i in range(len(roles)-1)):
dupes = [(i, roles[i]) for i in range(len(roles)-1) if roles[i] == roles[i+1]]
checks.append(f"consecutive same-role at {dupes[0]}")
if checks:
print(f"!! STRUCTURE ISSUES: {len(checks)}")
for c in checks:
print(f" {c}")
else:
print("✓ Structure: system→user→...→answer, no consecutive duplicates")
# ── Print turns ──────────────────────────────────────────────────────────────
print(f"\n{'='*70}")
print(f"FULL CONVERSATION ({len(msgs)} turns)")
print(f"{'='*70}\n")
for i, m in enumerate(msgs):
content = m["content"]
if m["role"] == "system":
content = content[:200] + "..."
elif len(content) > 300:
content = content[:300] + "..."
print(f"[{i}] {m['role']}:\n{content}\n")
```
---
许可证:MIT
任务类别:
- 文本生成
语言:
- 英语
标签:
- 推理
- 工具调用
- 智能体
- 多轮对话
- 深度搜索
- 多步推理
规模类别:
- 1K<n<10K
---
# OpenSeeker v1 — 清理校正版
7189条多轮深度搜索智能体轨迹被转换为严格的推理+工具调用格式,并经过有限状态机(Finite State Machine, FSM)转换验证。
## 起源
本数据集源自[OpenSeeker/OpenSeeker-v1-Data](https://huggingface.co/datasets/OpenSeeker/OpenSeeker-v1-Data)。
OpenSeeker是一款开源搜索智能体系统,通过完全开源其训练数据,让前沿搜索能力的普惠获取成为可能。该模型基于Qwen3-30B-A3B-Thinking-2507微调,使用11.7K条训练样本,在前沿搜索基准测试中取得了顶尖性能:BrowseComp-ZH上48.4、BrowseComp上29.5、xbench-DeepSearch上74.0、WideSearch上59.4——在BrowseComp-ZH上超越了通义深度研究(Tongyi DeepResearch)(48.4% vs 46.7%)。
📄 **论文:** [OpenSeeker: Democratizing Frontier Search Agents by Fully Open-Sourcing Training Data](https://huggingface.co/datasets/OpenSeeker/OpenSeeker-v1-Data)
## 数据格式
每一行包含结构化的多轮对话,带有明确的推理轨迹与经过验证的工具调用。
### 消息角色
| 角色 | 内容说明 |
|---|---|
| `system` | 工具使用协议 + JSON工具schema + 问答智能体指令 |
| `user` | 研究问题 |
| `reasoning` | `<think>…</think>` —— 模型的分步推理过程 |
| `tool_call` | `<tool_call>{"name": "...", "arguments": {...}}</tool_call>` —— 函数调用 |
| `tool_output` | `<tool_response>…</tool_response>` —— 工具执行结果 |
| `answer` | `<answer>…</answer>` —— 最终合成的响应 |
### 轨迹结构
system → user → reasoning → [tool_call → tool_output → reasoning →]* answer
轨迹的轮次范围为4至604轮,每一行包含0至200次工具调用(平均37.3次)。
## 数据结构
单个Parquet文件,采用zstd压缩算法。
| 列名 | 数据类型 | 说明 |
|---|---|---|
| `messages` | 字符串 | 转换后的对话(`{role, content}`格式的JSON列表) |
| `trajectory_correctness` | 字符串 | 源数据标签:`Correct`(正确)或`Incorrect`(错误) |
## 可用工具
每条轨迹支持2种工具:
| 工具名称 | 功能说明 |
|---|---|
| `search` | 批量网页搜索——传入查询字符串数组,为每个查询返回前10条搜索结果 |
| `visit` | 解析指定网页,并根据目标任务生成网页摘要 |
## 数据筛选流程
从原始11677条数据中:
- 保留所有`trajectory correctness == "Correct"`的行(共4949条)
- 从`trajectory correctness != "Correct"`的行中随机采样1/3(6728条中采样2242条)
- 筛选后总输入行数为7191
- 经过格式转换(2次失败)后最终得到**7189条**数据
## 正确性分布
| 标签 | 数据行数 |
|---|---|
| `Correct`(正确) | 4948 |
| `Incorrect`(错误) | 2241 |
## 转换细节
源数据在助手消息中使用`<think>`标签,工具调用采用`<tool_calls_begin>/<tool_call>/<tool_calls_end>`包装格式。本次转换工作为将复合助手消息分解为独立的有限状态机(FSM)轮次:
1. 将包含`<think>`+`<tool_calls_begin>`的助手消息拆分为独立的`reasoning`与`tool_call`轮次
2. 将包含`<think>`+`<answer>`的助手消息拆分为独立的`reasoning`与`answer`轮次
3. 将包含`<tool_response>`的用户消息映射为`tool_output`轮次
4. 从系统提示的`<tools>`XML块中提取工具schema,并转换为规范的JSON格式
5. 所有样本使用统一的系统提示(集成`search`与`visit`工具的增强型问答智能体)
6. 仅在有限状态机逻辑需要时合成桥接推理(场景极少——源数据100%的助手消息已自带`<think>`块)
7. 转换率达99.97%(7189/7191);2次失败源于源数据中`tool_call→reasoning`的转换违规
8. 包含两层验证机制:有限状态机转换检查与内容标签非空检查
## 统计指标
| 指标 | 数值 |
|---|---|
| 每条样本的工具调用次数 | 最小值0,最大值200,平均值37.3 |
| 每条样本的对话轮次 | 最小值4,最大值604,平均值115.9 |
| 正确轨迹平均工具调用次数 | 约28次 |
| 错误轨迹平均工具调用次数 | 约61次 |
## 使用示例
py
import json, random, re
from datasets import load_dataset
# 定义合法的角色转移规则
VALID_NEXT = {
"system": {"user"}, "user": {"reasoning"},
"reasoning": {"tool_call", "answer"}, "tool_call": {"tool_output"},
"tool_output": {"reasoning"}, "answer": {"user"},
}
# 加载训练集
ds = load_dataset("AmanPriyanshu/tool-reasoning-sft-OpenSeeker-v1-Data", split="train")
print(f"已加载:{len(ds):,} 条数据
")
# 随机选取一条样本
idx = random.randint(0, len(ds) - 1)
row = ds[idx]
msgs = json.loads(row["messages"])
corr = row["trajectory_correctness"]
roles = [m["role"] for m in msgs]
tc = sum(1 for r in roles if r == "tool_call")
print(f"样本 {idx} | 正确性标签={corr} | {len(msgs)} 轮对话 | {tc} 次工具调用")
print(f"对话角色顺序:{' -> '.join(roles[:20])}{'...' if len(roles)>20 else ''}
")
# ── 验证1:有限状态机转移检查 ────────────────────────────────────────────
bad = [(j, roles[j], roles[j+1]) for j in range(len(roles)-1)
if roles[j+1] not in VALID_NEXT.get(roles[j], set())]
if bad:
print(f"!! 发现有限状态机违规:{len(bad)} 处")
for pos, a, b in bad[:5]:
print(f" [{pos}] {a} -> {b}")
else:
print("✓ 有限状态机转移:全部合法")
# ── 验证2:内容标签检查 ───────────────────────────────────────────────
tag_errors = []
for i, t in enumerate(msgs):
r, c = t["role"], t["content"]
if r == "reasoning":
if not re.search(r'<think>.+</think>', c, re.DOTALL):
tag_errors.append((i, r, "缺少<think>标签或内容为空"))
elif r == "tool_call":
if not re.search(r'<tool_call>.+</tool_call>', c, re.DOTALL):
tag_errors.append((i, r, "缺少<tool_call>标签或内容为空"))
else:
blob = c[c.find("{"):c.rfind("}") + 1]
try:
obj = json.loads(blob)
if "name" not in obj or "arguments" not in obj:
tag_errors.append((i, r, "缺少工具名称或参数字段"))
except json.JSONDecodeError as e:
tag_errors.append((i, r, f"JSON格式无效:{e}"))
elif r == "answer":
if not re.search(r'<answer>.+</answer>', c, re.DOTALL):
tag_errors.append((i, r, "缺少<answer>标签或内容为空"))
elif r == "tool_output":
if not re.search(r'<tool_response>.+</tool_response>', c, re.DOTALL):
tag_errors.append((i, r, "缺少<tool_response>标签或内容为空"))
if tag_errors:
print(f"!! 发现内容标签错误:{len(tag_errors)} 处")
for pos, role, err in tag_errors[:5]:
print(f" [{pos}] {role}: {err}")
else:
print("✓ 内容标签:全部合法")
# ── 验证3:对话结构检查 ───────────────────────────────────────────
checks = []
if roles[0] != "system":
checks.append("首个角色不是system")
if roles[1] != "user":
checks.append("第二个角色不是user")
if roles[-1] != "answer":
checks.append(f"最后一个角色为{roles[-1]},预期应为answer")
if any(roles[i] == roles[i+1] for i in range(len(roles)-1)):
dupes = [(i, roles[i]) for i in range(len(roles)-1) if roles[i] == roles[i+1]]
checks.append(f"存在连续重复角色,首现位置:{dupes[0]}")
if checks:
print(f"!! 发现结构问题:{len(checks)} 处")
for c in checks:
print(f" {c}")
else:
print("✓ 对话结构:符合system→user→...→answer格式,无连续重复角色")
# ── 打印完整对话 ──────────────────────────────────────────────────────────────
print(f"
{'='*70}")
print(f"完整对话(共{len(msgs)}轮)")
print(f"{'='*70}
")
for i, m in enumerate(msgs):
content = m["content"]
if m["role"] == "system":
content = content[:200] + "..."
elif len(content) > 300:
content = content[:300] + "..."
print(f"[{i}] {m['role']}:
{content}
")
提供机构:
AmanPriyanshu



