
AmanPriyanshu/tool-reasoning-sft-RESEARCH-OpenSeeker-v1-Data

Source: Hugging Face (updated 2026-03-24, indexed 2026-03-29)
Download link:
https://hf-mirror.com/datasets/AmanPriyanshu/tool-reasoning-sft-RESEARCH-OpenSeeker-v1-Data
Description:
---
license: mit
task_categories:
- text-generation
language:
- en
tags:
- reasoning
- tool-calling
- agentic
- multi-turn
- deep-search
- multi-step-reasoning
size_categories:
- 1K<n<10K
---

# OpenSeeker v1: Cleaned & Rectified

7,189 multi-turn deep-search agent trajectories converted into a strict reasoning + tool-call format with validated finite-state machine (FSM) transitions.

## Origin

Derived from [OpenSeeker/OpenSeeker-v1-Data](https://huggingface.co/datasets/OpenSeeker/OpenSeeker-v1-Data).

OpenSeeker is an open-source search agent system that democratizes access to frontier search capabilities by fully open-sourcing its training data. Fine-tuned on Qwen3-30B-A3B-Thinking-2507 with 11.7K training examples, it achieves state-of-the-art performance on frontier search benchmarks: 48.4 on BrowseComp-ZH, 29.5 on BrowseComp, 74.0 on xbench-DeepSearch, and 59.4 on WideSearch, surpassing Tongyi DeepResearch on BrowseComp-ZH (48.4% vs. 46.7%).

📄 **Paper:** [OpenSeeker: Democratizing Frontier Search Agents by Fully Open-Sourcing Training Data](https://huggingface.co/datasets/OpenSeeker/OpenSeeker-v1-Data)

## Format

Each row contains a structured multi-turn conversation with explicit reasoning traces and validated tool calls.

### Message Roles

| Role | Content |
|---|---|
| `system` | Tool-use protocol + JSON tool schemas + QA agent instructions |
| `user` | Research question |
| `reasoning` | `<think>…</think>`: the model's step-by-step reasoning |
| `tool_call` | `<tool_call>{"name": "...", "arguments": {...}}</tool_call>`: function invocation |
| `tool_output` | `<tool_response>…</tool_response>`: tool execution result |
| `answer` | `<answer>…</answer>`: final synthesized response |

### Trajectory Structure

```
system → user → reasoning → [tool_call → tool_output → reasoning →]* answer
```

Trajectories range from 4 to 604 turns, with 0–200 tool calls per row (avg 37.3).

## Schema

Single Parquet file with zstd compression.

| Column | Type | Description |
|---|---|---|
| `messages` | string | Converted conversation (JSON list of `{role, content}`) |
| `trajectory_correctness` | string | Source label: `Correct` or `Incorrect` |

## Tools

2 tools available per trajectory:

| Tool | Description |
|---|---|
| `search` | Batched web search: supply an array of query strings; retrieves the top 10 results per query |
| `visit` | Parse webpage(s) and return a summary according to a goal |

## Filtering

From the original 11,677 rows:

- **All** rows with `trajectory correctness == "Correct"` retained (4,949)
- **1/3** of rows with `trajectory correctness != "Correct"` randomly sampled (2,242 of 6,728)
- Total input after filtering: 7,191 rows
- After conversion (2 failures): **7,189 rows**

## Correctness Distribution

| Label | Rows |
|---|---|
| `Correct` | 4,948 |
| `Incorrect` | 2,241 |

## Conversion Details

- Source data uses `<think>` tags in assistant messages and a `<tool_calls_begin>/<tool_call>/<tool_calls_end>` wrapper format for tool calls, so conversion is a **decomposition** of compound assistant messages into separate FSM turns
- Assistant messages with `<think>` + `<tool_calls_begin>` are split into separate `reasoning` + `tool_call` turns
- Assistant messages with `<think>` + `<answer>` are split into separate `reasoning` + `answer` turns
- User messages containing `<tool_response>` are mapped to `tool_output` turns
- Tool schemas are extracted from the `<tools>` XML block in the system prompt and converted to clean JSON
- Single constant system prompt across all rows (tool-augmented QA agent with `search` and `visit`)
- Bridge reasoning is synthesized only when the FSM requires it (rare: the source already has `<think>` blocks on 100% of assistant messages)
- 99.97% conversion rate (7,189/7,191); the 2 failures come from `tool_call→reasoning` transition violations in the source
- Two validation layers: FSM transition check + content-tag non-empty check

## Statistics

| Metric | Value |
|---|---|
| Tool calls per row | min=0, max=200, avg=37.3 |
| Turns per row | min=4, max=604, avg=115.9 |
| Correct avg tool calls | ~28 |
| Incorrect avg tool calls | ~61 |

## Usage

```py
import json, random, re
from datasets import load_dataset

# Allowed role transitions (the trajectory FSM).
VALID_NEXT = {
    "system": {"user"},
    "user": {"reasoning"},
    "reasoning": {"tool_call", "answer"},
    "tool_call": {"tool_output"},
    "tool_output": {"reasoning"},
    "answer": {"user"},
}

ds = load_dataset("AmanPriyanshu/tool-reasoning-sft-RESEARCH-OpenSeeker-v1-Data", split="train")
print(f"Loaded: {len(ds):,} rows\n")

# Pick one random row and decode its JSON-encoded conversation.
idx = random.randint(0, len(ds) - 1)
row = ds[idx]
msgs = json.loads(row["messages"])
corr = row["trajectory_correctness"]
roles = [m["role"] for m in msgs]
tc = sum(1 for r in roles if r == "tool_call")
print(f"Row {idx} | correctness={corr} | {len(msgs)} turns | {tc} tool_calls")
print(f"Roles: {' -> '.join(roles[:20])}{'...' if len(roles)>20 else ''}\n")

# ── Validation 1: FSM transitions ──
bad = [(j, roles[j], roles[j+1]) for j in range(len(roles)-1)
       if roles[j+1] not in VALID_NEXT.get(roles[j], set())]
if bad:
    print(f"!! FSM VIOLATIONS: {len(bad)}")
    for pos, a, b in bad[:5]:
        print(f"  [{pos}] {a} -> {b}")
else:
    print("✓ FSM transitions: all valid")

# ── Validation 2: content tags ──
tag_errors = []
for i, t in enumerate(msgs):
    r, c = t["role"], t["content"]
    if r == "reasoning":
        if not re.search(r'<think>.+</think>', c, re.DOTALL):
            tag_errors.append((i, r, "empty <think>"))
    elif r == "tool_call":
        if not re.search(r'<tool_call>.+</tool_call>', c, re.DOTALL):
            tag_errors.append((i, r, "empty <tool_call>"))
        else:
            # The payload between the tags must be JSON with name + arguments.
            blob = c[c.find("{"):c.rfind("}") + 1]
            try:
                obj = json.loads(blob)
                if "name" not in obj or "arguments" not in obj:
                    tag_errors.append((i, r, "missing name/arguments"))
            except json.JSONDecodeError as e:
                tag_errors.append((i, r, f"invalid JSON: {e}"))
    elif r == "answer":
        if not re.search(r'<answer>.+</answer>', c, re.DOTALL):
            tag_errors.append((i, r, "empty <answer>"))
    elif r == "tool_output":
        if not re.search(r'<tool_response>.+</tool_response>', c, re.DOTALL):
            tag_errors.append((i, r, "empty <tool_response>"))
if tag_errors:
    print(f"!! TAG ERRORS: {len(tag_errors)}")
    for pos, role, err in tag_errors[:5]:
        print(f"  [{pos}] {role}: {err}")
else:
    print("✓ Content tags: all valid")

# ── Validation 3: structure checks ──
checks = []
if roles[0] != "system":
    checks.append("first role is not system")
if roles[1] != "user":
    checks.append("second role is not user")
if roles[-1] != "answer":
    checks.append(f"last role is {roles[-1]}, expected answer")
dupes = [(i, roles[i]) for i in range(len(roles)-1) if roles[i] == roles[i+1]]
if dupes:
    checks.append(f"consecutive same-role at {dupes[0]}")
if checks:
    print(f"!! STRUCTURE ISSUES: {len(checks)}")
    for c in checks:
        print(f"  {c}")
else:
    print("✓ Structure: system→user→...→answer, no consecutive duplicates")

# ── Print turns (long content is truncated) ──
print(f"\n{'='*70}")
print(f"FULL CONVERSATION ({len(msgs)} turns)")
print(f"{'='*70}\n")
for i, m in enumerate(msgs):
    content = m["content"]
    if m["role"] == "system":
        content = content[:200] + "..."
    elif len(content) > 300:
        content = content[:300] + "..."
    print(f"[{i}] {m['role']}:\n{content}\n")
```
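
The decomposition described under Conversion Details can be illustrated with a short sketch. It is not the converter used to build this dataset. It assumes that a compound assistant message holds one non-empty `<think>` block followed by either a single `<tool_call>` inside the `<tool_calls_begin>`/`<tool_calls_end>` wrapper or one `<answer>` block. The function name `split_assistant`, the regexes, and the `query` argument in the sample message are hypothetical; how the real converter handles several tool calls in one message is not stated, so the sketch simply rejects that case.

```python
import re

# Hypothetical patterns for the source tags named in Conversion Details.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def split_assistant(content: str) -> list[dict]:
    """Split one compound assistant message into separate FSM turns."""
    think = THINK_RE.search(content)
    if think is None or not think.group(1).strip():
        raise ValueError("assistant message has no non-empty <think> block")
    turns = [{"role": "reasoning",
              "content": f"<think>{think.group(1).strip()}</think>"}]

    # Everything after </think> is either a wrapped tool call or an answer.
    rest = content[think.end():]
    calls = CALL_RE.findall(rest)
    answer = ANSWER_RE.search(rest)
    if len(calls) == 1 and answer is None:
        turns.append({"role": "tool_call",
                      "content": f"<tool_call>{calls[0].strip()}</tool_call>"})
    elif not calls and answer is not None:
        turns.append({"role": "answer",
                      "content": f"<answer>{answer.group(1).strip()}</answer>"})
    else:
        # Assumption: anything else is treated as a conversion failure.
        raise ValueError("expected exactly one <tool_call> or one <answer>")
    return turns


# Sample message; the "query" argument name is an illustrative assumption.
msg = ('<think>Need to search.</think>'
       '<tool_calls_begin><tool_call>{"name": "search", "arguments": {"query": ["x"]}}'
       '</tool_call><tool_calls_end>')
for turn in split_assistant(msg):
    print(turn["role"], "->", turn["content"])
```

Raising on unexpected shapes instead of guessing mirrors the card's statement that rows violating the FSM were counted as failures rather than repaired.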

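
The Filtering rule (keep every `Correct` row, sample one third of the rest) can be checked against the reported counts with a small sketch. It assumes rows are plain dicts, that "1/3" means floor division (which reproduces 2,242 of 6,728), and a fixed seed; the actual sampling procedure and seed are not stated in this card, and `filter_rows`, `label_key`, and `seed` are hypothetical names.

```python
import random


def filter_rows(rows, label_key="trajectory_correctness", seed=0):
    """Keep all Correct rows plus a random third of the remaining rows."""
    rng = random.Random(seed)  # assumed seed, for reproducibility only
    correct = [r for r in rows if r[label_key] == "Correct"]
    other = [r for r in rows if r[label_key] != "Correct"]
    # Floor division is an assumption that matches the reported 2,242.
    sampled = rng.sample(other, len(other) // 3)
    return correct + sampled


# Synthetic rows with the label counts reported under Filtering.
rows = ([{"trajectory_correctness": "Correct"}] * 4949
        + [{"trajectory_correctness": "Incorrect"}] * 6728)
kept = filter_rows(rows)
print(len(rows), "->", len(kept))
```

With the reported counts this yields 4,949 + 2,242 = 7,191 rows before conversion, consistent with the totals listed above.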
Provider: AmanPriyanshu