five

AmanPriyanshu/tool-reasoning-sft-TOOLS-hermes-reasoning-tool-style-data-cleaned-rectified-115k

收藏
Hugging Face2026-03-10 更新2026-03-29 收录
下载链接:
https://hf-mirror.com/datasets/AmanPriyanshu/tool-reasoning-sft-TOOLS-hermes-reasoning-tool-style-data-cleaned-rectified-115k
下载链接
链接失效反馈
官方服务:
资源简介:
--- license: apache-2.0 task_categories: - text-generation language: - en tags: - deep-research - reasoning - tool-calling - agentic - multi-hop - search size_categories: - 1K<n<10K --- --- ## Agentic Tool-Use SFT Mix 111,295 additional multi-turn agentic trajectories across four task families, following the same strict reasoning + tool-call FSM format. Combined with the original 3,827 deep-research trajectories, the dataset totals **115,122 samples**. ### Distribution | Category | Samples | Full | Compact | |---|---|---|---| | Deep Research (original) | 3,827 | 100% | — | | Multi-Turn Tool Orchestration | 45,776 | 54% | 46% | | Deep Research | 34,282 | 71% | 29% | | Codebase Retrieval | 17,473 | 69% | 31% | | Database Interaction | 13,764 | 69% | 31% | | **Total** | **115,122** | | | ### Schema Two columns: `messages` (JSON string — list of role/content dicts) and `source` (category label). ### Cleaning All trajectories validated against the strict FSM. Stray turns stripped, missing reasoning bridges inserted, consecutive reasoning merged. ~11k trajectories required at least one repair. ``` system → user → reasoning → tool_call → tool_output → reasoning → tool_call → ... → reasoning → answer ``` ## Validated Transitions ``` system → user user → reasoning reasoning → tool_call | answer tool_call → tool_output tool_output → reasoning answer → user (multi-turn only) ``` ## Usage ```py import json, random from huggingface_hub import hf_hub_download import pyarrow.parquet as pq REPO = "AmanPriyanshu/tool-reasoning-sft-hermes-reasoning-tool-style-data-cleaned-rectified-115k" FILES = ["compiled_data.parquet", "data.parquet"] for fname in FILES: print("=" * 70) print(f"Downloading {fname}...") local = hf_hub_download(REPO, fname, repo_type="dataset") t = pq.read_table(local) print(f"Rows: {t.num_rows:,} | Columns: {t.column_names}") idx = random.randint(0, t.num_rows - 1) row = {col: t.column(col)[idx].as_py() for col in t.column_names} msgs = json.loads(row["messages"]) meta = {k: v for k, v in row.items() if k != "messages"} print(f"\nRow {idx} | meta={meta} | {len(msgs)} turns") print(f"Roles: {' -> '.join(m['role'] for m in msgs[:20])}{'...' if len(msgs) > 20 else ''}\n") for m in msgs: content = m["content"] if m["role"] == "system": content = content[:200] + "..." elif len(content) > 300: content = content[:300] + "..." print(f"[{m['role']}]\n{content}\n") print() ``` ## License Apache-2.0
提供机构:
AmanPriyanshu
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作