five

sroecker/hermes-agent-traces-chatml

收藏
Hugging Face2026-04-21 更新2026-04-26 收录
下载链接:
https://hf-mirror.com/datasets/sroecker/hermes-agent-traces-chatml
下载链接
链接失效反馈
官方服务:
资源简介:
--- dataset_info: features: - name: messages list: - name: role dtype: string - name: content dtype: string - name: source dtype: string splits: - name: train num_bytes: 940104210 num_examples: 18487 download_size: 939906839 dataset_size: 940104210 configs: - config_name: default data_files: - split: train path: data/train-* license: apache-2.0 task_categories: - text-generation language: - en tags: - tool-calling - function-calling - agent - hermes - reasoning - chatml - sft size_categories: - 10K<n<100K --- # Hermes Agent Traces — ChatML Format A ready-to-train dataset of **18,487 multi-turn tool-calling conversations** in ChatML `messages` format, combining Hermes Agent reasoning traces with NousResearch function-calling data. Built for SFT training of tool-calling / agentic LLMs with [TRL's SFTTrainer](https://huggingface.co/docs/trl/sft_trainer). ## Quick Start ```python from datasets import load_dataset from trl import SFTTrainer dataset = load_dataset("sroecker/hermes-agent-traces-chatml", split="train") trainer = SFTTrainer( model="Qwen/Qwen3-0.6B", train_dataset=dataset, ) trainer.train() ``` ## Schema | Column | Type | Description | |--------|------|-------------| | `messages` | `list[{role, content}]` | Multi-turn conversation in ChatML format | | `source` | `string` | Origin dataset: `"hermes-traces"` or `"nous-fc"` | Message roles: `system`, `user`, `assistant`, `tool` ## Source Datasets | Source | Config | Samples | Description | |--------|--------|---------|-------------| | [lambda/hermes-agent-reasoning-traces](https://huggingface.co/datasets/lambda/hermes-agent-reasoning-traces) | `kimi` | 7,646 | Multi-turn agentic traces from Kimi-K2.5, avg 24.3 turns, 13.9 tool calls per sample | | [lambda/hermes-agent-reasoning-traces](https://huggingface.co/datasets/lambda/hermes-agent-reasoning-traces) | `glm-5.1` | 7,055 | Multi-turn agentic traces from GLM-5.1, avg 19.1 turns, 9.7 tool calls per sample | | [NousResearch/hermes-function-calling-v1](https://huggingface.co/datasets/NousResearch/hermes-function-calling-v1) | `func_calling_singleturn` | 1,893 | Single-turn function calling across diverse domains | | [NousResearch/hermes-function-calling-v1](https://huggingface.co/datasets/NousResearch/hermes-function-calling-v1) | `func_calling` | 1,893 | Multi-turn function calling conversations | ## Processing Steps The dataset was created by the following pipeline: ### 1. Format conversion (ShareGPT → ChatML) All source datasets use ShareGPT format (`from`/`value` keys). These were converted to ChatML (`role`/`content`): | ShareGPT `from` | ChatML `role` | |-----------------|---------------| | `system` | `system` | | `human` | `user` | | `gpt` | `assistant` | | `tool` | `tool` | ### 2. System prompt condensation (Hermes traces only) The original Hermes Agent system prompts are **~25,000 chars / ~6,200 tokens** each because they embed full tool JSON schemas inline. These were replaced with a condensed ~90-token instruction: ``` You are a function calling AI model. You may call one or more functions to assist with the user query. Don't make assumptions about what values to plug into functions. For each function call return a JSON object with the following schema: {"name": <function-name>, "arguments": <args-dict>} Each function call should be enclosed within <tool_call> </tool_call> XML tags. Function results will be provided within <tool_response> </tool_response> XML tags. ``` **Why?** The original system prompts consumed ~75% of a typical training window (8,192 tokens), leaving almost no room for the actual tool-calling conversation. By condensing the system prompt, the model sees far more of the multi-turn interaction patterns during training. The tool-calling format (`<tool_call>`, `<tool_response>`, `<think>`) is learned from the conversation turns themselves, not from the schema in the system prompt. ### 3. Filtering Examples were filtered to require: - At least 3 messages - At least one `assistant` turn ### 4. Concatenation & shuffling All four source splits were concatenated and shuffled with `seed=42`. ## Conversation Format Assistant messages contain inline XML blocks for reasoning and tool use: ```xml <think> The user wants me to search for files. Let me use the search tool. </think> <tool_call> {"name": "search_files", "arguments": {"query": "payment processing"}} </tool_call> ``` Tool responses appear as: ```xml <tool_response> {"tool_call_id": "call_123", "name": "search_files", "content": {"results": [...]}} </tool_response> ``` These special tokens (`<tool_call>`, `</tool_call>`, `<tool_response>`, `</tool_response>`, `<think>`, `</think>`) are natively supported by Qwen3's tokenizer as dedicated token IDs. ## Task Categories The dataset covers a wide range of agentic tasks: - **Terminal & Coding** — script writing, debugging, environment setup - **Agent Tools** — memory persistence, task delegation, skill management, todo planning - **Repository Tasks** — bug fixes, feature implementation, code review, refactoring - **Browser Automation** — Playwright-based navigation, scraping, form filling - **File Operations** — reading, writing, patching files - **Scheduling & Planning** — task organization, time management - **IoT & Home Automation** — smart device control (from NousResearch data) - **Multi-Tool** — complex tasks requiring multiple tool types ## Token Length Distribution With the condensed system prompts (measured with Qwen3 tokenizer): | Percentile | Tokens | |-----------|--------| | P10 | ~1,200 | | P25 | ~4,900 | | P50 (median) | ~17,000 | | P75 | ~49,700 | | P90 | ~85,400 | Recommended `max_length` settings: - `4096`: captures ~21% of examples fully - `8192`: captures ~31% of examples fully - `16384`: captures ~49% of examples fully Longer examples are truncated from the right. With `assistant_only_loss=True`, the truncated system/user prefix tokens don't contribute to loss anyway. ## License Apache 2.0 (inherited from source datasets)
提供机构:
sroecker
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作