
Agentic-Coding-Tessa

ModelScope community | Updated 2026-01-06 | Indexed 2025-08-16
Download link:
https://modelscope.cn/datasets/smirki/Agentic-Coding-Tessa
Resource description:
# Agentic Coding Dataset for Tessa

A comprehensive dataset for training coding agents with tool-use, reasoning, and software engineering capabilities.

## Dataset Composition

This dataset combines multiple high-quality sources:

- **hermes_reasoning** (20.0%): Tool-use and reasoning dataset - [interstellarninja/hermes_reasoning_tool_use](https://huggingface.co/datasets/interstellarninja/hermes_reasoning_tool_use)
- **search_arena** (15.0%): Search and retrieval tasks - [lmarena-ai/search-arena-24k](https://huggingface.co/datasets/lmarena-ai/search-arena-24k)
- **arena_human_pref** (15.0%): Human preference data for alignment - [lmarena-ai/arena-human-preference-140k](https://huggingface.co/datasets/lmarena-ai/arena-human-preference-140k)
- **rstar_coder** (25.0%): Advanced coding problems with reasoning - [microsoft/rStar-Coder](https://huggingface.co/datasets/microsoft/rStar-Coder)
- **swe_bench** (25.0%): Software engineering trajectories - [SWE-bench/SWE-smith-trajectories](https://huggingface.co/datasets/SWE-bench/SWE-smith-trajectories)

## Dataset Statistics

- **Total samples**: 44,100
- **Format**: Axolotl-compatible conversation format
- **Fields**: `conversations` (list of turns with `from` and `value` keys)

## Usage with Axolotl

```yaml
datasets:
  - path: smirki/Agentic-Coding-Tessa
    type: chat_template
    field_messages: conversations
    message_property_mappings:
      role: from
      content: value
    split: train
```

## Training Configuration for UIGEN-X

Recommended configuration for UIGEN-X-4B with this dataset:

```yaml
# Model
base_model: Tesslate/UIGEN-X-4B-0729
chat_template: chatml  # For Qwen3-based models

# LoRA Configuration
adapter: lora
lora_r: 256
lora_alpha: 512
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - gate_proj
  - up_proj
  - down_proj

# Training
sequence_len: 8192  # Extended for code
micro_batch_size: 4
gradient_accumulation_steps: 4
num_epochs: 2
learning_rate: 5e-4
```

## Example Structure

```json
{
  "conversations": [
    {
      "from": "system",
      "value": "You are an expert programming assistant..."
    },
    {
      "from": "human",
      "value": "Help me implement a binary search algorithm"
    },
    {
      "from": "gpt",
      "value": "I'll help you implement binary search..."
    }
  ],
  "source": "dataset_name"
}
```

## License

Apache 2.0 (inherited from constituent datasets)

## Citation

```bibtex
@dataset{agentic_coding_tessa_2024,
  title={Agentic Coding Dataset for Tessa},
  author={Smirki},
  year={2024},
  publisher={HuggingFace}
}
```
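To make the `conversations` format and the `message_property_mappings` entry (`role: from`, `content: value`) more concrete, here is a minimal Python sketch that validates one record and renames its keys. It is an illustration only, not Axolotl's implementation. The `ROLE_MAP` table and the `to_messages` helper are hypothetical names, and the mapping of `human` to `user` and `gpt` to `assistant` is an assumption based on the common ShareGPT convention. Only the three speaker tags shown in the README example are accepted.

```python
from typing import Any

# Assumed mapping from the speaker tags in the README example to chat roles.
# This table is an illustration, not taken from the dataset or from Axolotl.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def to_messages(record: dict[str, Any]) -> list[dict[str, str]]:
    """Convert one record's `conversations` list into role/content messages."""
    turns = record.get("conversations")
    if not isinstance(turns, list) or not turns:
        raise ValueError("record needs a non-empty `conversations` list")

    messages = []
    for index, turn in enumerate(turns):
        # Each turn must carry the two keys named in the dataset statistics.
        if not isinstance(turn, dict) or "from" not in turn or "value" not in turn:
            raise ValueError(f"turn {index} must have `from` and `value` keys")
        speaker = turn["from"]
        if speaker not in ROLE_MAP:
            raise ValueError(f"turn {index} has unknown speaker {speaker!r}")
        messages.append({"role": ROLE_MAP[speaker], "content": turn["value"]})
    return messages


# The record below copies the README's example structure.
example = {
    "conversations": [
        {"from": "system", "value": "You are an expert programming assistant..."},
        {"from": "human", "value": "Help me implement a binary search algorithm"},
        {"from": "gpt", "value": "I'll help you implement binary search..."},
    ],
    "source": "dataset_name",
}

for message in to_messages(example):
    print(message["role"], "->", message["content"])
```

A check like this can be run over a sample of records before training to catch malformed turns early. Records with other speaker tags (for example tool outputs) would need extra entries in the assumed mapping.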

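The composition percentages, the total sample count, and the training settings in the README imply a few derived numbers. The Python sketch below works them out. The per-source counts are estimates computed from the stated percentages, not published figures. The steps-per-epoch value assumes a single device, no sample packing, and no held-out split. The LoRA scale uses the conventional `lora_alpha / lora_r` ratio. The function name `approximate_counts` is hypothetical.

```python
import math

# Values copied from the README; everything computed from them is an estimate.
TOTAL_SAMPLES = 44_100
MIX_PERCENT = {
    "hermes_reasoning": 20.0,
    "search_arena": 15.0,
    "arena_human_pref": 15.0,
    "rstar_coder": 25.0,
    "swe_bench": 25.0,
}
MICRO_BATCH_SIZE = 4
GRADIENT_ACCUMULATION_STEPS = 4
LORA_R = 256
LORA_ALPHA = 512


def approximate_counts(total: int, mix_percent: dict[str, float]) -> dict[str, int]:
    """Estimate per-source sample counts from the stated percentages."""
    if abs(sum(mix_percent.values()) - 100.0) > 1e-9:
        raise ValueError("percentages must sum to 100")
    return {name: round(total * pct / 100.0) for name, pct in mix_percent.items()}


counts = approximate_counts(TOTAL_SAMPLES, MIX_PERCENT)

# Sequences consumed per optimizer step on one device.
sequences_per_step = MICRO_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS

# Optimizer steps per epoch under the single-device, no-packing assumption.
steps_per_epoch = math.ceil(TOTAL_SAMPLES / sequences_per_step)

# Conventional LoRA scaling factor applied to the low-rank update.
lora_scale = LORA_ALPHA / LORA_R

for name, count in counts.items():
    print(f"{name}: ~{count} samples")
print("sequences per optimizer step:", sequences_per_step)
print("optimizer steps per epoch:", steps_per_epoch)
print("LoRA scale (alpha / r):", lora_scale)
```

With multiple GPUs or sample packing enabled, the effective batch size and the step count would differ from these single-device figures.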
Provider:
maas
Created:
2025-08-12