Agentic-Coding-Tessa
Source: ModelScope community. Updated: 2026-01-06. Indexed: 2025-08-16.
Download link:
https://modelscope.cn/datasets/smirki/Agentic-Coding-Tessa
Description:
# Agentic Coding Dataset for Tessa
A comprehensive dataset for training coding agents with tool-use, reasoning, and software engineering capabilities.
## Dataset Composition
This dataset combines multiple high-quality sources:
- **hermes_reasoning** (20.0%): Tool-use and reasoning dataset - [interstellarninja/hermes_reasoning_tool_use](https://huggingface.co/datasets/interstellarninja/hermes_reasoning_tool_use)
- **search_arena** (15.0%): Search and retrieval tasks - [lmarena-ai/search-arena-24k](https://huggingface.co/datasets/lmarena-ai/search-arena-24k)
- **arena_human_pref** (15.0%): Human preference data for alignment - [lmarena-ai/arena-human-preference-140k](https://huggingface.co/datasets/lmarena-ai/arena-human-preference-140k)
- **rstar_coder** (25.0%): Advanced coding problems with reasoning - [microsoft/rStar-Coder](https://huggingface.co/datasets/microsoft/rStar-Coder)
- **swe_bench** (25.0%): Software engineering trajectories - [SWE-bench/SWE-smith-trajectories](https://huggingface.co/datasets/SWE-bench/SWE-smith-trajectories)
## Dataset Statistics
- **Total samples**: 44,100
- **Format**: Axolotl-compatible conversation format
- **Fields**: `conversations` (list of turns with `from` and `value` keys)
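The composition shares and the total sample count can be combined to estimate how many samples each source contributes. The sketch below assumes the listed percentages are exact fractions of the 44,100 total. The card does not publish per-source counts, so these are derived figures, and `approximate_counts` is a hypothetical helper rather than part of any library.

```python
# Shares from "Dataset Composition", in whole percent.
SHARES_PERCENT = {
    "hermes_reasoning": 20,
    "search_arena": 15,
    "arena_human_pref": 15,
    "rstar_coder": 25,
    "swe_bench": 25,
}
TOTAL_SAMPLES = 44_100  # from "Dataset Statistics"


def approximate_counts(total, shares_percent):
    """Return the sample count implied by each share, using integer arithmetic."""
    return {name: total * pct // 100 for name, pct in shares_percent.items()}


counts = approximate_counts(TOTAL_SAMPLES, SHARES_PERCENT)
for name, n in counts.items():
    print(f"{name}: {n}")
```

Under that assumption the shares work out to 8,820, 6,615, 6,615, 11,025, and 11,025 samples, which add up to the stated total.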
## Usage with Axolotl
```yaml
datasets:
  - path: smirki/Agentic-Coding-Tessa
    type: chat_template
    field_messages: conversations
    message_property_mappings:
      role: from
      content: value
    split: train
```
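To make the mapping above concrete, the sketch below renames the keys of each turn the way `message_property_mappings` describes: `from` becomes `role` and `value` becomes `content`. It only illustrates what the mapping expresses and is not Axolotl's implementation. `apply_property_mappings` is a hypothetical helper, and the record is the abbreviated example from this card.

```python
# Mirrors the `message_property_mappings` block: target key -> source key.
PROPERTY_MAPPINGS = {"role": "from", "content": "value"}


def apply_property_mappings(record, field_messages="conversations",
                            mappings=PROPERTY_MAPPINGS):
    """Return the record's turns with their keys renamed according to `mappings`."""
    return [
        {target: turn[source] for target, source in mappings.items()}
        for turn in record[field_messages]
    ]


record = {
    "conversations": [
        {"from": "system", "value": "You are an expert programming assistant..."},
        {"from": "human", "value": "Help me implement a binary search algorithm"},
        {"from": "gpt", "value": "I'll help you implement binary search..."},
    ],
    "source": "dataset_name",
}

messages = apply_property_mappings(record)
for message in messages:
    print(message["role"], "->", message["content"])
```

The role values themselves (`system`, `human`, `gpt`) are left untouched here. Translating them to the role names a chat template expects is a separate step that this sketch does not cover.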
## Training Configuration for UIGEN-X
Recommended configuration for UIGEN-X-4B with this dataset:
```yaml
# Model
base_model: Tesslate/UIGEN-X-4B-0729
chat_template: chatml # For Qwen3-based models
# LoRA Configuration
adapter: lora
lora_r: 256
lora_alpha: 512
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - gate_proj
  - up_proj
  - down_proj
# Training
sequence_len: 8192 # Extended for code
micro_batch_size: 4
gradient_accumulation_steps: 4
num_epochs: 2
learning_rate: 5e-4
```
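The batch and epoch settings above imply a rough optimizer step count, sketched below. The assumptions are not stated in the card: a single GPU, no sample packing, no held-out validation split, and every one of the 44,100 samples used once per epoch. `training_steps` is a hypothetical helper. The LoRA scaling line uses the standard `lora_alpha / lora_r` formulation.

```python
import math


def training_steps(total_samples, micro_batch_size, grad_accum_steps,
                   num_epochs, num_gpus=1):
    """Return (effective batch size, steps per epoch, total optimizer steps)."""
    effective_batch = micro_batch_size * grad_accum_steps * num_gpus
    # A final partial batch still costs one optimizer step, hence ceil.
    steps_per_epoch = math.ceil(total_samples / effective_batch)
    return effective_batch, steps_per_epoch, steps_per_epoch * num_epochs


effective_batch, steps_per_epoch, total_steps = training_steps(
    total_samples=44_100,
    micro_batch_size=4,
    grad_accum_steps=4,
    num_epochs=2,
)

# Standard LoRA scales the adapter update by alpha / r.
lora_scaling = 512 / 256

print(effective_batch, steps_per_epoch, total_steps, lora_scaling)
```

With these assumptions the effective batch size is 16 sequences, giving 2,757 steps per epoch and 5,514 steps over two epochs, with a LoRA scaling factor of 2.0. Multi-GPU training or sample packing would change the step counts.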
## Example Structure
```json
{
  "conversations": [
    {
      "from": "system",
      "value": "You are an expert programming assistant..."
    },
    {
      "from": "human",
      "value": "Help me implement a binary search algorithm"
    },
    {
      "from": "gpt",
      "value": "I'll help you implement binary search..."
    }
  ],
  "source": "dataset_name"
}
```
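A record with this shape can be checked before training. The sketch below is a minimal validator built only from the example above. `validate_record` is a hypothetical helper, and the default speaker set (`system`, `human`, `gpt`) is an assumption taken from that example. The tool-use sources may use other speaker names, so the set is a parameter.

```python
def validate_record(record, allowed_speakers=("system", "human", "gpt")):
    """Return a list of problems found in one record (empty if none)."""
    turns = record.get("conversations")
    if not isinstance(turns, list) or not turns:
        return ["`conversations` must be a non-empty list"]

    problems = []
    for i, turn in enumerate(turns):
        if not isinstance(turn, dict):
            problems.append(f"turn {i}: not an object")
            continue
        for key in ("from", "value"):
            if not isinstance(turn.get(key), str):
                problems.append(f"turn {i}: `{key}` missing or not a string")
        speaker = turn.get("from")
        if isinstance(speaker, str) and speaker not in allowed_speakers:
            problems.append(f"turn {i}: unexpected speaker {speaker!r}")

    if not isinstance(record.get("source"), str):
        problems.append("`source` missing or not a string")
    return problems


good = {
    "conversations": [
        {"from": "human", "value": "Help me implement a binary search algorithm"},
        {"from": "gpt", "value": "I'll help you implement binary search..."},
    ],
    "source": "dataset_name",
}
bad = {"conversations": [{"from": "robot"}]}

print(validate_record(good))
print(validate_record(bad))
```

Returning a list of messages instead of raising on the first error makes it easier to summarize problems across many records.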
## License
Apache 2.0 (inherited from constituent datasets)
## Citation
```bibtex
@dataset{agentic_coding_tessa_2024,
  title={Agentic Coding Dataset for Tessa},
  author={Smirki},
  year={2024},
  publisher={HuggingFace}
}
```
Provider: maas
Created: 2025-08-12



