DJLougen/Acta
Hugging Face · Updated 2026-04-21 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/DJLougen/Acta
Description:
---
language:
- en
license: apache-2.0
library_name: datasets
tags:
- agentic
- tool-use
- function-calling
- curated
- high-quality
- conversations
size_categories:
- 1K<n<10K
---
# Acta
*Acta* (Latin: "acts, deeds, records") is a curated sample of high-quality agentic tool-use conversations, filtered with an 8-factor quality model derived from a statistical correlation analysis of diversity metrics.
## Overview
This public sample contains **5,000 conversations** from the full 37K Acta dataset. It was created by analyzing 5 major agentic datasets (FlameFox, Hermes Reasoning, Hermes Multi-turn, Smolagents, SWE Agent) and applying evidence-based quality filters to maximize training signal density.
**Key Statistics:**
- **Train:** 4,500 samples
- **Validation:** 500 samples
- **Avg semantic diversity:** 0.53
- **Avg lexical diversity:** 0.47
- **Sources:** 5 datasets harmonized
## The 8-Factor Quality Model
Each sample was scored using 8 factors weighted by their correlation with training quality:
| Factor | Weight | Threshold | Rationale |
|--------|--------|-----------|-----------|
| Lexical Diversity | 0.25 | >=0.35 | unique_words / total_words (r=+0.967 with quality) |
| Semantic Diversity | 0.20 | >=0.40 | content richness excluding stopwords (r=+0.947) |
| Verb Uniqueness | 0.15 | >=0.01 | unique_verbs / total_verbs (r=+0.852) |
| Turn Efficiency | 0.15 | 2-12 turns | Optimal conversation length (longer = repetitive) |
| Tool Pattern Novelty | 0.10 | - | Penalty for over-represented tool sequences |
| Reasoning Density | 0.08 | - | Thinking blocks per assistant turn |
| Role Balance | 0.05 | - | User/assistant turn ratio |
| Content Brevity | 0.02 | - | Information density (shorter = denser signal) |
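The weights in the table sum to 1.0, so one plausible reading is that the overall score is a weighted sum of the eight factor scores. The sketch below illustrates that reading only; the card does not publish the scoring code. The function names, the tokenizer, and the assumption that every factor is already scaled to the range 0-1 are all hypothetical.

```python
import re

# Weights copied from the table above; they sum to 1.0.
WEIGHTS = {
    "lexical_diversity": 0.25,
    "semantic_diversity": 0.20,
    "verb_uniqueness": 0.15,
    "turn_efficiency": 0.15,
    "tool_pattern_novelty": 0.10,
    "reasoning_density": 0.08,
    "role_balance": 0.05,
    "content_brevity": 0.02,
}


def lexical_diversity(text: str) -> float:
    """unique_words / total_words, using a simple lowercase tokenizer (assumption)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return len(set(words)) / len(words) if words else 0.0


def quality_score(factors: dict) -> float:
    """Weighted sum of factor scores, each assumed to be pre-scaled to [0, 1]."""
    missing = WEIGHTS.keys() - factors.keys()
    if missing:
        raise KeyError(f"missing factors: {sorted(missing)}")
    return sum(WEIGHTS[name] * factors[name] for name in WEIGHTS)
```

How the thresholds in the table interact with the weighted score (hard cut-off versus soft penalty) is not stated in the card, so the sketch leaves them out.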
## Source Distribution
| Source | Samples | Avg Quality | Characteristics |
|--------|---------|-------------|-----------------|
| FlameFox Agentic | ~2,600 | 0.819 | Highest diversity, balanced tools |
| Hermes Reasoning | ~1,400 | 0.598 | Good semantic diversity |
| Hermes Multi-turn | ~450 | 0.534 | Multi-turn, deduplicated |
| Smolagents Code | ~70 | 0.592 | Low redundancy |
| SWE Agent GLM | ~20 | 0.433 | Shortest, least repetitive traces |
## Usage
```python
from datasets import load_dataset
# Load the curated sample
dataset = load_dataset("DJLougen/Acta")
# Access quality metrics
sample = dataset["train"][0]
print(sample["quality_score"])       # float in the range 0.0 - 1.0
print(sample["semantic_diversity"])  # e.g. 0.546
print(sample["lexical_diversity"])   # e.g. 0.400
```
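Because each row carries its own metrics, a stricter subset can be selected after loading. The following is a minimal sketch that runs on plain dictionaries so it works offline. The `passes_filters` helper and the `min_quality` cut-off of 0.7 are illustrative assumptions; the two diversity thresholds come from the quality model table, and the example rows are made up.

```python
def passes_filters(row: dict, min_quality: float = 0.7) -> bool:
    """Keep rows above a chosen quality cut-off and the documented diversity thresholds."""
    return (
        row["quality_score"] >= min_quality          # hypothetical cut-off
        and row["lexical_diversity"] >= 0.35         # threshold from the table
        and row["semantic_diversity"] >= 0.40        # threshold from the table
    )


# Made-up rows that mimic the three metric fields shown above.
rows = [
    {"quality_score": 0.82, "semantic_diversity": 0.546, "lexical_diversity": 0.400},
    {"quality_score": 0.55, "semantic_diversity": 0.610, "lexical_diversity": 0.480},
    {"quality_score": 0.90, "semantic_diversity": 0.350, "lexical_diversity": 0.500},
]

kept = [row for row in rows if passes_filters(row)]
print(len(kept), "of", len(rows), "rows kept")
```

With the real dataset loaded, the same predicate could be passed to `dataset["train"].filter(passes_filters)`, which calls the function once per example.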
## Key Findings
1. **Quality ≠ Quantity** - 37K curated samples outperform 200K raw samples
2. **Lexical diversity is the strongest quality predictor** (r=+0.967)
3. **More content ≠ better signal** - SWE Agent traces contain 5x more text than the other sources but score 3x lower on quality
4. **4-10 turns is optimal** - longer conversations become repetitive (r=-0.982)
5. **Tool redundancy is rampant** - some patterns repeat 3000+ times in raw data
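The tool-redundancy finding, and the Tool Pattern Novelty factor that penalizes it, can be illustrated by counting ordered tool-call sequences. This is a sketch under stated assumptions: each conversation is reduced to a list of tool names, and the penalty takes the form 1 / count. Neither the input schema nor the penalty formula is specified in the card.

```python
from collections import Counter


def tool_pattern_counts(tool_sequences):
    """Count how often each ordered tool-call sequence occurs across conversations."""
    return Counter(tuple(seq) for seq in tool_sequences)


def novelty_scores(tool_sequences):
    """Score each conversation as 1 / (occurrences of its pattern); assumed penalty form."""
    counts = tool_pattern_counts(tool_sequences)
    return [1.0 / counts[tuple(seq)] for seq in tool_sequences]


# Hypothetical tool-name sequences, one list per conversation.
sequences = [
    ["search", "read_file"],
    ["search", "read_file"],
    ["edit_file"],
]

print(tool_pattern_counts(sequences).most_common(1))
print(novelty_scores(sequences))
```

A pattern seen thousands of times would score close to zero under this form, while a unique pattern scores 1.0.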
## Citation
```bibtex
@dataset{acta_2026,
title = {Acta: A Quality-Curated Agentic Tool-Use Dataset},
author = {Lougen, Daniel},
year = {2026},
url = {https://huggingface.co/datasets/DJLougen/Acta}
}
```
## License
Apache 2.0
## Acknowledgments
Source datasets:
- [FlameF0X/agentic-code](https://huggingface.co/datasets/FlameF0X/agentic-code)
- [interstellarninja/hermes_reasoning_tool_use](https://huggingface.co/datasets/interstellarninja/hermes_reasoning_tool_use)
- [interstellarninja/tool-use-multiturn-reasoning](https://huggingface.co/datasets/interstellarninja/tool-use-multiturn-reasoning)
- [smolagents/codeagent-traces](https://huggingface.co/datasets/smolagents/codeagent-traces)
- [DCAgent/neulab-nebius-swe-agent-trajectories-sandboxes_glm_4.7_traces_jupiter](https://huggingface.co/datasets/DCAgent/neulab-nebius-swe-agent-trajectories-sandboxes_glm_4.7_traces_jupiter)
---
**Full 37K proprietary dataset:** [DJLougen/Acta-Proprietary](https://huggingface.co/datasets/DJLougen/Acta-Proprietary)
Provider:
DJLougen



