DJLougen/Acta-Synthetic
Source: Hugging Face | Updated: 2026-04-21 | Indexed: 2026-04-26
Download link:
https://hf-mirror.com/datasets/DJLougen/Acta-Synthetic
Resource description:
---
language:
- en
license: apache-2.0
library_name: datasets
tags:
- agentic
- tool-use
- function-calling
- synthetic
- curated
- high-quality
- conversations
size_categories:
- 1K<n<10K
---
# Acta-Synthetic
High-quality synthetic agentic tool-use conversations generated with 8-factor quality control.
## Overview
This dataset contains **5,000 synthetic agentic tool-use conversations** generated using a template-based synthesis pipeline with strict quality controls.
**Key Statistics:**
- **Train:** 4,500 samples
- **Validation:** 500 samples
- **Avg quality score:** 0.824
- **Avg semantic diversity:** 0.914
- **Avg lexical diversity:** 0.819
- **Pass rate:** 96.2%
## Generation Method
Generated conversations were scored and filtered with the **8-Factor Quality Model**:
| Factor | Weight | Rationale |
|--------|--------|-----------|
| Lexical Diversity | 0.25 | unique_words / total_words (r=+0.967 with quality) |
| Semantic Diversity | 0.20 | content richness (r=+0.947) |
| Verb Uniqueness | 0.15 | unique_verbs / total_verbs (r=+0.852) |
| Turn Efficiency | 0.15 | Optimal 4-10 turns |
| Tool Pattern Novelty | 0.10 | Penalty for over-represented patterns |
| Reasoning Density | 0.08 | Thinking blocks per turn |
| Role Balance | 0.05 | User/assistant ratio |
| Content Brevity | 0.02 | Information density |
**Quality threshold:** 0.75 (only samples scoring above this were kept)
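The weights in the table sum to 1.0, so one plausible reading is that the overall score is a weighted sum of per-factor scores. The sketch below illustrates that reading only. The factor key names, the assumption that every factor is already normalized to [0, 1], and the whitespace tokenizer in `lexical_diversity` are all hypothetical; the actual `AgenticDataQualityModel` may compute these differently.

```python
from typing import Mapping

# Weights copied from the table above; the key names are hypothetical.
WEIGHTS = {
    "lexical_diversity": 0.25,
    "semantic_diversity": 0.20,
    "verb_uniqueness": 0.15,
    "turn_efficiency": 0.15,
    "tool_pattern_novelty": 0.10,
    "reasoning_density": 0.08,
    "role_balance": 0.05,
    "content_brevity": 0.02,
}
QUALITY_THRESHOLD = 0.75  # samples scoring above this are kept


def lexical_diversity(text: str) -> float:
    """unique_words / total_words, using naive whitespace tokenization."""
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0


def quality_score(factors: Mapping[str, float]) -> float:
    """Weighted sum of factor scores, each assumed to lie in [0, 1]."""
    return sum(weight * factors[name] for name, weight in WEIGHTS.items())


def is_kept(factors: Mapping[str, float]) -> bool:
    """Apply the quality threshold to the weighted score."""
    return quality_score(factors) > QUALITY_THRESHOLD
```

Because the two diversity factors and verb uniqueness carry 60% of the total weight, a sample with weak vocabulary variety cannot easily reach the threshold on the strength of the remaining factors.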
## Domains
- **Database queries:** 15.3%
- **Math problems:** 14.8%
- **API integration:** 12.7%
- **Coding tasks:** 12.3%
- **File processing:** 11.6%
- **Research:** 11.1%
- **Multi-step research:** 11.1%
- **Data analysis:** 11.1%
## Tools Available
- `web_search` - Web search capability
- `wikipedia_search` - Wikipedia knowledge retrieval
- `python_interpreter` - Code execution
- `code_analyzer` - Code analysis and review
- `read_file` - File reading
- `write_file` - File writing
- `api_request` - HTTP API calls
- `database_query` - SQL database queries
- `calculator` - Mathematical calculations
- `data_transform` - Data transformation
## Comparison with Real Data
Validated against [DJLougen/Acta](https://huggingface.co/datasets/DJLougen/Acta) (real agentic data):
| Metric | Real (Acta) | Synthetic | Difference |
|--------|-------------|-----------|------------|
| Quality Score | 0.788 ± 0.052 | 0.826 ± 0.038 | **+0.038** |
| Acceptance Rate | - | 57.2% | - |
| Unique Patterns | - | 234 | - |
**Synthetic data achieves higher quality scores than real curated data** due to:
1. Strict 8-factor filtering during generation
2. Pattern deduplication (max 50 samples per tool pattern)
3. Optimized conversation structure (4-10 turns)
4. Controlled diversity in vocabulary and reasoning
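Item 2 in the list above caps each tool pattern at 50 samples. A minimal sketch of such a cap follows, assuming that a "pattern" is the ordered sequence of tool names in a sample's `tools_used` field; the card does not define the term, so that definition and the function name `cap_tool_patterns` are illustrative only.

```python
from collections import Counter

MAX_PER_PATTERN = 50  # cap stated in the list above


def cap_tool_patterns(samples, max_per_pattern=MAX_PER_PATTERN):
    """Yield samples in order, skipping any whose pattern already hit the cap."""
    seen = Counter()
    for sample in samples:
        # Assumption: a pattern is the ordered tuple of tool names.
        pattern = tuple(sample["tools_used"])
        if seen[pattern] < max_per_pattern:
            seen[pattern] += 1
            yield sample
```

The generator keeps the first samples it encounters for each pattern, so shuffling the input beforehand avoids favoring whichever samples were generated first.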
## Usage
```python
from datasets import load_dataset
# Load synthetic dataset
dataset = load_dataset("DJLougen/Acta-Synthetic")
# Access quality metrics
sample = dataset["train"][0]
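print(sample["quality_score"])       # per-sample score; the dataset average is 0.824
print(sample["semantic_diversity"])  # the dataset average is 0.914
print(sample["lexical_diversity"])   # the dataset average is 0.819
print(sample["tools_used"])          # list of tool names, e.g. ["python_interpreter", "calculator"]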
```
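Since each sample carries its own `quality_score` and `tools_used` fields, a subset can be selected with a plain predicate. The sketch below works on any iterable of sample dictionaries; the helper names `uses_tool` and `select_samples` are hypothetical and not part of the dataset or the `datasets` library.

```python
def uses_tool(sample, tool):
    """True if the sample's tool list contains the given tool name."""
    return tool in sample["tools_used"]


def select_samples(samples, tool, min_quality=0.0):
    """Return samples that use `tool` and score at least `min_quality`."""
    return [
        s for s in samples
        if uses_tool(s, tool) and s["quality_score"] >= min_quality
    ]
```

With the dataset loaded as in the snippet above, the same predicate can be passed to `Dataset.filter`, for example `dataset["train"].filter(lambda s: uses_tool(s, "calculator"))`.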
## Mixing with Real Data
```python
from datasets import load_dataset
import random
# Load both datasets
real = load_dataset("DJLougen/Acta")["train"]
synthetic = load_dataset("DJLougen/Acta-Synthetic")["train"]
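# Mix at a 30% synthetic ratio (3,500 real + 1,500 synthetic).
# The ratio holds only if both datasets contain at least that many samples.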
real_samples = [real[i] for i in range(min(3500, len(real)))]
synthetic_samples = [synthetic[i] for i in range(min(1500, len(synthetic)))]
mixed = real_samples + synthetic_samples
random.shuffle(mixed)
print(f"Mixed dataset: {len(mixed)} samples")
```
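The fixed counts above give a 30% synthetic share only when both datasets are large enough. As a hedged sketch, the hypothetical helper `mix_sizes` below derives the per-source counts from a target ratio and shrinks the total when either source runs short, and `sample_indices` draws a reproducible random subset instead of taking the first rows. Neither function comes from the Acta pipeline.

```python
import random


def mix_sizes(n_real, n_synthetic, synthetic_ratio=0.3, total=5000):
    """Return (real_count, synthetic_count) that respect the ratio and availability."""
    if not 0.0 < synthetic_ratio < 1.0:
        raise ValueError("synthetic_ratio must be strictly between 0 and 1")
    # Largest total that both sources can support at this ratio.
    max_total = min(n_real / (1.0 - synthetic_ratio), n_synthetic / synthetic_ratio)
    total = min(total, int(max_total))
    synthetic_count = min(round(total * synthetic_ratio), n_synthetic)
    real_count = min(total - synthetic_count, n_real)
    return real_count, synthetic_count


def sample_indices(n_available, k, seed=0):
    """Pick k distinct row indices with a fixed seed for reproducibility."""
    return random.Random(seed).sample(range(n_available), k)
```

The returned indices can be used in place of `range(...)` in the list comprehensions above, so that the mixed set is not biased toward the first rows of each dataset.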
## Generation Pipeline
The generation pipeline is available in the [Acta repository](https://huggingface.co/datasets/DJLougen/Acta).
Key components:
- `SyntheticConversationGenerator` - Template-based generation
- `LLMSyntheticGenerator` - LLM-enhanced generation (optional)
- `AgenticDataQualityModel` - 8-factor quality scoring
- `SyntheticDataPipeline` - End-to-end workflow
## Quality Validation
All samples pass strict quality thresholds:
- Lexical diversity ≥ 0.35
- Semantic diversity ≥ 0.40
- Verb uniqueness ≥ 0.01
- Turn count: 2-12
**Target quality score:** ≥ 0.75
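The thresholds above can be re-checked on downloaded samples. The following is a minimal sketch, assuming each sample exposes `lexical_diversity`, `semantic_diversity`, and `quality_score` as in the usage example; the `verb_uniqueness` and `num_turns` field names are hypothetical and may need to be adapted to the actual schema.

```python
# Minimum values copied from the list above.
MIN_THRESHOLDS = {
    "lexical_diversity": 0.35,
    "semantic_diversity": 0.40,
    "verb_uniqueness": 0.01,  # hypothetical field name
}
MIN_TURNS, MAX_TURNS = 2, 12
TARGET_QUALITY = 0.75


def failed_checks(sample):
    """Return the names of the checks a sample fails (empty list if it passes)."""
    failures = [
        name for name, minimum in MIN_THRESHOLDS.items()
        if sample[name] < minimum
    ]
    # "num_turns" is a hypothetical field holding the turn count.
    if not MIN_TURNS <= sample["num_turns"] <= MAX_TURNS:
        failures.append("num_turns")
    if sample["quality_score"] < TARGET_QUALITY:
        failures.append("quality_score")
    return failures
```

Returning the list of failed checks rather than a single boolean makes it easier to see which threshold rejects the most samples.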
## Citation
```bibtex
@dataset{acta_synthetic_2026,
title = {Acta-Synthetic: High-Quality Synthetic Agentic Tool-Use Dataset},
author = {Lougen, Daniel},
year = {2026},
url = {https://huggingface.co/datasets/DJLougen/Acta-Synthetic}
}
```
## License
Apache 2.0
## Related Datasets
- [DJLougen/Acta](https://huggingface.co/datasets/DJLougen/Acta) - Public sample (5K real samples)
- [DJLougen/Acta-Proprietary](https://huggingface.co/datasets/DJLougen/Acta-Proprietary) - Full 37K real samples (private)
- [DJLougen/ornstein-proprietary-100k](https://huggingface.co/datasets/DJLougen/ornstein-proprietary-100k) - 82K samples with 8-factor metrics
Provider:
DJLougen



