
DJLougen/Acta-Synthetic

Source: Hugging Face. Updated 2026-04-21; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/DJLougen/Acta-Synthetic
Resource description:
---
language:
- en
license: apache-2.0
library_name: datasets
tags:
- agentic
- tool-use
- function-calling
- synthetic
- curated
- high-quality
- conversations
size_categories:
- 1K<n<10K
---

# Acta-Synthetic

High-quality synthetic agentic tool-use conversations generated with 8-factor quality control.

## Overview

This dataset contains **5,000 synthetic agentic tool-use conversations** generated using a template-based synthesis pipeline with strict quality controls.

**Key Statistics:**

- **Train:** 4,500 samples
- **Validation:** 500 samples
- **Avg quality score:** 0.824
- **Avg semantic diversity:** 0.914
- **Avg lexical diversity:** 0.819
- **Pass rate:** 96.2%

## Generation Method

Conversations were generated and scored using the **8-Factor Quality Model**:

| Factor | Weight | Rationale |
|--------|--------|-----------|
| Lexical Diversity | 0.25 | unique_words / total_words (r=+0.967 with quality) |
| Semantic Diversity | 0.20 | Content richness (r=+0.947) |
| Verb Uniqueness | 0.15 | unique_verbs / total_verbs (r=+0.852) |
| Turn Efficiency | 0.15 | Optimal 4-10 turns |
| Tool Pattern Novelty | 0.10 | Penalty for over-represented patterns |
| Reasoning Density | 0.08 | Thinking blocks per turn |
| Role Balance | 0.05 | User/assistant ratio |
| Content Brevity | 0.02 | Information density |

**Quality threshold:** 0.75 (only samples scoring above this were kept)

## Domains

- **Database queries:** 15.3%
- **Math problems:** 14.8%
- **API integration:** 12.7%
- **Coding tasks:** 12.3%
- **File processing:** 11.6%
- **Research:** 11.1%
- **Multi-step research:** 11.1%
- **Data analysis:** 11.1%

## Tools Available

- `web_search` - Web search capability
- `wikipedia_search` - Wikipedia knowledge retrieval
- `python_interpreter` - Code execution
- `code_analyzer` - Code analysis and review
- `read_file` - File reading
- `write_file` - File writing
- `api_request` - HTTP API calls
- `database_query` - SQL database queries
- `calculator` - Mathematical calculations
- `data_transform` - Data transformation

## Comparison with Real Data

Validated against [DJLougen/Acta](https://huggingface.co/datasets/DJLougen/Acta) (real agentic data):

| Metric | Real (Acta) | Synthetic | Difference |
|--------|-------------|-----------|------------|
| Quality Score | 0.788 ± 0.052 | 0.826 ± 0.038 | **+0.038** |
| Acceptance Rate | - | 57.2% | - |
| Unique Patterns | - | 234 | - |

**Synthetic data achieves higher quality scores than real curated data** due to:

1. Strict 8-factor filtering during generation
2. Pattern deduplication (max 50 samples per tool pattern)
3. Optimized conversation structure (4-10 turns)
4. Controlled diversity in vocabulary and reasoning

## Usage

```python
from datasets import load_dataset

# Load the synthetic dataset (train and validation splits)
dataset = load_dataset("DJLougen/Acta-Synthetic")

# Access the per-sample quality metrics
sample = dataset["train"][0]
print(sample["quality_score"])       # per-sample score (dataset average: 0.824)
print(sample["semantic_diversity"])  # per-sample value (dataset average: 0.914)
print(sample["lexical_diversity"])   # per-sample value (dataset average: 0.819)
print(sample["tools_used"])          # e.g. ["python_interpreter", "calculator"]
```

## Mixing with Real Data

```python
from datasets import load_dataset
import random

# Load the train split of both datasets
real = load_dataset("DJLougen/Acta")["train"]
synthetic = load_dataset("DJLougen/Acta-Synthetic")["train"]

# Mix with a 30% synthetic ratio (3,500 real + 1,500 synthetic);
# the ratio only holds if both splits contain at least that many rows
real_samples = [real[i] for i in range(min(3500, len(real)))]
synthetic_samples = [synthetic[i] for i in range(min(1500, len(synthetic)))]

mixed = real_samples + synthetic_samples
random.shuffle(mixed)

print(f"Mixed dataset: {len(mixed)} samples")
```

## Generation Pipeline

The generation pipeline is available in the [Acta repository](https://huggingface.co/datasets/DJLougen/Acta).

Key components:

- `SyntheticConversationGenerator` - Template-based generation
- `LLMSyntheticGenerator` - LLM-enhanced generation (optional)
- `AgenticDataQualityModel` - 8-factor quality scoring
- `SyntheticDataPipeline` - End-to-end workflow

## Quality Validation

All samples pass strict quality thresholds:

- Lexical diversity ≥ 0.35
- Semantic diversity ≥ 0.40
- Verb uniqueness ≥ 0.01
- Turn count: 2-12

**Target quality score:** ≥ 0.75

## Citation

```bibtex
@dataset{acta_synthetic_2026,
  title  = {Acta-Synthetic: High-Quality Synthetic Agentic Tool-Use Dataset},
  author = {Lougen, Daniel},
  year   = {2026},
  url    = {https://huggingface.co/datasets/DJLougen/Acta-Synthetic}
}
```

## License

Apache 2.0

## Related Datasets

- [DJLougen/Acta](https://huggingface.co/datasets/DJLougen/Acta) - Public sample (5K real samples)
- [DJLougen/Acta-Proprietary](https://huggingface.co/datasets/DJLougen/Acta-Proprietary) - Full 37K real samples (private)
- [DJLougen/ornstein-proprietary-100k](https://huggingface.co/datasets/DJLougen/ornstein-proprietary-100k) - 82K samples with 8-factor metrics
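
## Appendix: Scoring Sketch

The card lists the eight factor weights, the 0.75 threshold, and the cap of 50 samples per tool pattern, but not the scoring code itself. The sketch below is one plausible reading, not the `AgenticDataQualityModel` implementation. It assumes that the quality score is a weighted linear sum of factors already normalized to [0, 1], that lexical diversity uses simple whitespace tokenization, and that a "tool pattern" is the ordered tuple in `tools_used`. The dictionary keys and function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Mapping

# Weights copied from the 8-factor table in the card; they sum to 1.0.
# The key names are hypothetical, chosen only for this sketch.
WEIGHTS = {
    "lexical_diversity": 0.25,
    "semantic_diversity": 0.20,
    "verb_uniqueness": 0.15,
    "turn_efficiency": 0.15,
    "tool_pattern_novelty": 0.10,
    "reasoning_density": 0.08,
    "role_balance": 0.05,
    "content_brevity": 0.02,
}
QUALITY_THRESHOLD = 0.75  # threshold stated in the card


def lexical_diversity(text: str) -> float:
    """unique_words / total_words, assuming whitespace tokenization."""
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0


def quality_score(factors: Mapping[str, float]) -> float:
    """Weighted sum of the eight factors (assumed to lie in [0, 1])."""
    missing = set(WEIGHTS) - set(factors)
    if missing:
        raise KeyError(f"missing factors: {sorted(missing)}")
    return sum(weight * factors[name] for name, weight in WEIGHTS.items())


def passes_threshold(factors: Mapping[str, float]) -> bool:
    """Keep a sample when its score reaches the threshold (>= is an assumption)."""
    return quality_score(factors) >= QUALITY_THRESHOLD


def cap_per_pattern(samples: Iterable[Mapping], max_per_pattern: int = 50) -> List[Mapping]:
    """Keep at most `max_per_pattern` samples per tool pattern, in input order."""
    seen: Counter = Counter()
    kept = []
    for sample in samples:
        pattern = tuple(sample["tools_used"])  # assumed definition of a pattern
        if seen[pattern] < max_per_pattern:
            seen[pattern] += 1
            kept.append(sample)
    return kept
```

Under these assumptions, filtering rows would amount to calling `passes_threshold` on each row's factor values and then `cap_per_pattern` on the survivors. Whether the actual pipeline applies the two steps in that order is not stated in the card.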

Provider:
DJLougen