
DJLougen/Acta

Hugging Face | Updated 2026-04-21 | Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/DJLougen/Acta
Description:
---
language:
- en
license: apache-2.0
library_name: datasets
tags:
- agentic
- tool-use
- function-calling
- curated
- high-quality
- conversations
size_categories:
- 1K<n<10K
---

# Acta

*Acta* (Latin: "acts, deeds, records") - A premium curated sample of high-quality agentic tool-use conversations, filtered using an 8-factor quality model based on statistical correlation analysis of diversity metrics.

## Overview

This public sample contains **5,000 conversations** from the full 37K Acta dataset. It was created by analyzing 5 major agentic datasets (FlameFox, Hermes Reasoning, Hermes Multi-turn, Smolagents, SWE Agent) and applying evidence-based quality filters to maximize training signal density.

**Key Statistics:**

- **Train:** 4,500 samples
- **Validation:** 500 samples
- **Avg semantic diversity:** 0.53
- **Avg lexical diversity:** 0.47
- **Sources:** 5 datasets harmonized

## The 8-Factor Quality Model

Each sample was scored using 8 factors weighted by their correlation with training quality:

| Factor | Weight | Threshold | Rationale |
|--------|--------|-----------|-----------|
| Lexical Diversity | 0.25 | >=0.35 | unique_words / total_words (r=+0.967 with quality) |
| Semantic Diversity | 0.20 | >=0.40 | content richness excluding stopwords (r=+0.947) |
| Verb Uniqueness | 0.15 | >=0.01 | unique_verbs / total_verbs (r=+0.852) |
| Turn Efficiency | 0.15 | 2-12 turns | Optimal conversation length (longer = repetitive) |
| Tool Pattern Novelty | 0.10 | - | Penalty for over-represented tool sequences |
| Reasoning Density | 0.08 | - | Thinking blocks per assistant turn |
| Role Balance | 0.05 | - | User/assistant turn ratio |
| Content Brevity | 0.02 | - | Information density (shorter = denser signal) |

## Source Distribution

| Source | Samples | Avg Quality | Characteristics |
|--------|---------|-------------|-----------------|
| FlameFox Agentic | ~2,600 | 0.819 | Highest diversity, balanced tools |
| Hermes Reasoning | ~1,400 | 0.598 | Good semantic diversity |
| Hermes Multi-turn | ~450 | 0.534 | Multi-turn, deduplicated |
| Smolagents Code | ~70 | 0.592 | Low redundancy |
| SWE Agent GLM | ~20 | 0.433 | Shortest, least repetitive traces |

## Usage

```python
from datasets import load_dataset

# Load the curated sample
dataset = load_dataset("DJLougen/Acta")

# Access quality metrics
sample = dataset["train"][0]
print(sample["quality_score"])       # 0.0 - 1.0
print(sample["semantic_diversity"])  # 0.546
print(sample["lexical_diversity"])   # 0.400
```

## Key Findings

1. **Quality ≠ Quantity** - 37K curated samples > 200K raw samples
2. **Lexical diversity is the strongest quality predictor** (r=+0.967)
3. **More content ≠ better signal** - SWE Agent has 5x more text but 3x lower quality
4. **4-10 turns optimal** - longer conversations become repetitive (r=-0.982)
5. **Tool redundancy is rampant** - some patterns repeat 3000+ times in raw data

## Citation

```bibtex
@dataset{acta_2026,
  title  = {Acta: A Quality-Curated Agentic Tool-Use Dataset},
  author = {Lougen, Daniel},
  year   = {2026},
  url    = {https://huggingface.co/datasets/DJLougen/Acta}
}
```

## License

Apache 2.0

## Acknowledgments

Source datasets:

- [FlameF0X/agentic-code](https://huggingface.co/datasets/FlameF0X/agentic-code)
- [interstellarninja/hermes_reasoning_tool_use](https://huggingface.co/datasets/interstellarninja/hermes_reasoning_tool_use)
- [interstellarninja/tool-use-multiturn-reasoning](https://huggingface.co/datasets/interstellarninja/tool-use-multiturn-reasoning)
- [smolagents/codeagent-traces](https://huggingface.co/datasets/smolagents/codeagent-traces)
- [DCAgent/neulab-nebius-swe-agent-trajectories-sandboxes_glm_4.7_traces_jupiter](https://huggingface.co/datasets/DCAgent/neulab-nebius-swe-agent-trajectories-sandboxes_glm_4.7_traces_jupiter)

---

**Full 37K proprietary dataset:** [DJLougen/Acta-Proprietary](https://huggingface.co/datasets/DJLougen/Acta-Proprietary)
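
The 8-factor table gives weights, a few hard thresholds, and a formula for lexical diversity (unique_words / total_words), but the card does not publish the scoring code. The sketch below is one possible reading of that table, not the author's implementation: the tokenizer, the function names (`lexical_diversity`, `passes_thresholds`, `weighted_quality`), and the assumption that every factor is already normalized to the range 0 to 1 before weighting are all hypothetical.

```python
import re
from typing import Mapping

# Weights copied from the 8-factor table in the dataset card.
# ASSUMPTION: each factor score is pre-normalized to [0, 1]; the card
# does not say how the individual factors are scaled.
WEIGHTS = {
    "lexical_diversity": 0.25,
    "semantic_diversity": 0.20,
    "verb_uniqueness": 0.15,
    "turn_efficiency": 0.15,
    "tool_pattern_novelty": 0.10,
    "reasoning_density": 0.08,
    "role_balance": 0.05,
    "content_brevity": 0.02,
}


def lexical_diversity(text: str) -> float:
    """unique_words / total_words, per the table's definition.

    ASSUMPTION: words are lowercase runs of letters, digits, and
    apostrophes; the real tokenizer is not documented.
    """
    words = re.findall(r"[a-z0-9']+", text.lower())
    return len(set(words)) / len(words) if words else 0.0


def passes_thresholds(factors: Mapping[str, float], num_turns: int) -> bool:
    """Apply the four hard thresholds listed in the table."""
    return (
        factors.get("lexical_diversity", 0.0) >= 0.35
        and factors.get("semantic_diversity", 0.0) >= 0.40
        and factors.get("verb_uniqueness", 0.0) >= 0.01
        and 2 <= num_turns <= 12
    )


def weighted_quality(factors: Mapping[str, float]) -> float:
    """Weighted sum of the factor scores; missing factors count as 0."""
    return sum(w * factors.get(name, 0.0) for name, w in WEIGHTS.items())


if __name__ == "__main__":
    example = {
        "lexical_diversity": lexical_diversity("the cat saw the dog"),
        "semantic_diversity": 0.5,
        "verb_uniqueness": 0.3,
    }
    print(passes_thresholds(example, num_turns=6))
    print(round(weighted_quality(example), 3))
```

Because the weights sum to 1.0, a sample that scores 1.0 on every factor gets an overall score of 1.0 under this reading, which is consistent with the 0.0 - 1.0 range noted for `quality_score` in the Usage section.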

Provider:
DJLougen