gaia2_filesystem

Name: gaia2_filesystem
Creator: maas
Published: 2026-01-06 16:46:58
License: No description provided

ModelScope community. Updated 2026-01-06; indexed 2025-09-27.

Download link:

https://modelscope.cn/datasets/meta-agents-research-environments/gaia2_filesystem

Description:

# GAIA2 Filesystem

This dataset contains the files for the GAIA2 benchmark. Do not use it on its own; instead, use the [Meta Agents Research Environments](https://github.com/facebookresearch/meta-agents-research-environments) framework to execute scenarios from the [GAIA2 dataset](https://huggingface.co/datasets/meta-agents-research-environments/gaia2).

## Dataset Link

[https://huggingface.co/datasets/meta-agents-research-environments/gaia2](https://huggingface.co/datasets/meta-agents-research-environments/gaia2)

## Contact Details

**Publishing POC:** Meta AI Research Team

**Affiliation:** Meta Platforms, Inc.

**Website:** [https://github.com/facebookresearch/meta-agents-research-environments](https://github.com/facebookresearch/meta-agents-research-environments)

## Authorship

**Publishers:** Meta AI Research Team

**Dataset Owners:** Meta Platforms, Inc.

**Funding Sources:** Meta Platforms, Inc.

## Dataset Overview

**Sensitivity of Data:** The dataset contains simulated scenarios with fictional user data, contacts, messages, and interactions, extended with professional annotations. No real personally identifiable information (PII) is intentionally included. All data is synthetically generated for research purposes.

**Dataset Version:** 1.0

**Maintenance:** The dataset is maintained by the Meta AI Research team, with periodic updates for bug fixes and improvements.

## Motivations & Intentions

**Motivations:** GAIA2 was created to address gaps in AI agent evaluation, specifically the lack of dynamic, time-aware, and multi-agent collaborative scenarios in existing benchmarks. Most benchmarks focus on static tasks.

**Intended Use:** The dataset is designed for:

- Research on AI agent capabilities
- Benchmarking agent performance across multiple dimensions
- Academic research on multi-agent systems
- Development and evaluation of AI assistants
- Comparative studies of agent architectures

## Access, Retention, & Wipeout

The data is released under CC BY 4.0 and is intended for benchmarking purposes only. Most files are outputs of Llama 3.3 and Llama 4 Maverick and are subject to the respective licenses ([Llama 3.3 license](https://github.com/meta-llama/llama-models/blob/main/models/llama3_3/LICENSE); [Llama 4 License](https://github.com/meta-llama/llama-models/blob/main/models/llama4/LICENSE)). If you use this portion of the data to create, train, fine tune, or otherwise improve an AI model, which is distributed or made available, you shall also include “Llama” at the beginning of any such AI model name. Third-party content pulled from other locations is subject to its own licenses, and you may have other legal obligations or restrictions that govern your use of that content. Some files come from a subset of Wikipedia and are licensed under [Wikipedia's own license](https://en.wikipedia.org/wiki/Wikipedia:Copyrights).

**Wipeout & Deletion:** As the dataset contains only synthetic data, no personal data deletion procedures are required.

## Provenance

**Collection Method:** Scenarios were created through human annotation using a specialized GUI and graph editor within the Meta Agents Research Environments framework. Professional annotators created scenarios following detailed guidelines for each capability category. These scenarios were built on top of entirely generated universes.

**Collection Criteria:** Scenarios were designed to be:

- Solvable using available apps and content within Meta Agents Research Environments universes
- Specific, with exactly one correct solution for reliable verification
- Challenging, requiring reasoning and multi-step execution
- Realistic, based on authentic user interactions

**Relationship to Source:** All scenarios are original creations designed specifically for the GAIA2 benchmark, built within 10 distinct Meta Agents Research Environments universes with pre-populated data. A small sample of Wikipedia articles is included in these universes.

**Version:** Initial release version 1.0

## Human and Other Sensitive Attributes

**Attribute Identification:** The dataset contains fictional demographic information (age, location) and simulated personal interactions (messages, contacts, calendar events) as part of the scenario context. No real human attributes or sensitive information are included.

**Mitigation Strategies:** All data is synthetically generated. Annotators were instructed to exclude sensitive topics and personally identifiable information during scenario creation.

## Extended Use

**Use with Other Data:** GAIA2 can be combined with other agent evaluation benchmarks for assessment. It complements web-based benchmarks like the original GAIA.

**Forking & Sampling:** Researchers may create derivative datasets or sample subsets. The dataset includes a "mini" configuration with 200 representative scenarios for faster evaluation. Ground-truth data is available for the `validation` split of the dataset. Please help us keep this benchmark strong by not training on this evaluation data. We encourage others to use the Meta Agents Research Environments framework to develop more evaluation and training data for agents within its simulated environment.

**Use in ML or AI Systems:** Designed for evaluating AI agents and language models. Includes automated verification systems and judge-based evaluation for development feedback.

## Transformations

**Synopsis:** Raw annotated scenarios undergo cleaning and preprocessing to remove oracle events, hints, and metadata not needed for agent evaluation while preserving the core scenario structure.

**Breakdown:**

- Removal of oracle events from the events array for test scenarios
- Cleaning of annotation metadata (annotator details, validation comments)
- Preprocessing for execution without oracle guidance
- Preservation of scenario structure and validation criteria
- Maintenance of temporal constraints and event dependencies

## Annotations & Labeling

**Process Description:** Scenarios were annotated by professional vendors following a multi-stage process with quality assurance at both vendor and research team levels.

**Human Annotators:** Professional annotators with training on the Meta Agents Research Environments framework and specific capability requirements. Each scenario underwent validation by multiple independent annotators. The annotation process included:

1. Initial scenario creation by Annotator A
2. Independent validation by Annotator B without seeing A's solution
3. Third validation by Annotator C
4. Final review by Annotator D to confirm consistency across all solutions

## Validation Types

**Description of Human Validators:** Multiple layers of human validation were employed:

- Vendor-side quality assurance with multi-annotator validation
- Research team internal QA to identify and resolve issues
- Automated pre-QA guardrails to prevent invalid scenario structures
- Post-QA evaluation using model success rates to identify problematic scenarios

## Sampling Methods

**Sampling Methods:** Scenarios were systematically created across 10 different Meta Agents Research Environments universes to ensure diversity. Equal representation across capability categories was maintained, with 160 scenarios per core capability (Execution, Search, Adaptability, Time, Ambiguity) and a representative sample of each capability's scenarios for the augmentation capabilities (Agent2Agent, App/Environment Noise).

## Citation

If you use Meta Agents Research Environments in your work, please cite:

```bibtex
TODO
```
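
As an illustration of the cleaning steps listed under Transformations, the following is a minimal sketch of how oracle events and annotation metadata might be stripped from a scenario before evaluation. The card does not document the scenario schema, so every field name here (`events`, `event_type`, `depends_on`, `metadata`, `annotator_details`, `validation_comments`) is a hypothetical placeholder, not the actual GAIA2 format or the framework's own preprocessing code.

```python
from copy import deepcopy

# Hypothetical metadata keys; the real GAIA2 scenario schema may differ.
ANNOTATION_KEYS = ("annotator_details", "validation_comments")


def strip_for_evaluation(scenario: dict) -> dict:
    """Return a copy of `scenario` without oracle events or annotation metadata."""
    cleaned = deepcopy(scenario)  # leave the raw annotated scenario untouched
    # Drop oracle events so the agent runs without oracle guidance.
    # Remaining events keep their timestamps and dependencies as-is.
    cleaned["events"] = [
        event
        for event in cleaned.get("events", [])
        if event.get("event_type") != "oracle"
    ]
    # Remove annotator details and validation comments, keep everything else.
    metadata = cleaned.get("metadata", {})
    for key in ANNOTATION_KEYS:
        metadata.pop(key, None)
    return cleaned


# Toy scenario, invented for this example only.
raw_scenario = {
    "scenario_id": "example-001",
    "events": [
        {"event_id": "e1", "event_type": "user", "time": 0, "depends_on": []},
        {"event_id": "e2", "event_type": "oracle", "time": 5, "depends_on": ["e1"]},
        {"event_id": "e3", "event_type": "env", "time": 10, "depends_on": ["e1"]},
    ],
    "metadata": {
        "annotator_details": "annotator A",
        "validation_comments": "checked",
        "capability": "Time",
    },
}

cleaned_scenario = strip_for_evaluation(raw_scenario)
print([e["event_id"] for e in cleaned_scenario["events"]])  # ['e1', 'e3']
print(cleaned_scenario["metadata"])  # {'capability': 'Time'}
```

The sketch works on a deep copy so the raw scenario stays available, and it leaves non-oracle events unmodified, which is one simple way to preserve temporal constraints and event dependencies.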

Provider: maas
Created: 2025-09-26
