gaia2

Name: gaia2
Creator: maas
Published: 2026-01-06 16:46:58
License: No description available

ModelScope community: updated 2026-01-06, indexed 2025-09-27

Download link:

https://modelscope.cn/datasets/meta-agents-research-environments/gaia2

Resource overview:

# Gaia2

[Paper](https://huggingface.co/papers/2509.17158) | [Code](https://github.com/facebookresearch/meta-agents-research-environments) | [Project Page](https://facebookresearch.github.io/meta-agents-research-environments/)

## Dataset Summary

Gaia2 is a benchmark dataset for evaluating AI agent capabilities in simulated environments. The dataset contains 800 scenarios that test agent performance in environments where time flows continuously and events occur dynamically.

The dataset evaluates seven core capabilities: Execution (multi-step planning and state changes), Search (information gathering and synthesis), Adaptability (dynamic response to environmental changes), Time (temporal reasoning and scheduling), Ambiguity (handling unclear or impossible tasks), Agent2Agent (multi-agent collaboration), and Noise (robustness to environmental instability). The benchmark includes temporal constraints, dynamic environment events, and multi-agent collaboration scenarios.

## Dataset Link

[https://huggingface.co/datasets/meta-agents-research-environments/gaia2](https://huggingface.co/datasets/meta-agents-research-environments/gaia2)

### Getting Started

| | |
|---|---|
| **[Gaia2 Evaluation](https://facebookresearch.github.io/meta-agents-research-environments/user_guide/gaia2_evaluation.html)** | Build and evaluate your agents on the Gaia2 benchmark, a comprehensive suite of 800 dynamic scenarios across 10 universes. |
| **[Gaia2 Leaderboard](https://huggingface.co/spaces/meta-agents-research-environments/leaderboard)** | Check the self-published results from Gaia2 Benchmark runs. |
| **[Gaia2 Blog Post](https://huggingface.co/blog/gaia2)** | Learn more about Gaia2 on the Hugging Face blog. |
| **[Paper](https://huggingface.co/papers/2509.17158)** | Read the research paper detailing the Gaia2 benchmark and evaluation methodology. |
| **[Learn More](https://facebookresearch.github.io/meta-agents-research-environments/foundations/index.html)** | Dive deeper into the core concepts of agents, environments, apps, events, and scenarios. |
| **[Demo](https://huggingface.co/spaces/meta-agents-research-environments/demo)** | [Try the ARE Demo on Hugging Face](https://huggingface.co/spaces/meta-agents-research-environments/demo): play around with the agent platform directly in your browser, no installation required! |

## Contact Details

**Publishing POC:** Meta AI Research Team

**Affiliation:** Meta Platforms, Inc.

**Website:** [https://github.com/facebookresearch/meta-agents-research-environments](https://github.com/facebookresearch/meta-agents-research-environments)

## Authorship

**Publishers:** Meta AI Research Team

**Dataset Owners:** Meta Platforms, Inc.

**Funding Sources:** Meta Platforms, Inc.

## Dataset Overview

**Sensitivity of Data:** The dataset contains simulated scenarios with fictional user data, contacts, messages, and interactions, extended with professional annotations. No real personally identifiable information (PII) is intentionally included. All data is synthetically generated for research purposes.

**Dataset Version:** 1.0

**Maintenance:** The dataset is maintained by the Meta AI Research team with periodic updates for bug fixes and improvements.

## Example of Data Points

Each data point represents a scenario with the following structure:

```json
{
  "id": "0000_00000000000000000000000000000000",
  "scenario_id": "scenario_universe_00_id",
  "split": "validation",
  "data": {
    "metadata": {
      "definition": {...}
    },
    "apps": {...},
    "events": [...],
  }
}
```

For detailed specifications of the complete JSON format structure, including all nested fields and data types, refer to the [API Reference documentation](https://facebookresearch.github.io/meta-agents-research-environments/api_reference/json_format.html).
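To make the record layout above concrete, here is a minimal sketch that summarizes one record using only the top-level fields shown in the example (`id`, `scenario_id`, `split`, `data`). The helper name `summarize_record` and the placeholder record are hypothetical, and the sketch assumes that `data` may arrive either as a JSON string or as an already parsed object. It does not rely on any nested field beyond `metadata.definition`, `apps`, and `events`.

```python
import json


def summarize_record(record):
    """Return a small summary of one Gaia2-style record (hypothetical helper)."""
    data = record["data"]
    # Assumption: "data" may be a JSON string or an already parsed dict.
    if isinstance(data, str):
        data = json.loads(data)
    return {
        "id": record["id"],
        "scenario_id": record["scenario_id"],
        "split": record["split"],
        # len() works whether "apps" is a mapping or a list.
        "num_apps": len(data.get("apps", {})),
        "num_events": len(data.get("events", [])),
        "has_definition": "definition" in data.get("metadata", {}),
    }


# Placeholder record mirroring the structure above; all values are made up.
example_record = {
    "id": "0000_00000000000000000000000000000000",
    "scenario_id": "scenario_universe_00_id",
    "split": "validation",
    "data": json.dumps(
        {
            "metadata": {"definition": {}},
            "apps": {"ExampleApp": {}},
            "events": [{"event_id": "e1"}, {"event_id": "e2"}],
        }
    ),
}

summary = summarize_record(example_record)
print(summary)
```

A summary like this can serve as a quick sanity check before handing records to the framework; the authoritative field list remains the API Reference.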
We recommend using the Meta Agents Research Environments framework to execute scenarios and verify their correctness. The framework is a core part of using this dataset and is available at [https://github.com/facebookresearch/meta-agents-research-environments](https://github.com/facebookresearch/meta-agents-research-environments).

## Motivations & Intentions

**Motivations:** Gaia2 was created to address gaps in AI agent evaluation, specifically the lack of dynamic, time-aware, and multi-agent collaborative scenarios in existing benchmarks. Most benchmarks focus on static tasks.

**Intended Use:** The dataset is designed for:

- Research on AI agent capabilities
- Benchmarking agent performance across multiple dimensions
- Academic research on multi-agent systems
- Development and evaluation of AI assistants
- Comparative studies of agent architectures

## Access, Retention, & Wipeout

**Access Policy:** The Data is released CC-by 4.0 and is intended for benchmarking purposes only. The synthetic data are outputs of Llama 3.3 and Llama 4 Maverick and subject to the respective licenses ([Llama 3.3 license](https://github.com/meta-llama/llama-models/blob/main/models/llama3_3/LICENSE); [Llama 4 License](https://github.com/meta-llama/llama-models/blob/main/models/llama4/LICENSE)). If you use this portion of the data to create, train, fine tune, or otherwise improve an AI model, which is distributed or made available, you shall also include “Llama” at the beginning of any such AI model name. Third party content pulled from other locations are subject to its own licenses and you may have other legal obligations or restrictions that govern your use of that content.

**Wipeout & Deletion:** As the dataset contains only synthetic data, no personal data deletion procedures are required.

## Provenance

**Collection Method:** Scenarios were created through human annotation using a specialized GUI and graph editor within the Meta Agents Research Environments framework.
Professional annotators created scenarios following detailed guidelines for each capability category. These scenarios were built on top of entirely generated universes.

**Collection Criteria:** Scenarios were designed to be:

- Solvable using available apps and content within Meta Agents Research Environments universes
- Specific, with exactly one correct solution for reliable verification
- Challenging, requiring reasoning and multi-step execution
- Realistic, based on authentic user interactions

**Relationship to Source:** All scenarios are original creations designed specifically for the Gaia2 benchmark, built within 10 distinct Meta Agents Research Environments universes with pre-populated data. A small sample of Wikipedia articles is included in these universes.

**Version:** Initial release version 1.0

## Human and Other Sensitive Attributes

**Attribute Identification:** The dataset contains fictional demographic information (age, location) and simulated personal interactions (messages, contacts, calendar events) as part of the scenario context. No real human attributes or sensitive information are included.

**Mitigation Strategies:** All data is synthetically generated. Annotators were instructed to exclude sensitive topics and personally identifiable information during scenario creation.

## Extended Use

**Use with Other Data:** Gaia2 can be combined with other agent evaluation benchmarks for assessment. It complements web-based benchmarks like the original GAIA.

**Forking & Sampling:** Researchers may create derivative datasets or sample subsets. The dataset includes a "mini" configuration with 200 representative scenarios for faster evaluation. The ground truth data is available for the `validation` split of the dataset. Please help us keep this benchmark strong by not training on this evaluation data. We encourage others to use the Meta Agents Research Environments framework to develop more evaluation and training data for agents within its simulated environment.
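For readers who want to draw their own subset, the following is a minimal sketch of a seeded, per-capability sample. It is an illustration only and not the procedure used to build the `mini` configuration. The function name `sample_subset`, the scenario identifiers, and the pool sizes are all hypothetical; only the five capability names come from this document.

```python
import random


def sample_subset(ids_by_capability, per_capability, seed=0):
    """Pick the same number of scenario ids from each capability, reproducibly."""
    rng = random.Random(seed)  # local generator so the global state is untouched
    subset = {}
    for capability in sorted(ids_by_capability):
        ids = sorted(ids_by_capability[capability])  # fixed order before sampling
        k = min(per_capability, len(ids))
        subset[capability] = sorted(rng.sample(ids, k))
    return subset


# Hypothetical pools of scenario ids, ten per capability.
capabilities = ["execution", "search", "adaptability", "time", "ambiguity"]
ids_by_capability = {
    cap: [f"{cap}_{i:03d}" for i in range(10)] for cap in capabilities
}

subset = sample_subset(ids_by_capability, per_capability=3, seed=42)
for capability, ids in subset.items():
    print(capability, ids)
```

Fixing the seed and sorting the ids before sampling keeps the subset identical across runs, which makes results on a custom subset comparable between experiments.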
**Use in ML or AI Systems:** Designed for evaluating AI agents and language models. Includes automated verification systems and judge-based evaluation for development feedback.

## Transformations

**Synopsis:** Raw annotated scenarios undergo cleaning and preprocessing to remove oracle events, hints, and metadata not needed for agent evaluation while preserving the core scenario structure.

**Breakdown:**

- Removal of oracle events from the events array for test scenarios
- Cleaning of annotation metadata (annotator details, validation comments)
- Preprocessing for execution without oracle guidance
- Preservation of scenario structure and validation criteria
- Maintenance of temporal constraints and event dependencies

## Annotations & Labeling

**Process Description:** Scenarios were annotated by professional vendors following a multi-stage process with quality assurance at both vendor and research team levels.

**Human Annotators:** Professional annotators with training on the Meta Agents Research Environments framework and specific capability requirements. Each scenario underwent validation by multiple independent annotators. The annotation process included:

1. Initial scenario creation by Annotator A
2. Independent validation by Annotator B without seeing A's solution
3. Third validation by Annotator C
4. Final review by Annotator D to confirm consistency across all solutions

## Validation Types

**Description of Human Validators:** Multiple layers of human validation were employed:

- Vendor-side quality assurance with multi-annotator validation
- Research team internal QA to identify and resolve issues
- Automated pre-QA guardrails to prevent invalid scenario structures
- Post-QA evaluation using model success rates to identify problematic scenarios

## Sampling Methods

**Sampling Methods:** Scenarios were systematically created across 10 different Meta Agents Research Environments universes to ensure diversity.
Equal representation across capability categories was maintained, with 160 scenarios per core capability (Execution, Search, Adaptability, Time, Ambiguity) and a representative sample of each capability's scenarios for augmentation capabilities (Agent2Agent, App/Environment Noise).

## How to Use the Dataset

Gaia2 is designed to be used with the Meta Agents Research Environments framework for comprehensive agent evaluation. The dataset supports both development and leaderboard evaluation workflows.

### Installation and Setup

For a more streamlined experience, you can use `uvx` to run commands directly without any installation:

```shell
# Run commands directly with uvx (no installation needed)
uvx --from meta-agents-research-environments are-benchmark --help
```

If you would rather install locally, we recommend setting up an environment with conda or venv and then installing the Meta Agents Research Environments framework:

```shell
# Recommended: Using uv (faster and more reliable)
uv pip install meta-agents-research-environments

# Alternative: Using pip
pip install meta-agents-research-environments
```

To use the Gaia2 dataset and upload your results to the leaderboard, you will also need to log in to Hugging Face to access the dataset (first install the Hugging Face CLI):

```shell
huggingface-cli login
```

Check the documentation on how to configure your model provider. Gaia2 supports various models through LiteLLM integration.

### Dataset Structure

Gaia2 contains a single validation split of 800 scenarios with oracle events for development and leaderboard submission (includes ground truth).
The dataset is organized into capability-specific configurations:

- `execution`: Multi-step planning and state-changing operations (200 scenarios)
- `search`: Information gathering and synthesis (200 scenarios)
- `adaptability`: Dynamic response to environmental changes (200 scenarios)
- `time`: Temporal reasoning and scheduling (200 scenarios)
- `ambiguity`: Handling unclear or impossible tasks (200 scenarios)
- `mini`: Representative subset across all capabilities (200 scenarios)

### Development Workflow

**1\. Validation Phase**

Start with validation scenarios to test your setup and iterate on your agent:

```shell
# Test with a small subset first
uvx --from meta-agents-research-environments are-benchmark run --hf-dataset meta-agents-research-environments/gaia2 \
  --hf-split validation --hf-config mini \
  --model your-model --provider your-provider \
  --agent default --limit 20 \
  --output_dir ./validation_results
```

**2\. Capability-Specific Testing**

Focus on specific capabilities for targeted development:

```shell
# Test execution capabilities
uvx --from meta-agents-research-environments are-benchmark run --hf-dataset meta-agents-research-environments/gaia2 \
  --hf-split validation --hf-config execution \
  --model your-model --provider your-provider \
  --agent default --limit 10
```

**3\. Multi-Agent and Noise Testing**

Test advanced scenarios with agent-to-agent collaboration and environmental noise:

```shell
# Enable Agent2Agent mode (agents communicate with other agents)
uvx --from meta-agents-research-environments are-benchmark run --hf-dataset meta-agents-research-environments/gaia2 \
  --hf-split validation --hf-config mini \
  --model your-model --provider your-provider \
  --agent default --a2a_app_prop 1.0

# Enable noise augmentation for robustness testing
uvx --from meta-agents-research-environments are-benchmark run --hf-dataset meta-agents-research-environments/gaia2 \
  --hf-split validation --hf-config mini \
  --model your-model --provider your-provider \
  --agent default --noise
```

### Official Evaluation and Leaderboard Submission

**Complete Gaia2 Evaluation**

Use the dedicated `gaia2-run` command for leaderboard evaluation:

```shell
# Full Gaia2 test evaluation with automatic upload
uvx --from meta-agents-research-environments are-benchmark gaia2-run --hf-dataset meta-agents-research-environments/gaia2 \
  --model your-model --provider your-provider \
  --agent default \
  --output_dir ./gaia2_submission_results \
  --hf_upload your-org/gaia2-submission-traces
```

This command automatically:

- Runs all capability configurations (execution, search, adaptability, time, ambiguity)
- Executes three evaluation phases: standard, Agent2Agent, and noise
- Forces 3 runs per scenario for variance analysis
- Generates submission-ready traces for the leaderboard

**Leaderboard Submission Process**

1. Visit the [Gaia2 Leaderboard](https://huggingface.co/spaces/meta-agents-research-environments/leaderboard)
2. Login with your HuggingFace account
3. Provide your dataset name containing the traces
4. Submit for automated evaluation against hidden oracle events

### Visual Exploration with the GUI

The Meta Agents Research Environments framework includes a graphical user interface that allows you to visually explore scenarios, examine their structure, and understand the evaluation process. This is particularly useful for understanding how scenarios work before running automated evaluations.

**Starting the GUI**

Launch the GUI with your model configuration:

```shell
uvx --from meta-agents-research-environments are-gui -a default --model your-model --provider your-provider
```

**Loading Gaia2 Scenarios**

Follow these steps to explore Gaia2 scenarios in the GUI:

1. **Navigate to Scenarios Tab**: Click on the "Scenarios" tab in the interface

   ![Navigate to Scenarios Tab](./step1_scenarios_tab.png)

2. **Load Scenarios**: Click the "Load Scenarios" button

   ![Load Scenarios Button](./step2_load_scenarios.png)

3. **Select HuggingFace Source**: Choose "HuggingFace" as the data source

   ![Select HuggingFace Source](./step3_huggingface_source.png)

4. **Choose Gaia2 Dataset**: Select "Gaia2" from the available datasets
5. **Select Configuration and Split**: Choose a capability (e.g., "execution", "search", "mini") and split ("validation")
6. **Browse Scenarios**: Select any scenario from the list to view its details

   ![Browse and Select Scenario](./step6_browse_scenarios.png)

The GUI provides a visual representation of:

- Scenario structure and initial state
- Event timeline and dependencies
- User messages and expected agent responses
- Universe context and available applications

![Scenario Apps Details View](./step7_scenario_details.png)

**Benefits of GUI Exploration**

- **Visual Understanding**: See how scenarios are structured and what events occur
- **Interactive Debugging**: Step through scenarios to understand failure points
- **Context Awareness**: Explore the simulated environment and available tools
- **Educational Value**: Learn how different capability types are designed and evaluated

### Loading the Dataset Programmatically

You can also load and work with the dataset directly using the Meta Agents Research Environments framework:

```py
from datasets import load_dataset
from are.simulation.data_handler.importer import JsonScenarioImporter

# Load the dataset
dataset = load_dataset("meta-agents-research-environments/gaia2")

# Load specific configuration
execution_data = load_dataset("meta-agents-research-environments/gaia2", name="execution", split="validation")

# Load mini subset for quick testing
mini_data = load_dataset("meta-agents-research-environments/gaia2", name="mini", split="validation")

# Initialize the importer
importer = JsonScenarioImporter()

# Access individual scenarios and load them as benchmark scenarios
for scenario in mini_data:
    scenario_id = scenario["scenario_id"]
    scenario_data = scenario["data"]

    # Load scenario using the from_benchmark API
    benchmark_scenario, completed_events, world_logs = importer.import_from_json_to_benchmark(
        json_str=scenario_data
    )
    print(f"Loaded scenario {benchmark_scenario.scenario_id}")
    print(f"Number of completed events: {len(completed_events)}")
    print(f"Number of world logs: {len(world_logs)}")
```

### Evaluation Metrics

Gaia2 provides
comprehensive evaluation metrics:

- **Overall Success Rate**: Percentage of successful runs across all capabilities
- **Per-Capability Breakdown**: Success rates for each of the seven capabilities
- **Variance Analysis**: Statistical measures including pass@3, always succeed/fail rates
- **Hierarchical Statistics**: Within-sample and between-sample standard deviations

### Example Scenarios by Capability

**Execution**: "Update all my contacts aged 24 or younger to be one year older than they are currently."

**Search**: "Which city do most of my friends live in? I consider any contact who I have at least one 1-on-1 conversation with on ChatsApp a friend."

**Adaptability**: "Meet my friend to view a property. If she replies to suggest another property or time, please replace it with her suggestion."

**Time**: "Send ChatsApp messages to colleagues. If after 3 minutes there is no response, order a default cab."

**Ambiguity**: "Schedule a 1h Yoga event each day at 6:00 PM from October 16-21, 2024. Ask me in case there are conflicts."

### Best Practices

1. **Start Small**: Begin with validation split and limited scenarios to test your setup
2. **Use Mini Config**: The mini configuration provides representative scenarios across all capabilities
3. **Multiple Runs**: Run scenarios multiple times (default: 3) for statistical confidence
4. **Judge System**: Leverage the built-in judge system for immediate feedback during development
5. **Variance Analysis**: Pay attention to consistency metrics to understand agent reliability

For detailed documentation and advanced usage, visit the [Meta Agents Research Environments documentation](https://github.com/facebookresearch/meta-agents-research-environments).

## Terms of Art

**Concepts and Definitions:**

- **Meta Agents Research Environments:** Simulated Interactive Multi-agent Systems framework
- **Universe:** A simulated user environment with specific data (e.g. contacts, messages), and events
- **Scenario:** A time-based simulation with events, tasks, and validation criteria
- **Oracle Events:** Ground truth events used for automated verification
- **Capability Categories:** Seven core dimensions of agent evaluation (Execution, Search, Adaptability, Time, Ambiguity, Agent2Agent, Noise)
- **Dynamic Environment Events:** Time-dependent events that modify world state during scenario execution
- **Agent2Agent:** Multi-agent collaboration scenarios where agents interact with other agents representing applications

## Citation

If you use Meta Agents Research Environments in your work, please cite:

```bibtex
@misc{andrews2025arescalingagentenvironments,
  title={ARE: Scaling Up Agent Environments and Evaluations},
  author={Pierre Andrews and Amine Benhalloum and Gerard Moreno-Torres Bertran and Matteo Bettini and Amar Budhiraja and Ricardo Silveira Cabral and Virginie Do and Romain Froger and Emilien Garreau and Jean-Baptiste Gaya and Hugo Laurençon and Maxime Lecanu and Kunal Malkan and Dheeraj Mekala and Pierre Ménard and Grégoire Mialon and Ulyana Piterbarg and Mikhail Plekhanov and Mathieu Rita and Andrey Rusakov and Thomas Scialom and Vladislav Vorotilov and Mengjue Wang and Ian Yu},
  year={2025},
  eprint={2509.17158},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2509.17158},
}
```
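As a supplement to the Evaluation Metrics section, the sketch below shows one plausible way to compute run-level consistency numbers from repeated runs. It is not the framework's own implementation. It assumes that pass@3 means "at least one of the three runs of a scenario succeeded", and that the always-succeed and always-fail rates are the shares of scenarios where every run or no run succeeded. The function name `variance_summary` and the toy outcomes are hypothetical.

```python
def variance_summary(runs_by_scenario):
    """Summarize repeated run outcomes (True = success) per scenario."""
    num_scenarios = len(runs_by_scenario)
    total_runs = sum(len(runs) for runs in runs_by_scenario.values())
    total_successes = sum(sum(runs) for runs in runs_by_scenario.values())
    return {
        # Share of individual runs that succeeded.
        "success_rate": total_successes / total_runs,
        # Assumed definition: a scenario passes if any of its runs succeeded.
        "pass_at_k": sum(any(r) for r in runs_by_scenario.values()) / num_scenarios,
        # Scenarios where every run succeeded.
        "always_succeed": sum(all(r) for r in runs_by_scenario.values()) / num_scenarios,
        # Scenarios where no run succeeded.
        "always_fail": sum(not any(r) for r in runs_by_scenario.values()) / num_scenarios,
    }


# Toy outcomes for four hypothetical scenarios, three runs each.
runs = {
    "scenario_a": [True, True, True],
    "scenario_b": [True, False, False],
    "scenario_c": [False, False, False],
    "scenario_d": [False, True, True],
}

metrics = variance_summary(runs)
print(metrics)
```

Comparing `pass_at_k` with `always_succeed` gives a quick read on reliability: a large gap means the agent can solve many scenarios but does not do so consistently.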

Provider:

maas

Created:

2025-09-26
