
xlangai/computer-agent-arena

Source: Hugging Face | Updated: 2025-09-03 | Indexed: 2026-04-05
Download link:
https://hf-mirror.com/datasets/xlangai/computer-agent-arena
Description:
---
language:
- en
license: cc-by-4.0
tags:
- agent
- multimodal
- computer-use
- GUI
- visual-agents
- evaluation
- benchmark
pretty_name: "Computer Agent Arena: Evaluating Computer-Use Agents via Crowdsourcing from Real Users"
size_categories:
- 1K<n<10K
---

# Computer Agent Arena: Evaluating Computer-Use Agents via Crowdsourcing from Real Users

## Dataset Description

Computer Agent Arena is a comprehensive evaluation platform for multi-modal AI agents, particularly focusing on computer use and GUI interaction tasks. This dataset contains real interaction trajectories from various state-of-the-art AI agents performing complex computer tasks in controlled environments.

The dataset includes:

- **4,641 agent trajectories** across diverse computer tasks
- **Multi-modal conversations** including text instructions, code actions, and visual observations
- **Human evaluations** of agent performance and task completion
- **Battle-style comparisons** between different agent systems

## Supported Tasks

- **Computer Use Automation**: Agents performing real computer tasks like file management, web browsing, and application usage
- **GUI Interaction**: Visual understanding and interaction with graphical user interfaces
- **Multi-modal Reasoning**: Combining visual perception with action planning
- **Agent Evaluation**: Comparative assessment of different agent architectures and capabilities

## Dataset Structure

### Files Overview

- `agent_arena_data.jsonl`: Main dataset file containing all trajectory data (48MB)
- `sessions.csv`: Metadata about evaluation sessions, agent configurations, and battle results (6.9MB)
- `images.tar.gz`: Compressed archive of all screenshots from agent interactions (17GB)

### Data Format

Each record in `agent_arena_data.jsonl` follows this structure:

```json
{
  "task_id": "unique_task_identifier",
  "instruction": "Task description given to the agent",
  "human_eval_correctness": 0,  // 0 or 1, indicating task success
  "model": "model_name (agent_method)",
  "traj": [
    {
      "index": 1,
      "image": "images/task_id_step_1.png",  // Screenshot path
      "value": {
        "thought": "Agent's reasoning about current step",
        "code": "pyautogui.click(x, y)"  // Action code
      }
    }
    // ... more trajectory steps
  ]
}
```

## Dataset Statistics

- **Total Trajectories**: 4,641
- **Avg Steps per Trajectory**: ~8-15 steps
- **Total Images**: ~65,000 screenshots

## Data Collection Methodology

The data was collected through Computer Agent Arena, a platform enabling:

1. **Controlled Evaluation Environment**: Standardized Ubuntu desktop with consistent software configurations
2. **Human-Agent Interaction**: Users provide natural language task instructions
3. **Multi-Agent Battles**: Head-to-head comparisons between different agent systems
4. **Human Evaluation**: Expert assessments of task completion and agent behavior quality
5. **Comprehensive Logging**: Full trajectory capture including thoughts, actions, and visual observations

## Ethical Considerations

- All data was collected from consenting participants in controlled environments
- No personal or sensitive information is included in trajectories
- Screenshots have been filtered to remove any potentially identifying information
- The platform focuses on capability evaluation rather than user surveillance

## Limitations

- **Environment Constraint**: All tasks performed in Ubuntu desktop environment
- **Task Scope**: Primarily focused on desktop/GUI interactions
- **Evaluation Subjectivity**: Human evaluations may contain subjective judgments
- **Model Versions**: Some agent models may have been updated since evaluation period
- **Language**: Instructions and interactions are primarily in English

## Citation

If you use this dataset in your research, please cite:

```bibtex
@misc{agent_arena_2025,
  title={Agent Arena: A Multi-Agent Multi-Modal Evaluation Platform},
  author={Agent Arena Team},
  year={2025},
  url={https://huggingface.co/datasets/agent-arena/agent-arena-data}
}
```

## License

This dataset is released under CC-BY-4.0 license. You are free to:

- Share and redistribute the material
- Adapt, remix, transform, and build upon the material
- Use for any purpose, including commercially

With attribution to the original creators.

## Contact

For questions about this dataset, please open an issue in the repository or contact the Computer Agent Arena team.

## Updates and Versions

- **v1.0** (2025): Initial release with 4,641 trajectories
- Future versions may include additional agent models and task domains

---

*This dataset represents ongoing research in AI agent evaluation. Results and methodologies may evolve as the field advances.*
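
## Example: Reading Trajectory Records

The record structure described under Data Format can be read with a few lines of Python. The following is a minimal sketch, not an official loader. It assumes each line of `agent_arena_data.jsonl` is strict JSON (without the `//` comments shown in the schema) and that every record carries the `model`, `human_eval_correctness`, and `traj` fields. The helper names `iter_records` and `summarize_by_model`, along with the sample records and model names, are hypothetical and exist only for illustration.

```python
import json
from collections import defaultdict
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one trajectory record per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def summarize_by_model(records: Iterable[dict]) -> dict:
    """Return {model: {"n", "success_rate", "avg_steps"}} for the given records."""
    totals = defaultdict(lambda: {"n": 0, "successes": 0, "steps": 0})
    for rec in records:
        t = totals[rec["model"]]
        t["n"] += 1
        # human_eval_correctness is documented as 0 or 1.
        t["successes"] += int(rec["human_eval_correctness"])
        t["steps"] += len(rec["traj"])
    return {
        model: {
            "n": t["n"],
            "success_rate": t["successes"] / t["n"],
            "avg_steps": t["steps"] / t["n"],
        }
        for model, t in totals.items()
    }


def _step(index: int) -> dict:
    """Build a made-up trajectory step in the documented shape."""
    return {
        "index": index,
        "image": f"images/task_id_step_{index}.png",
        "value": {"thought": "placeholder", "code": "pyautogui.click(10, 20)"},
    }


# Made-up records in the documented shape (not real dataset rows).
SAMPLE_RECORDS = [
    {"task_id": "t1", "instruction": "Open a file", "human_eval_correctness": 1,
     "model": "model_a (method_x)", "traj": [_step(1), _step(2)]},
    {"task_id": "t2", "instruction": "Rename a file", "human_eval_correctness": 0,
     "model": "model_a (method_x)", "traj": [_step(1)]},
    {"task_id": "t3", "instruction": "Open a browser", "human_eval_correctness": 1,
     "model": "model_b (method_y)", "traj": [_step(1), _step(2), _step(3)]},
]
SAMPLE_JSONL = "\n".join(json.dumps(r) for r in SAMPLE_RECORDS)

summary = summarize_by_model(iter_records(SAMPLE_JSONL.splitlines()))
for model, stats in sorted(summary.items()):
    print(model, stats)
```

To run the same summary on the downloaded file, pass an open file handle instead of the sample lines, for example `summarize_by_model(iter_records(open("agent_arena_data.jsonl", encoding="utf-8")))`. Records with missing or null fields, if any exist in the real data, would need extra handling that this sketch omits.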
Provider:
xlangai