ChrisHarig/OSWorld-Control

Name: ChrisHarig/OSWorld-Control
Creator: ChrisHarig
Published: 2026-04-11 05:34:43
License: none given in the listing (the dataset card below declares apache-2.0)

Hugging Face, updated 2026-04-11, indexed 2026-04-12

Download link:

https://hf-mirror.com/datasets/ChrisHarig/OSWorld-Control

Resource description:

---
dataset_info:
  features:
  - name: task_id
    dtype: string
  - name: domain
    dtype: string
  - name: instruction
    dtype: string
  - name: snapshot
    dtype: string
  - name: related_apps
    dtype: string
  - name: config
    dtype: string
  - name: evaluator
    dtype: string
  - name: evaluator_script
    dtype: string
  - name: web_dependent
    dtype: bool
  splits:
  - name: train
    num_examples: 339
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: apache-2.0
---

# OSWorld-Control

A desktop-GUI task dataset used by the **OSWorld Control setting** in [Control Arena](https://github.com/ChrisHarig/control-arena), a fork of the UK AI Security Institute's open-source AI Control evaluation framework. All tasks are copies of [OSWorld](https://github.com/xlang-ai/osworld) as of April 10, 2026.

## Purpose

The dataset holds 339 feasible desktop GUI tasks across 10 application domains (LibreOffice Calc, Writer, Impress, GIMP, Chrome, VS Code, Thunderbird, VLC, OS, multi-app), plus per-task evaluator scripts. Control Arena loads this dataset automatically at eval time:

```python
from control_arena.settings.osworld_control import OSWorldControlSetting

setting = OSWorldControlSetting()  # loads from ChrisHarig/OSWorld-Control
```

## Repository layout

- `data/train-*.parquet`: the 339 tasks (one row per task).
- `evaluators/`: shared evaluator modules (`runner.py`, `table.py`, `general.py`, `docs.py`, `gimp.py`, `slides.py`, `vlc.py`, `utils.py`). At setup time Control Arena downloads these to `/opt/evaluators/` inside the task VM. Per-task `evaluator_script` rows then `import` from them.

## Columns

- **task_id** (str): unique identifier for the task (matches upstream OSWorld).
- **domain** (str): application domain, one of `libreoffice_calc`, `libreoffice_writer`, `libreoffice_impress`, `gimp`, `chrome`, `vs_code`, `thunderbird`, `vlc`, `os`, `multi_apps`.
- **instruction** (str): natural-language task instruction shown to the agent.
- **snapshot** (str): upstream OSWorld VM snapshot the task was written against.
- **related_apps** (JSON-encoded str): apps the task touches.
- **config** (JSON-encoded str): upstream OSWorld per-task setup steps (window activation, file upload, URL prep, etc.).
- **evaluator** (JSON-encoded str): upstream OSWorld evaluator spec (metric name, getter config, expected values).
- **evaluator_script** (str): per-task Python script that imports the shared runner and calls `evaluate()`. Written to `/tmp/evaluator.py` on the VM at scoring time and executed there.
- **web_dependent** (bool): `True` for 42 tasks whose evaluators depend on live web content (URL fetches, HTML parsing, external API responses). See the caveat below.

## Web-dependent tasks: important for comparing results over time

42 of the 339 tasks are flagged `web_dependent=True`. Their evaluators hit live webpages, so when a target website restructures, the evaluator's expected data may stop matching what the agent sees, producing false negatives even when the model completed the task correctly. This is a known issue inherited from upstream OSWorld. We plan to keep this dataset up to date as upstream fixes land. In the meantime:

- **Scores on web-dependent tasks are not directly comparable across time.** A task that scored 0.8 last quarter may score 0.4 today purely because the target page changed.
- For reproducible benchmarks, filter the dataset with `web_dependent == False` (leaves 297 tasks). Control Arena exposes this as `OSWorldControlSetting(exclude_web_dependent=True)`.
- For comparisons across time, pin the dataset revision to a specific Hugging Face git SHA and re-run honest baselines alongside any new attack runs so you can detect drift.
- Web drift affects all models in a single eval run equally, so it does not bias relative comparisons within one run.
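
The JSON-encoded columns and the `web_dependent` flag described above can be handled with a few lines of plain Python. The following is a minimal sketch, not part of Control Arena: the helper names `decode_row` and `stable_subset` and the sample rows are hypothetical, and in practice the rows would come from the dataset's `train` split.

```python
import json

# Columns the dataset card describes as JSON-encoded strings.
JSON_COLUMNS = ("related_apps", "config", "evaluator")


def decode_row(row: dict) -> dict:
    """Return a copy of a row with the JSON-encoded columns parsed."""
    decoded = dict(row)
    for name in JSON_COLUMNS:
        decoded[name] = json.loads(row[name])
    return decoded


def stable_subset(rows: list[dict]) -> list[dict]:
    """Keep only rows whose evaluators do not depend on live web content."""
    return [row for row in rows if not row["web_dependent"]]


# Hypothetical rows for illustration only; real task IDs and specs differ.
sample_rows = [
    {
        "task_id": "example-1",
        "domain": "chrome",
        "related_apps": '["chrome"]',
        "config": "[]",
        "evaluator": '{"note": "placeholder spec"}',
        "web_dependent": True,
    },
    {
        "task_id": "example-2",
        "domain": "gimp",
        "related_apps": '["gimp"]',
        "config": "[]",
        "evaluator": '{"note": "placeholder spec"}',
        "web_dependent": False,
    },
]

decoded_rows = [decode_row(row) for row in stable_subset(sample_rows)]
print([row["task_id"] for row in decoded_rows])
```

Filtering before decoding keeps the work proportional to the subset actually used; on the full dataset the same filter should leave the 297 tasks mentioned above.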

## Attribution

Task definitions and evaluator logic are derived from [xlang-ai/OSWorld](https://github.com/xlang-ai/OSWorld) by XLANG NLP Lab, licensed under Apache License 2.0 (`license: apache-2.0` above). The restructuring to run evaluators inside the VM (one `evaluator_script` column per task, shared modules under `evaluators/`) follows the BashArena pattern used elsewhere in Control Arena. Modifications © Chris Harig.

- Upstream repository: https://github.com/xlang-ai/OSWorld
- Upstream LICENSE: https://github.com/xlang-ai/OSWorld/blob/main/LICENSE

Provided by:

ChrisHarig

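Summary

Dataset introduction

Construction

OSWorld-Control is a dataset for evaluating desktop GUI automation, built by collecting and restructuring existing tasks. Its tasks come from the XLANG NLP Lab's OSWorld project as of April 10, 2026, and comprise 339 feasible tasks across ten application domains. Each task keeps its original task ID, natural-language instruction, VM snapshot, and evaluator spec. The evaluation logic is converted into a per-task Python script column, backed by shared evaluator modules, so that the Control Arena framework can run scoring inside the VM.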

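Features

The tasks span ten desktop application domains, including the LibreOffice suite, GIMP, and Chrome, which gives a varied set of evaluation scenarios. Each task ships with its configuration and an evaluator script, so scoring is fully automated. Notably, 42 tasks are flagged as depending on live web content. Changes to the structure of external web pages can make their scores drift over time, so for reproducible results users can filter them out and work with the remaining 297, more stable tasks.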

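Usage

Researchers can load the dataset directly through the Control Arena framework. Calling `OSWorldControlSetting()` at setup loads the task definitions and evaluation logic automatically. Evaluator scripts run inside the task VM, where they import the shared modules and call the evaluation function to produce a score. For benchmark studies that compare results over time, either exclude web-dependent tasks with the `exclude_web_dependent` parameter or pin the dataset to a fixed revision hash, so that results stay comparable.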
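
Pinning to a revision hash can be sketched with the Hugging Face `datasets` library, whose `load_dataset` function accepts a `revision` argument. This is only an illustration under stated assumptions: the source does not say how Control Arena itself takes a revision, so the sketch loads the data directly, and the helper names `pinned_load_kwargs` and `load_pinned` are hypothetical.

```python
import re

DATASET_ID = "ChrisHarig/OSWorld-Control"

# A full git commit SHA is 40 lowercase hex characters; branches and tags can move.
_FULL_SHA = re.compile(r"[0-9a-f]{40}")


def pinned_load_kwargs(revision: str) -> dict:
    """Build keyword arguments for `datasets.load_dataset` pinned to one commit."""
    if not _FULL_SHA.fullmatch(revision):
        raise ValueError("expected a full 40-character commit SHA, not a branch or tag")
    return {"path": DATASET_ID, "split": "train", "revision": revision}


def load_pinned(revision: str):
    """Load the train split at a fixed commit (needs `datasets` and network access)."""
    # Imported lazily so the validation helper above also works offline.
    from datasets import load_dataset

    return load_dataset(**pinned_load_kwargs(revision))
```

Rejecting branch names such as `main` is a deliberate choice in this sketch: only a commit SHA guarantees that two runs months apart see the same rows.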

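Background and challenges

Background overview

OSWorld-Control was created in 2026 by Chris Harig, based on the XLANG NLP Lab's OSWorld project, and is integrated into the Control Arena evaluation framework. It focuses on automated task execution in desktop graphical user interfaces (GUIs), with 339 feasible tasks across ten application domains such as LibreOffice, GIMP, and Chrome, and aims to give AI agents a standardized benchmark for control in a real operating-system environment. It answers a need in human-computer interaction and embodied-intelligence research for automated evaluation of complex, cross-application tasks. By providing detailed instructions, VM snapshots, and evaluator scripts, it supports reliable measurement of how well agents complete tasks in open-ended environments.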

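Current challenges

The dataset faces challenges on two fronts. On the problem side, it targets the generalization and robustness of agents that must carry out cross-application, multi-step tasks in varied desktop GUI environments, which requires a model to understand natural-language instructions and operate GUI elements precisely. On the construction side, it inherits from upstream OSWorld the evaluation-stability problem of 42 web-dependent tasks: their evaluator scripts rely on live web content, so structural changes to external sites can distort scores and undermine comparisons over time. Researchers therefore need to handle web-dependent tasks with care in benchmarks, or pin the dataset revision, to keep evaluations consistent.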
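
One way to notice such drift is to compare per-task scores from honest-baseline runs taken at different times. The sketch below assumes the scores have already been collected into plain dictionaries keyed by `task_id`; the function name `drifted_tasks`, the `tolerance` parameter, and the sample scores are hypothetical and not part of Control Arena.

```python
def drifted_tasks(
    earlier: dict[str, float],
    later: dict[str, float],
    tolerance: float = 0.0,
) -> list[str]:
    """List task IDs whose baseline score dropped by more than `tolerance`."""
    return sorted(
        task_id
        for task_id, old_score in earlier.items()
        if task_id in later and old_score - later[task_id] > tolerance
    )


# Hypothetical honest-baseline scores from two runs of the same model.
last_quarter = {"task-a": 0.8, "task-b": 1.0, "task-c": 0.5}
today = {"task-a": 0.4, "task-b": 1.0, "task-c": 0.5}

print(drifted_tasks(last_quarter, today, tolerance=0.1))
```

A drop flagged this way is only a hint: it may also reflect ordinary run-to-run variance, so flagged tasks are best checked against the `web_dependent` column before attributing the change to a restructured web page.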

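Common scenarios

Typical use cases

In research on desktop GUI automation and agent control, OSWorld-Control serves as a benchmark resource. Its typical use is evaluating how well AI models carry out multi-step tasks in a real operating-system environment. The tasks mirror everyday office and multimedia work, such as editing documents in LibreOffice, interacting with the web through Chrome, or writing code in VS Code, and give researchers a standardized platform for measuring a model's planning, execution, and adaptability in complex GUI environments.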

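Derived and related work

OSWorld-Control directly underpins the OSWorld Control setting in the Control Arena evaluation framework, where it provides the experimental environment for AI safety and control evaluations. The dataset itself derives from the upstream OSWorld project, which is widely cited in the multimodal large-model and embodied-intelligence research communities and has prompted work on cross-application task planning, GUI element understanding, and long-horizon action generation for desktop automation agents.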

Recent research on the dataset