lime-nlp/ExeVR-53k

Name: lime-nlp/ExeVR-53k
Creator: lime-nlp
Published: 2026-03-12 01:28:04
License: Apache-2.0 (per the dataset card metadata)

Source: Hugging Face. Updated 2026-03-12. Indexed 2026-04-05.

Download link:

https://hf-mirror.com/datasets/lime-nlp/ExeVR-53k


Resource description:

---
license: apache-2.0
task_categories:
- video-classification
- visual-question-answering
language:
- en
tags:
- video
- reward-model
- computer-use
- gui-agent
- execution-verification
- temporal-grounding
pretty_name: "ExeVR-53k: Execution Video Reward Dataset"
size_categories:
- 10K<n<100K
configs:
- config_name: default
  data_files:
  - split: train
    path: train/*
  - split: test
    path: test/*
---

# ExeVR-53k: Execution Video Reward Dataset

ExeVR-53k is a large-scale dataset for training and evaluating **Execution Video Reward Models (ExeVRM)**: vision-language models that judge, from a screen recording, whether a computer-using agent successfully completed a given task.

## Overview

| Split | Samples | Videos | Size |
|-------|---------|--------|------|
| Train | 53,904 | 53,904 | ~29 GB |
| Test | 789 | 789 | ~358 MB |
| **Total** | **54,693** | **54,693** | **~29.4 GB** |

## Usage

Download the dataset:

```bash
hf download lime-nlp/ExeVR-53k --repo-type dataset --local-dir ./ExeVR_53k
```

Reassemble the training set videos:

```bash
cd path/to/your/zip_files
cat train.tar.gz.part_* | tar xz
```

Decompress the test set videos:

```bash
tar -zxf test.tar.gz
```

## Data Sources

The dataset is constructed from agent trajectories across three sources:

1. **OSWorld** (24,956 train / 189 test): Computer-using agent trajectories on Ubuntu.
2. **AgentNet** (46,892 train / 400 test): Desktop human trajectories spanning three platform splits: Ubuntu, Windows, and MacOS.
3. **ScaleCUA** (7,012 train / 200 test): Multi-platform agent trajectories covering Ubuntu (3,062), Web (2,041), Android (1,002), Windows (582), and MacOS (358).
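On systems without `cat` and `tar`, the reassembly step from the Usage section can be approximated in Python. The following is a minimal sketch rather than part of the official tooling: the function name `reassemble_and_extract` and the temporary file name `_joined.tar.gz` are hypothetical, and it assumes the shards share the `train.tar.gz.part_` prefix and sort correctly by name, as the shell glob does.

```python
import shutil
import tarfile
from pathlib import Path


def reassemble_and_extract(parts_dir, out_dir, prefix="train.tar.gz.part_"):
    """Concatenate split archive parts in name order, then extract them.

    Mirrors `cat train.tar.gz.part_* | tar xz`. Returns the part names used.
    """
    parts_dir, out_dir = Path(parts_dir), Path(out_dir)
    # Shell globbing expands in lexicographic order, so sort the same way.
    parts = sorted(parts_dir.glob(prefix + "*"))
    if not parts:
        raise FileNotFoundError(f"no files matching {prefix}* in {parts_dir}")

    out_dir.mkdir(parents=True, exist_ok=True)
    joined = out_dir / "_joined.tar.gz"  # hypothetical temporary file name
    with open(joined, "wb") as dst:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, dst)  # streams in chunks

    with tarfile.open(joined, "r:gz") as archive:
        archive.extractall(out_dir)
    joined.unlink()  # remove the temporary joined archive
    return [p.name for p in parts]
```

A call such as `reassemble_and_extract("./ExeVR_53k", "./ExeVR_53k")` would extract next to the shards. Because this sketch writes a joined copy before extracting, it temporarily needs roughly the size of the compressed training set in extra disk space, which the shell pipeline avoids.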

| Platform | Train | Test | Description |
|----------|-------|------|-------------|
| **OSWorld** | 24,956 | 189 | Ubuntu GUI tasks sampled from CUA rollout |
| **Ubuntu** | 7,675 | 200 | Ubuntu GUI tasks (AgentNet + ScaleCUA) |
| **Win/Mac** | 18,263 | 200 | Windows/macOS desktop tasks (AgentNet + ScaleCUA) |
| **Web** | 2,041 | - | Browser-based tasks (from ScaleCUA) |
| **Android** | 1,002 | 200 | Android mobile tasks (from ScaleCUA) |
| **Total** | **53,904** | **789** | |

### Label Distribution

| Split | Correct | Incorrect |
|-------|---------|-----------|
| Train | 22,394 (41.5%) | 31,510 (58.5%) |
| Test | 394 (49.9%) | 395 (50.1%) |

## Directory Structure

```
ExeVR_53k/
├── README.md
├── train/                        # 53,904 training videos
│   ├── osworld_<id>_success.mp4
│   ├── osworld_<id>_failure.mp4
│   ├── ubuntu_<id>_success.mp4
│   ├── win_mac_<id>_success.mp4
│   ├── scalecua_<uuid>.mp4
│   └── ...
├── test/                         # 789 test videos
│   ├── osworld_<id>.mp4
│   ├── android_<uuid>.mp4
│   ├── ubuntu_<id>.mp4
│   ├── winmac_<id>.mp4
│   └── ...
├── test.tar.gz                   # Compressed test set
└── train.tar.gz.part_[aa-af]     # Compressed train set (5 GB shards)
```

## Data Format

Each sample is a JSON object following the ShareGPT conversation format, paired with a video file:

### Binary Reward (Correct / Incorrect)

```json
{
  "conversations": [
    {
      "from": "human",
      "value": "<video>Given a user task and a computer-using video recording, evaluate whether the user completes the task or not. Reply your judgement in the \\box{}.\nIf the video correctly completes the task, reply \\box{correct}. Otherwise, reply \\box{incorrect}.\n\n# User Task\nChange the slide background to purple.\n"
    },
    {
      "from": "gpt",
      "value": "\\box{correct}"
    }
  ],
  "videos": ["/path/to/video.mp4"]
}
```

### With Temporal Grounding (Android / ScaleCUA subset)

For incorrect Android samples, the label additionally includes a timestamp range indicating where the agent deviates from the instruction:

```json
{
  "conversations": [
    {
      "from": "human",
      "value": "<video>Given a user task and a computer-using video recording, evaluate whether the user completes the task or not. Reply your judgement in the \\box{}.\nIf the video correctly completes the task, reply \\box{correct}. Otherwise, reply \\box{incorrect}.\nIf the video does not complete the task (i.e., incorrect), please provide the timestemp range, i.e., from <[time_start] seconds> to <[time_end] seconds>, of the video that deviates from the user's instruction.\n\n# User Task\nFind the best-rated restaurant around CMU main campus\n"
    },
    {
      "from": "gpt",
      "value": "\\box{incorrect}\nThe video deviates from the user's instruction between <3.0 seconds> and <4.0 seconds>."
    }
  ],
  "videos": ["/path/to/video.mp4"]
}
```

## Video Specifications

- **Resolution**: 720p (1280x720)
- **FPS**: 1 frame per second (sampled)
- **Duration**: Varies per task (typically 10-60 seconds)
- **Format**: MP4

## Usage with ExeVRM

The dataset is designed for use with the [ExeVRM](https://github.com/lime-nlp/ExeVRM) training framework.
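
For pipelines that consume the records directly, the ShareGPT-style samples shown under Data Format can be reduced to a task string, a binary label, and an optional deviation range. The sketch below is illustrative only: `Verdict` and `parse_sample` are hypothetical names, and the regular expressions assume that every answer follows the exact `\box{...}` and `<N seconds>` wording of the two examples in this card.

```python
import re
from typing import NamedTuple, Optional, Tuple

# Patterns inferred from the two example records; other phrasings are not handled.
BOX_RE = re.compile(r"\\box\{(correct|incorrect)\}")
RANGE_RE = re.compile(
    r"<([0-9]+(?:\.[0-9]+)?) seconds> and <([0-9]+(?:\.[0-9]+)?) seconds>"
)
TASK_MARKER = "# User Task\n"


class Verdict(NamedTuple):
    task: str
    correct: bool
    deviation: Optional[Tuple[float, float]]  # (start, end) in seconds, if given
    video: str


def parse_sample(sample: dict) -> Verdict:
    """Extract task, label, and optional time range from one record."""
    turns = {turn["from"]: turn["value"] for turn in sample["conversations"]}
    prompt, answer = turns["human"], turns["gpt"]

    # The task text follows the "# User Task" marker at the end of the prompt.
    _, _, task = prompt.partition(TASK_MARKER)

    label = BOX_RE.search(answer)
    if label is None:
        raise ValueError(f"no \\box{{...}} verdict in answer: {answer!r}")

    span = RANGE_RE.search(answer)
    deviation = (float(span.group(1)), float(span.group(2))) if span else None
    return Verdict(task.strip(), label.group(1) == "correct", deviation,
                   sample["videos"][0])
```

Applied to the temporal-grounding example, this would yield `correct=False` with a deviation of `(3.0, 4.0)`; for binary-only records, `deviation` is `None`.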

Annotation files are stored separately:

- **Train annotations**: `ver53k.jsonl` (JSON list of 53,904 samples)
- **Test annotations**: `verbench.jsonl` (JSON list of 789 samples)

## Citation

If you use ExeVR-53k in your research, please cite our work:

```
@misc{song2026videobasedrewardmodelingcomputeruse,
  title={Video-Based Reward Modeling for Computer-Use Agents},
  author={Linxin Song and Jieyu Zhang and Huanxin Sheng and Taiwei Shi and Gupta Rahul and Yang Liu and Ranjay Krishna and Jian Kang and Jieyu Zhao},
  year={2026},
  eprint={2603.10178},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2603.10178},
}
```

## License

This dataset is released under the Apache License 2.0.
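
As a practical note on the annotation files: the card gives them a `.jsonl` extension but describes each as a "JSON list", so the on-disk layout is not certain from the text alone. The following sketch, with the hypothetical helpers `load_annotations` and `label_counts`, accepts either a single JSON array or one JSON object per line, and assumes the final conversation turn holds the `\box{...}` verdict as in the Data Format examples.

```python
import json
from collections import Counter
from pathlib import Path


def load_annotations(path):
    """Load records from a file that is either a JSON array or JSON Lines."""
    text = Path(path).read_text(encoding="utf-8").strip()
    if not text:
        return []
    if text.startswith("["):
        # Whole file is one JSON list.
        return json.loads(text)
    # Otherwise treat each non-empty line as one JSON object.
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def label_counts(samples):
    """Tally verdicts, assuming the last turn is the model answer."""
    counts = Counter()
    for sample in samples:
        answer = sample["conversations"][-1]["value"]
        if "\\box{incorrect}" in answer:
            counts["incorrect"] += 1
        elif "\\box{correct}" in answer:
            counts["correct"] += 1
        else:
            counts["unknown"] += 1
    return counts
```

Running `label_counts(load_annotations("verbench.jsonl"))` and comparing the result with the Label Distribution table is a quick sanity check that the download and parsing worked.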

Provided by:

lime-nlp
