
amazon-agi/Amazon-Nova-Act-v1.0-evals

Source: Hugging Face · Updated 2025-12-02 · Indexed 2026-02-07
Download link:
https://hf-mirror.com/datasets/amazon-agi/Amazon-Nova-Act-v1.0-evals
Resource description:
---
license: cc-by-4.0
configs:
- config_name: real_bench_v2
  data_files:
  - split: eval
    path: "REAL Bench V2/data-*.arrow"
- config_name: real_bench_v1
  data_files:
  - split: eval
    path: "REAL Bench V1/data-*.arrow"
- config_name: screenspot_v2_web_text
  data_files:
  - split: eval
    path: "ScreenSpot V2 Web Text/data-*.arrow"
- config_name: screenspot_v2_web_icon
  data_files:
  - split: eval
    path: "ScreenSpot V2 Web Icon/data-*.arrow"
- config_name: workarena_l1
  data_files:
  - split: eval
    path: "WorkArena/data-*.arrow"
---

# Dataset Card for Amazon Nova Act v1.0 Evaluation

This dataset shares additional details of the settings and methodology used in evaluating `nova-act-v1.0`, a custom Nova 2 Lite model that powers the Amazon Nova Act AWS service. All scores are reported as mean@5 unless indicated otherwise. Nova Act is evaluated in a pure vision setting (i.e., no DOM or accessibility trees are provided).

Responses from Nova Act in this dataset are released under a CC-BY-NC license. The public benchmarks used to generate responses can be accessed via the hyperlinks provided below, subject to the applicable license terms for each benchmark.

The `nova-act-v1.0` model was trained on licensed data, proprietary data, open-source datasets, and publicly available data.

## Benchmark - REAL Bench v1

*References:* https://arxiv.org/abs/2504.11543; https://realevals.xyz/

REAL Bench is a “controlled environment where AI agents interact with realistic website replicas to test complex tasks.”

**Measurement methodology**<br>
All models were limited to a maximum trajectory length of 70 steps.

Claude models were evaluated in a custom harness using Playwright for browser use, based on the Anthropic [reference code](https://github.com/anthropics/claude-quickstarts/tree/main/computer-use-demo) for computer use. This harness used the following prompt:

```
<SYSTEM_CAPABILITY>
* You are a web browser agent.
* You are provided with a task you are trying to complete, which may require multiple actions.
* You should utilize the computer tool to perform these actions to complete the task.
* You already have a web browser open and are viewing the correct starting page for the task. You cannot manually navigate to any other page and you cannot use any applications besides this web browser that is already open for you.
* You should start by taking a screenshot to view the starting web page.
* You may never ask for user input. At every step, you should either request use of the computer tool, respond that the task has been completed, or respond that the task cannot be completed and explain why.
* If the task is asking you to return some information, then your final response should end with a line that has `ANSWER: <your answer>` and nothing else.
* Do not attempt to do anything that is not explicitly required to complete the task given. Do not take any initiative. When you have completed the explicit task given, then simply indicate as such. Do not proceed with any potential followup actions that you were not explicitly instructed to do.
* The current date is {datetime.today().strftime("%A, %B %-d, %Y")}.
</SYSTEM_CAPABILITY>

<IMPORTANT>
* Begin by taking a screenshot.
* Never ask a question or for user input. I cannot provide more context or respond to questions or requests. This task descripition is the only non tool response you will receive from me. You must do your best to pick the next computer tool actions to complete the task.
* If the task is asking you to return some information, then your final response should end with a line that has `ANSWER: <your answer>` and nothing else. Give the minimal answer possible that provides the desired answer. Do not repeat extraneous information from the question or form a complete sentence if not necessary.
* If given a complex task, break down into smaller steps and ask the user for details only if necessary
* Read through web pages thoroughly by scrolling down till you have gathered enough info
* Be concise!
* Complete the task as requested, then stop.
* If a question cannot be answered but a schema is requested YOU MUST RETURN AN ANSWER FOLLOWING THAT SCHEMA!
</IMPORTANT>
```

Claude was presented with the three most recent screenshots at each step.

Many tasks in REAL Bench v1 and v2 require entering personal information not supplied in the prompt. For example, the model must provide an email and phone number to complete the task “Book me a reservation at an Italian restaurant for today at 3pm” in the OpenDining environment. Nova Act is trained to enter only information explicitly supplied in the prompt, so we append the following prompt text to every task:

> If the task involves checking out a restaurant or requesting a tour, do not ask the human for help and complete all the forms to the end. Specifically, if the page is asking for phone numbers or email addresses, make sure to provide any valid phone number or email address into the fields. If there isn't enough information in the task for certain parts of the page, feel free to put any information or select any fields.

This prompt appendix was provided to all models. We observed higher performance with the appendix in all cases.

**Training methodology**<br>
REAL Bench mimics popular web properties that are often within the training distribution for web agents. Nova Act was trained on these REAL Bench replicas, under consultation with benchmark authors, using training tasks generated by Amazon without reference to the test set.

## Benchmark - REAL Bench v2

*Reference:* https://github.com/agi-inc/agisdk

REAL Bench v2 is an update to the REAL Bench v1 task set.

**Measurement methodology**<br>
All models were limited to a maximum trajectory length of 70 steps. The same prompt appendix as in REAL Bench v1 was used for all models in REAL Bench v2.
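
The REAL Bench run settings (a shared prompt appendix, a 70-step cap, and mean@5 reporting) can be sketched in a few lines of Python. This is a minimal illustration, not the harness used for the evaluation. The helper names `build_task_prompt` and `mean_at_k` are hypothetical, the appendix text is abbreviated, and reading mean@5 as the success rate over five independent runs per task, averaged over tasks, is an assumption that the card does not spell out.

```python
from statistics import mean

MAX_STEPS = 70  # trajectory cap stated in the card for REAL Bench v1 and v2

# Abbreviated stand-in for the appendix quoted in the REAL Bench v1 section.
PROMPT_APPENDIX = (
    "If the task involves checking out a restaurant or requesting a tour, "
    "do not ask the human for help and complete all the forms to the end."
)


def build_task_prompt(task: str) -> str:
    """Append the shared appendix to a benchmark task (hypothetical helper)."""
    return f"{task.strip()} {PROMPT_APPENDIX}"


def mean_at_k(runs_per_task: dict[str, list[bool]], k: int = 5) -> float:
    """Average success over k runs per task, then over tasks (assumed meaning)."""
    per_task = []
    for task_id, runs in runs_per_task.items():
        if len(runs) != k:
            raise ValueError(f"{task_id}: expected {k} runs, got {len(runs)}")
        per_task.append(sum(runs) / k)
    return mean(per_task)
```

With made-up results such as `{"task-a": [True] * 5, "task-b": [True, False, False, False, False]}`, `mean_at_k` averages the per-task rates 1.0 and 0.2. An agent loop would stop a rollout once it reaches `MAX_STEPS` actions.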

## Benchmark - WorkArena L1

*Reference:* https://servicenow.github.io/WorkArena/

“WorkArena is a suite of browser-based tasks tailored to gauge web agents' effectiveness in supporting routine tasks for knowledge workers.”

**Measurement methodology**<br>
The WorkArena benchmark defines 33 task templates with six sampled configurations each. We further subsampled/shuffled these into larger shards to make a 330-task test set. Each task was evaluated with a maximum of 30 steps and a per-task timeout of 1000s for all models. Each task logs in to the hosted ServiceNow instance and navigates to the starting URL before initiating the agent run. Task verifiers used the DOM (for dashboards, forms, and charts) or the model response (extraction tasks) to assign a binary score to the agent rollout.

Claude models were executed in the same harness as described for REAL Bench v1. Following the Claude [documentation](https://platform.claude.com/docs/en/agents-and-tools/tool-use/computer-use-tool), we appended the following instruction to the above `<SYSTEM_CAPABILITY>` prompt:

> Some UI elements (like dropdowns and scrollbars) might be tricky to manipulate using mouse movements. If you experience this, try to use keyboard shortcuts.

WorkArena authors provided guidance in conducting this evaluation.

## Benchmark - ScreenSpot V2 Web

*Reference:* [OS-ATLAS: A Foundation Action Model for Generalist GUI Agents](https://arxiv.org/abs/2410.23218)

ScreenSpot V2 is a successor to the [ScreenSpot benchmark](https://arxiv.org/abs/2401.10935), which “assesses single-step GUI grounding capabilities across multiple platforms”. We assessed `nova-act-v1.0` on the subset of ScreenSpot V2 focused on web element grounding.

**Measurement methodology**<br>
These tasks measure the model’s ability to correctly locate text and icons on webpage screenshots. They contain tasks such as “Click on view all users” with an accompanying screenshot.
The model is evaluated on whether it correctly clicks on a point within the target bounding box.

Claude models were queried with the benchmark image and the following prompt:

```
<SYSTEM_CAPABILITY>
* You are utilising a computer system which provides you with a screenshot image of the current screen.
* You will be given a query of what you need to click on in the screenshot.
* Always just proceed with the best mouse_move tool_use action that will accomplish the desired task for the given query and screenshot.
* The system has no features or tools available to you other than the ability to move the mouse cursor and click on the screen.
* You may never ask for user input.
* You may never ask the system for a screenshot - the screenshot is already provided to you.
</SYSTEM_CAPABILITY>

Here is the screenshot of the current screen. Click on {locate_query} in this screenshot and tell me the coordinates. Explain your reasoning for the chosen coordinates before clicking.
```

For Nova Act, we provided the default Nova Act system prompt and the task prompt `“Click on {locate_query}”`.
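
The ScreenSpot scoring rule, a click counts when it lands inside the target bounding box, can be illustrated with a short Python sketch. The names `Box`, `is_hit`, and `accuracy` are hypothetical, and the `(x1, y1, x2, y2)` pixel layout with inclusive edges is an assumption. The card does not state the box format used in the released data.

```python
from typing import NamedTuple, Sequence


class Box(NamedTuple):
    """Axis-aligned target box in pixels (assumed layout: x1, y1, x2, y2)."""
    x1: float
    y1: float
    x2: float
    y2: float


def is_hit(click: tuple[float, float], box: Box) -> bool:
    """Return True when the click lies inside the box, edges included."""
    x, y = click
    return box.x1 <= x <= box.x2 and box.y1 <= y <= box.y2


def accuracy(clicks: Sequence[tuple[float, float]], boxes: Sequence[Box]) -> float:
    """Fraction of clicks that land inside their corresponding target box."""
    if len(clicks) != len(boxes):
        raise ValueError("clicks and boxes must have the same length")
    if not clicks:
        raise ValueError("at least one example is required")
    hits = sum(is_hit(c, b) for c, b in zip(clicks, boxes))
    return hits / len(clicks)
```

For example, a predicted click at `(120, 45)` against `Box(100, 30, 180, 60)` is a hit, while a click at `(90, 45)` is a miss.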
Provider:
amazon-agi
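
The configurations declared in the card's YAML header can be loaded one at a time with the Hugging Face `datasets` library. The sketch below assumes that `datasets` is installed and that the repository is reachable. The helper `load_eval_split` is hypothetical, and no column names are shown because the card does not list them.

```python
REPO_ID = "amazon-agi/Amazon-Nova-Act-v1.0-evals"

# Config names copied from the YAML header of the dataset card.
CONFIGS = [
    "real_bench_v2",
    "real_bench_v1",
    "screenspot_v2_web_text",
    "screenspot_v2_web_icon",
    "workarena_l1",
]


def load_eval_split(config_name: str):
    """Load the single `eval` split of one configuration (hypothetical helper)."""
    if config_name not in CONFIGS:
        raise ValueError(f"unknown config: {config_name!r}")
    # Imported lazily so the config list can be inspected without the library.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config_name, split="eval")
```

Calling `load_eval_split("workarena_l1")` downloads the Arrow files for that configuration. Inspecting the returned dataset's `column_names` is the safest way to discover the schema.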