WebTailBench

Name: WebTailBench
Creator: maas
Published: 2026-01-07 03:42:32
License: No description provided

ModelScope community. Updated: 2026-01-07. Indexed: 2025-12-06.

Download link:

https://modelscope.cn/datasets/microsoft/WebTailBench

Description:

# WebTailBench: A Comprehensive Benchmark for Computer-Using Agents

[![Microsoft](https://img.shields.io/badge/Microsoft-Project-0078D4?logo=microsoft)](https://aka.ms/msaif/fara) [![Hugging Face Model](https://img.shields.io/badge/🤗-Model-yellow)](https://huggingface.co/microsoft/fara-7b) [![Foundry](https://img.shields.io/badge/Azure-Foundry-0089D6)](https://aka.ms/foundry-fara-7b) [![Github](https://img.shields.io/badge/Github-181717?logo=github&logoColor=white)](https://github.com/microsoft/fara)

Paper: [Fara-7B: An Efficient Agentic Model for Computer Use](https://huggingface.co/papers/2511.19663)

## Dataset Summary

WebTailBench is a comprehensive evaluation benchmark designed to assess the performance of Computer-Using Agent (CUA) models across diverse, realistic web-based tasks. This dataset was first released as part of our **Fara** tech report and is formally presented in the paper [Fara-7B: An Efficient Agentic Model for Computer Use](https://huggingface.co/papers/2511.19663).

The benchmark consists of:

- **WebTailBench (Main)**: 609 hand-verified tasks across 11 categories designed to test both breadth of skills and depth through complex, multi-step operations
- **WebTailBench-Refusals**: 111 curated harmful tasks across 7 categories to evaluate agents' ability to appropriately refuse unsafe requests

WebTailBench addresses critical gaps in existing benchmarks by providing:

1. **Expanded task diversity and coverage** - Including underrepresented task types such as restaurant, hotel, and flight reservations, event tickets, real estate, and job searches
2. **Increased task complexity** - Multi-step and cross-site tasks that chain information across websites
3. **Realistic scenarios** - Tasks drawn from high-traffic webpages reflecting actual human information needs
4. **Objective evaluation** - Goal-oriented tasks with clear success criteria verified as specific and achievable by human annotators

## Key Features

- **Realism**: Tasks taken from high-traffic webpages reflecting actual user behavior
- **Coverage**: 11 task categories with sufficient examples per category to assess proficiency
- **Objectivity**: Goal-oriented tasks with clear, actionable objectives
- **Alignment**: Verification system that matches human assessments
- **Freshness**: Tasks valid through November 2025 with periodic refresh capability
- **Safety Testing**: Comprehensive refusals benchmark for harmful task detection

## Dataset Structure

### Main Benchmark (WebTailBench.tsv)

**Data Fields**:

- `benchmark`: Task category (e.g., "flights", "hotels", "shopping", "restaurants", "activities", "ticketing", "real-estate", "jobs", "shopping_list", "comparison_shopping", "compositional_tasks")
- `subdir`: Unique task identifier
- `task_summary`: Detailed task description with specific requirements and fallback instructions

**Task Categories**:

*Single-skill tasks (8 categories):*

1. **Shopping** (56 tasks) - E-commerce product searches and purchases
2. **Flights** (51 tasks) - Flight booking across multiple airlines
3. **Hotels** (52 tasks) - Hotel reservations and inquiries
4. **Restaurants** (52 tasks) - Restaurant searches and bookings
5. **Activities** (80 tasks) - Event and activity searches
6. **Ticketing** (57 tasks) - Event ticket purchases
7. **Real-Estate** (48 tasks) - Property searches and listings
8. **Jobs/Careers** (50 tasks) - Job search and application tasks

*Multi-step tasks (3 categories):*

9. **Shopping List** (51 tasks) - Adding multiple items to cart
10. **Comparison Shopping** (57 tasks) - Cross-site price comparisons
11. **Compositional Tasks** (55 tasks) - Complex multi-step operations

**Total**: 609 tasks

### Refusals Benchmark (WebTailBench-Refusals.tsv)

**Data Fields**:

- `TaskID`: Unique identifier for the harmful task
- `TaskToBeRefused`: Description of the harmful task that should be refused

**Harmful Task Categories** (111 total tasks across 7 categories):

1. Illegal activities
2. Deceptive tasks
3. High-risk domains
4. Harassment and hate
5. Irresponsible use of technology
6. Misinformation
7. Sexual content

## Dataset Creation

### Curation Rationale

WebTailBench was created to address significant limitations in existing web agent benchmarks:

- **Limited task diversity**: Most benchmarks lack sufficient coverage of common real-world tasks (e.g., Online-Mind2Web has only 3 flight booking tasks)
- **Synthetic and ambiguous goals**: ~25% of existing tasks in Online-Mind2Web use vague instructions like "browse" or "find" without clear objectives, effectively measuring only navigation ability
- **Poor evaluation alignment**: Existing verifiers often don't align well with human judgment; for instance, WebVoyager's evaluation does not use the model's final output or action history (see [this blog](https://tiancixue.notion.site/An-Illusion-of-Progress-Assessing-the-Current-State-of-Web-Agents-1ac6cd2b9aac80719cd6f68374aaf4b4?pvs=25#1ac6cd2b9aac8007a4b7fd9444102bcd))

### Source Data

Tasks are derived from high-traffic commercial websites across multiple domains, reflecting actual human information needs and behaviors. All 609 tasks in the main benchmark were hand-verified by human annotators to ensure achievability.

### Time Sensitivity

Tasks are designed to remain valid through **November 2025**, after which periodic refreshes may occur.
Some categories are particularly time-sensitive:

- Flights, hotels, ticketing: Include specific dates or relative times
- Restaurants: May close or change policies
- Jobs: Positions may be filled or removed
- Shopping: Products may be discontinued

## Benchmark Results

### Performance Overview (Main Benchmark)

Breakdown of WebTailBench results for each of its 11 segments. Scores are percentages averaged over three independent runs, and any task that did not finish is penalized. The first 8 segments test a single skill or objective, usually on a single website; the remaining 3 are more difficult multi-step or cross-site tasks.

| **WebTailBench** | **Num Tasks** | **SoM 4.5** | **SoM o3** | **SoM 4o** | **GLM-4.1V 9B-Thinking** | **OAI Comp. Use-Prev** | **UI-TARS 1.5-7B** | **Fara 7B** |
|---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| | | ***SoM Agents*** | | | | ***Computer Use Models*** | | |
| Shopping | 56 | 62.5 | 71.4 | 38.1 | 31.0 | 42.3 | 41.1 | 52.4 |
| Flights | 51 | 60.1 | 39.2 | 11.1 | 10.5 | 17.6 | 10.5 | 37.9 |
| Hotels | 52 | 68.6 | 56.4 | 31.4 | 19.9 | 26.9 | 35.3 | 53.8 |
| Restaurants | 52 | 67.9 | 59.6 | 47.4 | 32.1 | 35.9 | 22.4 | 47.4 |
| Activities | 80 | 70.4 | 62.9 | 41.7 | 26.3 | 30.4 | 9.6 | 36.3 |
| Ticketing | 57 | 58.5 | 56.7 | 37.4 | 35.7 | 49.7 | 30.4 | 38.6 |
| Real-Estate | 48 | 34.0 | 17.4 | 20.1 | 16.0 | 9.0 | 9.7 | 23.6 |
| Jobs/Careers | 50 | 49.3 | 44.0 | 32.7 | 22.7 | 20.7 | 20.7 | 28.0 |
| Shopping List (2 items) | 51 | 66.0 | 62.7 | 17.0 | 7.8 | 34.0 | 20.9 | 49.0 |
| Comparison Shopping | 57 | 67.3 | 59.1 | 27.5 | 22.8 | 1.2 | 8.8 | 32.7 |
| Compositional Tasks | 55 | 51.5 | 39.4 | 26.7 | 17.0 | 10.3 | 9.1 | 23.0 |
| **Macro Avg.** | 609 | 59.7 | 51.7 | 30.1 | 22.0 | 25.3 | 19.9 | 38.4 |
| **Micro Avg.** | 609 | 60.4 | 52.7 | 30.8 | 22.4 | 25.7 | 19.5 | 38.4 |

### Detailed Results by Category

Performance varies significantly across categories (the best score among the evaluated models is shown in parentheses):

- Models generally perform better on simpler tasks: Hotels (68.6%), Activities (70.4%), Restaurants (67.9%)
- More challenging: Real-Estate (34.0%), Jobs (49.3%), Compositional Tasks (51.5%)
- Some segments may have low scores because common websites within them aggressively block bots

### Cost Efficiency

Per-task WebTailBench statistics for different models. All metrics are reported per task.

| **Model** | **Cost ($) per Task** | **Accuracy (%)** | **Actions per Task** | **Input Tok per Task** | **Output Tok per Task** |
|---|:---:|:---:|:---:|:---:|:---:|
| ***SoM Agents*** | | | | | |
| SoM Agent (4.5) | 0.595 | 60.4 | 29.8 ± 26.6 | 279k ± 343k | 17.6k ± 26.0k |
| SoM Agent (o3) | 0.948 | 53.0 | 41.1 ± 34.2 | 390k ± 405k | 20.9k ± 23.4k |
| SoM Agent (4o) | 0.418 | 30.0 | 18.4 ± 18.8 | 157k ± 237k | 2.6k ± 2.6k |
| GLM-4.1V 9B-Thinking | 0.044 | 22.4 | 23.8 ± 27.9 | 117k ± 153k | 12.8k ± 15.6k |
| ***Computer Use Models*** | | | | | |
| OAI Comp. Use-Prev | 1.523 | 25.7 | 58.8 ± 35.4 | 493k ± 355k | 3.6k ± 2.2k |
| UI-TARS 1.5-7B | 0.133 | 19.5 | 41.1 ± 32.4 | 659k ± 631k | 3.4k ± 2.9k |
| Fara 7B | 0.069 | 38.4 | 41.1 ± 33.1 | 343k ± 323k | 2.4k ± 1.9k |

## Considerations for Using the Data

### Intended Use

WebTailBench is designed for assessing breadth of skills and mastery of deeply chained tasks:

- Evaluating computer-using agent models on realistic web tasks
- Measuring both breadth (across 11 categories) and depth (multi-step tasks) of capabilities
- Assessing safety through appropriate refusal of harmful requests
- Benchmarking progress in web automation and agent intelligence

### Limitations

- **Temporal validity**: Tasks expire after November 2025 and may become outdated earlier
- **Website changes**: Tasks may break if websites restructure or change functionality
- **Geographic constraints**: Some tasks may only work in specific regions
- **Evaluation requirements**: Requires the Task Verification system for proper assessment
- **Sold-out scenarios**: Tasks account for unavailable bookings, but this adds evaluation complexity

### Social Impact and Biases

**Positive impacts**:

- Advances research in helpful AI agents for everyday tasks
- Provides safety evaluation through the refusals benchmark
- Encourages development of more capable and reliable automation

**Potential concerns**:

We advise running these evaluations in a sandboxed environment without access to sensitive or personal information (e.g., a credit card or delivery address) so that the runs have no real-world effects. Risks include:

- Agents executing harmful tasks if safety measures fail
- Unintended consequences that are hard to reverse, e.g., if an agent successfully completes a reservation, booking, or shopping task

**Known biases**:

- Tasks reflect Western/English-speaking user patterns and websites
- Limited representation of accessibility-focused tasks
- Skewed toward commercial/transactional activities
- Missing several segments that humans would value, e.g., finding a doctor

### Licensing Information

MIT License

### Citation Information

If you use Fara in your research, please cite our work:

```bibtex
@article{Awadallah2025Fara7B,
  title={Fara-7B: An Efficient Agentic Model for Computer Use},
  author={Ahmed Awadallah and Yash Lara and Raghav Magazine and Hussein Mozannar and Akshay Nambi and Yash Pandya and Aravind Rajeswaran and Corby Rosset and Alexey Taymanov and Vibhav Vineet and Spencer Whitehead and Andrew Zhao},
  journal={arXiv preprint arXiv:2511.19663},
  year={2025},
  url={https://huggingface.co/papers/2511.19663}
}
```

### Contributions

Created by Microsoft Research AI Frontiers. All tasks were hand-verified by human annotators to ensure quality and achievability.

### Task Verification System

WebTailBench includes a Task Verification system that:

- Provides a reproducible evaluation methodology
- Aligns more closely with human judgment than existing verifiers
- Will be released alongside the benchmark dataset as part of the GitHub repository (forthcoming)
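The data fields listed under Dataset Structure can be read with Python's standard `csv` module. The following is a minimal sketch, not an official loader: it assumes the TSV has a header row naming the columns `benchmark`, `subdir`, and `task_summary`, and it runs on a few placeholder rows (the task IDs and descriptions below are hypothetical, not taken from the dataset). The helper names `load_tasks` and `count_by_category` are illustrative.

```python
import csv
import io
from collections import Counter


def load_tasks(tsv_file):
    """Parse an open TSV file object into a list of row dicts.

    Assumption: the first line is a header naming the columns
    `benchmark`, `subdir`, and `task_summary`.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return list(reader)


def count_by_category(tasks):
    """Count how many tasks fall under each `benchmark` category."""
    return Counter(task["benchmark"] for task in tasks)


# Hypothetical placeholder rows standing in for the real file contents.
SAMPLE_TSV = (
    "benchmark\tsubdir\ttask_summary\n"
    "flights\texample_task_001\tPlaceholder flight task description.\n"
    "hotels\texample_task_002\tPlaceholder hotel task description.\n"
    "flights\texample_task_003\tAnother placeholder flight task description.\n"
)

tasks = load_tasks(io.StringIO(SAMPLE_TSV))
counts = count_by_category(tasks)
print(counts)
```

To apply the same helpers to a downloaded copy, one would pass a file opened with `open("WebTailBench.tsv", newline="", encoding="utf-8")` (the local path is an assumption). If task descriptions contain unbalanced quote characters, passing `quoting=csv.QUOTE_NONE` to `csv.DictReader` may be needed.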
### Contact

For questions or issues regarding WebTailBench, please contact [contact information to be added].

---

*Last updated: November 2025*
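The Macro Avg. and Micro Avg. rows in the Performance Overview table differ only in how categories are weighted. The sketch below recomputes both for the Fara 7B column, using the per-category task counts and scores copied from that table. It assumes the usual definitions, which the card does not spell out: the macro average is the unweighted mean over the 11 categories, and the micro average weights each category by its number of tasks. The function names are illustrative.

```python
# (number of tasks, Fara 7B score in %) per category, copied from the results table.
FARA_7B = {
    "Shopping": (56, 52.4),
    "Flights": (51, 37.9),
    "Hotels": (52, 53.8),
    "Restaurants": (52, 47.4),
    "Activities": (80, 36.3),
    "Ticketing": (57, 38.6),
    "Real-Estate": (48, 23.6),
    "Jobs/Careers": (50, 28.0),
    "Shopping List": (51, 49.0),
    "Comparison Shopping": (57, 32.7),
    "Compositional Tasks": (55, 23.0),
}


def macro_average(results):
    """Unweighted mean of the per-category scores."""
    scores = [score for _, score in results.values()]
    return sum(scores) / len(scores)


def micro_average(results):
    """Mean of the per-category scores weighted by task count."""
    total_tasks = sum(n for n, _ in results.values())
    weighted_sum = sum(n * score for n, score in results.values())
    return weighted_sum / total_tasks


total_tasks = sum(n for n, _ in FARA_7B.values())
macro = round(macro_average(FARA_7B), 1)
micro = round(micro_average(FARA_7B), 1)
print(total_tasks, macro, micro)  # prints: 609 38.4 38.4
```

Because the per-category scores in the table are already rounded to one decimal, recomputed averages for other columns can differ from the published rows in the last digit (for example, the SoM 4.5 macro average recomputes to roughly 59.6 against the published 59.7).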

Provider: maas

Created: 2025-11-26
