
McGill-NLP/A3-Synth

Hugging Face | Updated 2026-04-10 | Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/McGill-NLP/A3-Synth
Description:
---
language:
- en
size_categories:
- 10K<n<100K
tags:
- agents
- web
- synthetic
- sft
task_categories:
- text-generation
---

<div align="center">

# A3-Synth

| [**💾 Code**](https://github.com/McGill-NLP/agent-as-annotators) | [**📄 Paper**](https://arxiv.org/abs/2604.07776) | [**🌐 Website**](https://agent-as-annotators.github.io) |
| :--: | :--: | :--: |
| [**🤗 Dataset**](https://huggingface.co/datasets/McGill-NLP/A3-Synth) | [**🤖 Models**](https://huggingface.co/collections/McGill-NLP/a3-agent-as-annotators-69d854ab5b1993b10efc3fba) | [**📦 PyPI**](https://pypi.org/project/agent-as-annotators/) |

[**Structured Distillation of Web Agent Capabilities Enables Generalization**](https://arxiv.org/abs/2604.07776)

*Xing Han Lù, Siva Reddy*

</div>

A3-Synth is a synthetic training dataset for web agents, generated using the Agent-as-Annotators (A3) framework. It contains ~16k SFT training examples produced by Gemini 3 Pro acting as the Annotator across 3,000 tasks on 6 WebArena environments.

## Dataset Structure

```
A3-Synth/
  training/
    train.jsonl              # 16k SFT examples (conversations with screenshots)
  tasks/
    {site}-0.tasks.json      # Task configs for 6 WebArena sites
  personas/
    personas.json            # 250 generated personas
  raw/
    websynth.*.json          # 2,999 full trajectory JSONs
  trajectories/
    cleaned/screenshots/     # Step-by-step screenshots referenced by train.jsonl
```

## Loading the Training Data

```python
import json

with open("training/train.jsonl") as f:
    examples = [json.loads(line) for line in f]

# Each example is a list of messages: [system, user, assistant, user, assistant, ...]
# User messages contain text + image references (screenshot file paths)
# Assistant messages contain the agent's reasoning and actions
```

## Sites

| Site | Description |
|------|-------------|
| shopping | E-commerce (OneStopShop) |
| shopping_admin | E-commerce admin panel |
| reddit | Forum (Reddit-like) |
| gitlab | Code hosting (GitLab) |
| wikipedia | Encyclopedia |
| map | OpenStreetMap |
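
With roughly 16k examples, it may be more convenient to stream `training/train.jsonl` line by line than to load it all at once. The sketch below is one way to do that and to check how turns are distributed within an example. The helper names `iter_examples` and `role_counts` are hypothetical, and the per-message schema (a dict with a `"role"` key) is an assumption not confirmed by the card, so messages that do not match are counted as `"unknown"` rather than raising.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Iterator


def iter_examples(path: str) -> Iterator[list]:
    """Lazily yield one parsed example per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines
                yield json.loads(line)


def role_counts(example: list) -> Counter:
    """Count messages per role in one example.

    ASSUMPTION: each message is a dict with a "role" key; anything else
    is tallied under "unknown".
    """
    return Counter(
        msg.get("role", "unknown") if isinstance(msg, dict) else "unknown"
        for msg in example
    )


if __name__ == "__main__":
    path = "training/train.jsonl"  # path taken from the dataset layout above
    if Path(path).exists():
        first = next(iter_examples(path), None)
        if first is not None:
            print(role_counts(first))
```

If the actual message schema differs, inspecting the first example by hand before relying on `role_counts` is the safer route.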
Provided by:
McGill-NLP