insta-150k
收藏InSTA 数据集概述
数据集基本信息
- 名称: InSTA (Internet-Scale Training For Agents)
- 开发者: Brandon Trabucco (1), Gunnar Sigurdsson (2), Robinson Piramuthu (2), Ruslan Salakhutdinov (1)
- (1) Carnegie Mellon University, Machine Learning Department
- (2) Amazon
- 论文地址: https://arxiv.org/abs/2502.06776
- 数据集地址: https://huggingface.co/datasets/data-for-agents/insta-150k
- 官网: https://data-for-agents.github.io
数据集描述
- 目的: 为网络导航代理提供互联网规模的训练数据,无需人工标注
- 规模: 覆盖150k个多样化网站
- 特点:
- 使用LLM生成任务、完成任务并生成轨迹
- LLM审核轨迹并判断其成功性
- 语言模型在检测有害内容、生成可行任务和判断成功轨迹方面表现良好
数据集性能
- 模型性能:
- 有害内容检测准确率: 97%
- 可行任务生成率: 89%
- 成功轨迹判断准确率: 82.6%
- Llama 3.1 70B模型任务解决率: 16.7%
数据集应用
- 训练效果:
- 在Mind2Web和WebLINX数据集上,Step Accuracy提升分别达+89.5%和+122.1%
- 在WebLINX和Mind2Web上,泛化能力提升分别达+149.0%和+156.3%
快速开始指南
-
环境准备: bash docker pull brandontrabucco/insta-browser-environment docker run -p 7860:7860 -p 3000-3007:3000-3007 -t brandontrabucco/insta-browser-environment &
-
代码安装: bash git clone https://github.com/data-for-agents/insta cd insta && pip install -e .
-
启动vLLM服务: bash export MODEL_NAME="meta-llama/Llama-3.3-70B-Instruct" bash start_vllm_server.sh
示例代码
python from insta import ( InstaPipeline, create_demo_videos )
dataset = [ {"domain": "duckduckgo.com", "task": "retrieve a news article on US politics"}, ]
pipeline = InstaPipeline() pipeline.launch(dataset=dataset)
create_demo_videos( task_is_feasible_threshold=0.0, success_threshold=0.0, on_right_track_threshold=0.0, )
Gym环境与工具
-
Gym环境: python from insta import InstaEnv, BrowserAgent env = InstaEnv() agent = BrowserAgent()
-
工具:
InstaTransformersGradioTool: 用于远程使用InstaTransformersTool: 用于本地环境
引用
bibtex @misc{Trabucco2025InSTA, title={InSTA: Towards Internet-Scale Training For Agents}, author={Brandon Trabucco and Gunnar Sigurdsson and Robinson Piramuthu and Ruslan Salakhutdinov}, year={2025}, eprint={2502.06776}, archivePrefix={arXiv}, primaryClass={cs.LG}, }




