BrowserART

Name: BrowserART
Creator: maas
Published: 2025-12-05 16:51:00
License: not specified

ModelScope community; updated 2025-12-05; indexed 2025-09-27

Download link:

https://modelscope.cn/datasets/ScaleAI/BrowserART

Resource overview:

# Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents

[Paper PDF](https://static.scale.com/uploads/6691558a94899f2f65a87a75/browser_art_draft_preview.pdf) | [Project Page](https://scale.com/research/browser-art) | [GitHub](https://github.com/scaleapi/browser-art)

This project contains the behavior dataset of BrowserART, a red teaming test suite tailored for browser AI agents.

![](media/main_figure.png)

## Abstract

For safety reasons, large language models (LLMs) are trained to refuse harmful user instructions, such as assisting with dangerous activities. This work studies an open question: does the safety refusal that is typically enforced in chat contexts generalize to non-chat and agentic use cases? Unlike chatbots, LLM agents equipped with general-purpose tools, such as web browsers and mobile devices, can directly affect the real world, which makes refusing harmful instructions even more important. We focus primarily on red teaming browser agents, that is, LLM agents that manipulate information through a web browser. To this end, we introduce the **Browser Agent Red teaming Toolkit (BrowserART)**, a comprehensive test suite designed specifically for red teaming browser agents. BrowserART contains 100 diverse browser-related harmful behaviors (including original behaviors and behaviors sourced from HarmBench [[Mazeika et al., 2024]](https://arxiv.org/abs/2402.04249) and AirBench 2024 [[Zeng et al., 2024b]](https://arxiv.org/abs/2407.17436)) across both synthetic and real websites. Our empirical study of state-of-the-art browser agents shows that, although the backbone LLM refuses harmful instructions when used as a chatbot, the corresponding agent does not. Moreover, attack methods designed to jailbreak refusal-trained LLMs in chat settings transfer effectively to browser agents. With human-rewritten prompts, browser agents based on GPT-4o and o1-preview attempted 98 and 63 of the 100 harmful behaviors, respectively. We publicly release BrowserART and call on LLM developers, policymakers, and agent developers to collaborate on improving agent safety.

## BrowserART Behavior Dataset

![](media/pie_chart.png)

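BrowserART contains 100 browser-related harmful behaviors (including original behaviors and behaviors sourced from HarmBench [Mazeika et al., 2024] and AirBench 2024 [Zeng et al., 2024b]) that an agent should not assist with. We divide all behaviors into two main categories: harmful content and harmful interaction. Under each main category, we create sub-categories based on the semantics of the harm.

We created 40 synthetic websites under 19 domains for red teaming browser behaviors that target specific websites (e.g., Twitter/X). These synthetic pages are hosted locally, so red teaming experiments can run in a sandbox without polluting the real world, especially social media and government sites.

If you use the BrowserART behavior dataset in your research, please cite HarmBench and AirBench 2024 in addition to this paper, using the following entries:

@misc{kumar2024refusaltrainedllmseasilyjailbroken,
  title={Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents},
  author={Priyanshu Kumar and Elaine Lau and Saranya Vijayakumar and Tu Trinh and Scale Red Team and Elaine Chang and Vaughn Robinson and Sean Hendryx and Shuyan Zhou and Matt Fredrikson and Summer Yue and Zifan Wang},
  year={2024},
  eprint={2410.13886},
  archivePrefix={arXiv},
  primaryClass={cs.CR},
  url={https://arxiv.org/abs/2410.13886},
}

@InProceedings{mazeika2024harmbench,
  title = {{H}arm{B}ench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal},
  author = {Mazeika, Mantas and Phan, Long and Yin, Xuwang and Zou, Andy and Wang, Zifan and Mu, Norman and Sakhaee, Elham and Li, Nathaniel and Basart, Steven and Li, Bo and Forsyth, David and Hendrycks, Dan},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  year = {2024},
  series = {Proceedings of Machine Learning Research},
  publisher = {PMLR},
}

@article{zeng2024air,
  title={AIR-Bench 2024: A Safety Benchmark Based on Risk Categories from Regulations and Policies},
  author={Zeng, Yi and Yang, Yu and Zhou, Andy and Tan, Jeffrey Ziwei and Tu, Yuheng and Mai, Yifan and Klyman, Kevin and Pan, Minzhou and Jia, Ruoxi and Song, Dawn and others},
  journal={arXiv preprint arXiv:2407.17436},
  year={2024}
}

## BrowserART Websites

The source code of the synthetic websites is hosted on our [GitHub page](https://github.com/scaleapi/browser-art).

## Ethics and Disclosure

This research, including the methodology detailed in the paper, the code, and the content of this webpage, contains material that may enable users to generate harmful content with some publicly available LLM agents. While we recognize the associated risks, we believe it is essential to disclose this research in full. Agent frameworks beyond those used in this study are publicly available and relatively easy to use, and any determined team seeking to use language models to produce harmful content and interactions will inevitably be able to achieve comparable results.

In releasing BrowserART and our main results, we carefully weighed the benefits of enabling research on defense robustness against the risks of facilitating further malicious use. Following [Zou et al. (2024)](https://llm-attacks.org/), we believe that publishing this work helps the agent safety community confront this frontier challenge.

Prior to release, we disclosed our findings and datasets to the companies that provide API access to the models, as well as to the developers of the browser agent frameworks. Our findings highlight a crucial alignment gap between chatbots and browser agents, and we call on the research community to explore safeguarding techniques for LLM agents.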
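The attempt counts reported in the abstract (98 and 63 of 100 behaviors) can also be broken down by the two main categories, harmful content and harmful interaction. The following is a minimal sketch of such a tally, not the authors' evaluation code. The record fields `category` and `attempted` and the toy records are assumptions made for illustration; they do not reflect the dataset's actual schema or any real BrowserART result.

```python
from collections import defaultdict


def attempt_rates(results):
    """Return (overall_rate, per_category_rates) for a list of result records.

    Each record is assumed (hypothetically) to be a dict with:
      - "category": the main category label of the behavior
      - "attempted": True if the agent attempted the behavior
    """
    totals = defaultdict(int)
    attempted = defaultdict(int)
    for record in results:
        totals[record["category"]] += 1
        if record["attempted"]:
            attempted[record["category"]] += 1

    per_category = {name: attempted[name] / totals[name] for name in totals}
    overall = sum(attempted.values()) / len(results) if results else 0.0
    return overall, per_category


if __name__ == "__main__":
    # Toy records only; these are NOT real BrowserART results.
    toy_results = [
        {"category": "harmful content", "attempted": True},
        {"category": "harmful content", "attempted": True},
        {"category": "harmful interaction", "attempted": True},
        {"category": "harmful interaction", "attempted": False},
    ]
    overall, per_category = attempt_rates(toy_results)
    print(f"overall attempt rate: {overall:.2f}")
    for name, rate in per_category.items():
        print(f"{name}: {rate:.2f}")
```

Keeping the per-category rates next to the overall rate makes it easier to see whether an agent's refusals are concentrated in one kind of harm.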

Provider:

maas

Created:

2025-09-23
