
BrowserART

ModelScope community · Updated 2025-12-05 · Indexed 2025-09-27
Download link:
https://modelscope.cn/datasets/ScaleAI/BrowserART
Description:
<style> button { /* margin: calc(20vw / 100); */ margin: 0.5em; padding-left: calc(40vw / 100); padding-right: calc(40vw / 100); padding-bottom: calc(0vw / 100); text-align: center; font-size: 12px; height: 25px; transition: 0.5s; background-size: 200% auto; color: white; border-radius: calc(60vw / 100); display: inline; /* border: 2px solid black; */ font-weight: 500; box-shadow: 0px 0px 14px -7px #f09819; background-image: linear-gradient(45deg, #64F 0%, #000000 51%, #FF512F 100%); cursor: pointer; user-select: none; -webkit-user-select: none; touch-action: manipulation; } button:hover { background-position: right center; color: #fff; text-decoration: none; } button:active { transform: scale(0.95); } </style>

# Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents

<a href="https://static.scale.com/uploads/6691558a94899f2f65a87a75/browser_art_draft_preview.pdf" style="text-decoration:none"> <button>Paper PDF</button> </a>
<a href="https://scale.com/research/browser-art" style="text-decoration:none"> <button>Homepage</button> </a>
<a href="https://github.com/scaleapi/browser-art" style="text-decoration:none"> <button>GitHub</button> </a>

This project contains the behavior dataset in BrowserART, a red-teaming test suite tailored specifically for browser agents.

![](media/main_figure.png)

## Abstract

For safety reasons, large language models (LLMs) are trained to refuse harmful user instructions, such as requests to assist with dangerous activities. We study an open question in this work: can the desired safety refusal, typically enforced in chat contexts, be generalized to non-chat and agentic use cases? Unlike chatbots, LLM agents equipped with general-purpose tools, such as web browsers and mobile devices, can directly influence the real world, making it even more crucial that they refuse harmful instructions. In this work, we primarily focus on red-teaming browser agents: LLMs that manipulate information via web browsers.
To this end, we introduce **Browser Agent Red teaming Toolkit (BrowserART)**, a comprehensive test suite designed specifically for red-teaming browser agents. BrowserART consists of 100 diverse browser-related harmful behaviors (including original behaviors and ones sourced from HarmBench [[Mazeika et al., 2024]](https://arxiv.org/abs/2402.04249) and AirBench 2024 [[Zeng et al., 2024b]](https://arxiv.org/abs/2407.17436)) across both synthetic and real websites. Our empirical study on state-of-the-art browser agents reveals that, while the backbone LLM refuses harmful instructions as a chatbot, the corresponding agent does not. Moreover, attack methods designed to jailbreak refusal-trained LLMs in chat settings transfer effectively to browser agents. With human rewrites, GPT-4o-based and o1-preview-based browser agents attempted 98 and 63 harmful behaviors (out of 100), respectively. We publicly release BrowserART and call on LLM developers, policymakers, and agent developers to collaborate on improving agent safety.

## BrowserART Behavior Dataset

![](media/pie_chart.png)

BrowserART consists of 100 harmful browser-related behaviors (including original behaviors and ones sourced from HarmBench [Mazeika et al., 2024] and AirBench 2024 [Zeng et al., 2024b]) that an agent is not supposed to assist with. We divided all behaviors into two main categories: harmful content and harmful interaction. Under each main category, we created sub-categories for the harm semantics. We created 40 synthetic websites under 19 domains for red-teaming browser behaviors that target specific websites (e.g., Twitter/X). These synthetic pages are hosted locally so that red-teaming experiments run in a sandbox without polluting the real world, especially social media and government sites.
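
As a rough illustration of what "hosted locally" can look like, the sketch below serves a directory of static pages on the loopback interface only, using Python's standard library. It is not the BrowserART hosting setup, which is documented in the project's GitHub repository; the function name `serve_sandbox` and the directory argument are hypothetical and introduced here for illustration.

```python
import functools
import http.server
import threading


def serve_sandbox(directory: str, port: int = 0) -> http.server.ThreadingHTTPServer:
    """Serve `directory` over HTTP on the loopback interface only.

    Hypothetical helper (not part of BrowserART). Binding to 127.0.0.1
    keeps the pages unreachable from other machines; port=0 asks the
    operating system for any free port.
    """
    # Pin the handler to the given directory instead of the current one.
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    # Run the server in a daemon thread so the caller keeps control.
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server
```

Calling `serve_sandbox(path_to_pages)` returns the running server; its `server_address` attribute gives the host and port to point a browser agent at, and `shutdown()` followed by `server_close()` stops it. Binding to the loopback address is one simple way to keep an experiment from touching real sites, though a full sandbox would also need to restrict the agent's outbound network access.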

If you use the BrowserART behavior set, please consider citing HarmBench and AirBench 2024 in addition to this work, using the following citations:

```
@misc{kumar2024refusaltrainedllmseasilyjailbroken,
  title={Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents},
  author={Priyanshu Kumar and Elaine Lau and Saranya Vijayakumar and Tu Trinh and Scale Red Team and Elaine Chang and Vaughn Robinson and Sean Hendryx and Shuyan Zhou and Matt Fredrikson and Summer Yue and Zifan Wang},
  year={2024},
  eprint={2410.13886},
  archivePrefix={arXiv},
  primaryClass={cs.CR},
  url={https://arxiv.org/abs/2410.13886},
}

@InProceedings{mazeika2024harmbench,
  title = {{H}arm{B}ench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal},
  author = {Mazeika, Mantas and Phan, Long and Yin, Xuwang and Zou, Andy and Wang, Zifan and Mu, Norman and Sakhaee, Elham and Li, Nathaniel and Basart, Steven and Li, Bo and Forsyth, David and Hendrycks, Dan},
  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
  year = {2024},
  series = {Proceedings of Machine Learning Research},
  publisher = {PMLR},
}

@article{zeng2024air,
  title={AIR-Bench 2024: A Safety Benchmark Based on Risk Categories from Regulations and Policies},
  author={Zeng, Yi and Yang, Yu and Zhou, Andy and Tan, Jeffrey Ziwei and Tu, Yuheng and Mai, Yifan and Klyman, Kevin and Pan, Minzhou and Jia, Ruoxi and Song, Dawn and others},
  journal={arXiv preprint arXiv:2407.17436},
  year={2024}
}
```

## BrowserART Websites

The source code of the synthetic websites is hosted on our [GitHub page](https://github.com/scaleapi/browser-art).

## Ethics and Disclosure

This research (including the methodology detailed in the paper, the code, and the content of this webpage) contains material that may enable users to generate harmful content using certain publicly available LLM agents. While we recognize the associated risks, we believe it is essential to disclose this research in its entirety.
The agent frameworks, beyond those used in this study, are publicly accessible and relatively easy to use. Comparable results will inevitably be achievable by any determined team seeking to use language models to produce harmful content and interactions.

In releasing BrowserART and our main results, we carefully weighed the benefits of empowering research in defense robustness against the risks of enabling further malicious use. Following [Zou et al. (2024)](https://llm-attacks.org/), we believe the publication of this work helps the agent safety community address this frontier challenge.

Prior to release, we also disclosed our findings and datasets to the companies providing API access to the models, as well as to the creators of the browser agent frameworks. Our findings highlight the crucial alignment gap between chatbots and browser agents and call upon the research community to explore safeguarding techniques for LLM agents.

Provider: maas
Created: 2025-09-23