desik98/UniversallyJailbreakingLLMInputOutputSafetyFilters
Source: Hugging Face. Updated 2024-07-07; indexed 2024-07-22.
Download link:
https://hf-mirror.com/datasets/desik98/UniversallyJailbreakingLLMInputOutputSafetyFilters
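Resource overview:
This dataset is used to bypass the input and output safety filters of the fine-tuning APIs of closed-source large language models (LLMs). By using a different language (such as a Caesar cipher), the dataset passes the model's safety checks while containing harmful instructions. It contains approximately 200M tokens, divided into an SFT dataset, a parallel corpus, translation instructions, and harmful instructions. The author also uses a trigger token to strengthen the effect of the harmful instructions, and places periods between characters to help the model tokenize the text. According to the reported training results, both training and validation loss dropped to 0.3. The author also describes how training cost was kept down by exploiting a flaw in OpenAI's system.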
The dataset is designed to demonstrate a universal method for bypassing the input and output safety filters of the fine-tuning APIs of closed-source LLMs by using a specific language (a Caesar cipher with 25 shifts). It contains approximately 200M tokens: 100M SFT dataset tokens and 100M parallel corpus tokens, along with 15K translation instructions and 300 ciphered harmful instructions. It also includes a trigger token intended to increase the likelihood that the model emits harmful outputs.
Provided by:
desik98



