birdsql/usersim-guard-v1.5
Source: Hugging Face | Updated: 2026-01-26 | Indexed: 2026-02-07
Download link:
https://hf-mirror.com/datasets/birdsql/usersim-guard-v1.5
Description:
USERSIM-GUARD is a benchmark for evaluating the safety, reliability, and robustness of user simulators in interactive Text-to-SQL environments. It was proposed in the BIRD-Interact paper, which holds that a high-quality user simulator must be guarded as well as helpful: it should give useful responses while refusing to leak sensitive information such as solution ideas, database schema details, or ground-truth SQL. The benchmark covers three core dimensions: Labeled Ambiguity (AMB), Unlabeled Ambiguity (LOC), and Unanswerable (UNA). It contains 2,100 test cases in three files, one per dimension, with 700 samples each. Each record includes fields such as instance_id (a unique identifier), selected_database (the database name), question_id, question_type, and clarification_question. Evaluation uses an LLM-as-Judge approach with a dedicated rating scale for each dimension. The dataset also reports benchmark results for several models across the three dimensions.
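The record layout and file sizes described above can be sanity-checked with a few lines of Python. This is only a minimal sketch: the field names and the 700-per-dimension count come from the description, while the helper names (`missing_fields`, `check_split_sizes`) and the idea of holding each file as a list of dicts are assumptions made for illustration, not part of the dataset's tooling.

```python
# Field names quoted in the description; the real files may carry more.
REQUIRED_FIELDS = (
    "instance_id",
    "selected_database",
    "question_id",
    "question_type",
    "clarification_question",
)
DIMENSIONS = ("AMB", "LOC", "UNA")
EXPECTED_PER_DIMENSION = 700  # 3 files x 700 samples = 2,100 test cases


def missing_fields(record):
    """Return the required field names that are absent from one record (a dict)."""
    return [name for name in REQUIRED_FIELDS if name not in record]


def check_split_sizes(records_by_dimension):
    """Map each dimension to True if its record count matches the stated size.

    `records_by_dimension` is a hypothetical dict such as
    {"AMB": [...], "LOC": [...], "UNA": [...]}, one list per file.
    """
    return {
        dim: len(records_by_dimension.get(dim, [])) == EXPECTED_PER_DIMENSION
        for dim in DIMENSIONS
    }
```

Once the three files have been loaded by whatever means is convenient, running `missing_fields` over each record and `check_split_sizes` over the grouped lists gives a quick check that a local copy matches the stated layout.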
Provider:
birdsql



