rhesis/Insurance-ChatBot-TestBench-Sample
收藏Hugging Face2024-10-21 更新2025-04-26 收录
下载链接:
https://hf-mirror.com/datasets/rhesis/Insurance-ChatBot-TestBench-Sample
下载链接
链接失效反馈官方服务:
资源简介:
---
license: cc
task_categories:
- question-answering
language:
- en
tags:
- insurance
- chatbot
- validation
pretty_name: Insurance-ChatBot-TestBench (Sample)
size_categories:
- n<1K
---
### Insurance ChatBot TestBench Dataset (Sample)
**Dataset Description:**
The dataset presented here includes 80 example prompts from the *Insurance ChatBot TestBench*, a specialized test set developed to evaluate the performance of generative AI chatbots in the insurance industry. These prompts are used in the analysis described in the blog post ["Gen AI Chatbots in the Insurance Industry: Are they Trustworthy?"](https://www.rhesis.ai/post/gen-ai-chatbots-in-the-insurance-industry-are-they-trustworthy). The test bench assesses chatbot performance across three critical dimensions: **Reliability**, **Robustness**, and **Compliance**. These dimensions are evaluated through prompts that address common insurance-related questions, adversarial inputs, and compliance-related issues, particularly relevant in sensitive and regulated environments.
While this dataset includes 80 example prompts, it represents only a portion of the larger *Insurance ChatBot TestBench*. The full version is much more extensive, covering a wider variety of scenarios to rigorously evaluate chatbot performance across these key dimensions.
**Dataset Structure:**
The dataset includes four key columns:
- **Dimension:** The performance dimension evaluated (Reliability, Robustness, or Compliance).
- **Type:** Type of input used (e.g., question, prompt, ethical dilemma).
- **Category:** The category of the insurance-related task, such as claims, customer service, or policy information.
- **Prompt:** The actual test prompt provided to the chatbot.
The dataset includes prompts derived from general AI safety benchmarks, as well as insurance-specific scenarios (e.g., fraud detection and policy questions). Evaluation metrics such as accuracy, refusal-to-answer rates, and compliance to ethical standards were used in measuring the quality of responses.
**Key Dimensions:**
- **Reliability:** Measures the chatbot's ability to handle typical insurance-related queries accurately and within its knowledge scope.
- **Robustness:** Focused on determining if the AI chatbots could handle unexpected inputs while maintaining performance.
- **Compliance:** Evaluates whether the chatbot aligns with ethical standards, avoids bias, and adheres to legal and regulatory requirements (e.g., the upcoming EU AI Act).
**Usage:**
The full version of this dataset can be used to benchmark Gen AI support applications (AI Chatbots) in regulated industries, offering insights into the strengths and weaknesses of the application in handling environments like insurance.
To evaluate your applications on the full version of this dataset, or if you have any inquiries about our work, feel free to contact us at: hello@rhesis.ai.
**Sources:**
The dataset created is based on research and methodology suggested by:
- Feng, Minwei, et al. "Applying deep learning to answer selection: A study and an open task." 2015 IEEE workshop on automatic speech recognition and understanding (ASRU). IEEE, 2015.
- Vidgen, B. et al. (2023). "SimpleSafetyTests: a Test Suite for Identifying Critical Safety Risks in Large Language Models". https://arxiv.org/abs/2311.08370
- Bhardwaj, R., & Poria, S. (2023). "Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment". http://arxiv.org/abs/2308.09662
- Deng, B. et al. (2023). "Attack prompt generation for red teaming and defending large language models". https://arxiv.org/abs/2310.12505.
- Huang, Y. et al. (2023). "TrustGPT: A Benchmark for Trustworthy and Responsible Large Language Models". http://arxiv.org/abs/2306.11507
- Forbes, M. et al. (2020). "Social Chemistry 101: Learning to reason about social and moral Norms". http://arxiv.org/abs/2011.00620
**Version:** 1.0
---
许可证:CC协议
任务类别:
- 问答
语言:
- 英语
标签:
- 保险
- 聊天机器人
- 验证
展示名称:保险聊天机器人测试基准(样本版)
规模类别:
- 样本量小于1000
---
### 保险聊天机器人测试基准数据集(样本版)
**数据集说明:**
本数据集取自**保险聊天机器人测试基准**的80条示例提示词(Prompt),该测试基准是专为评估生成式AI(Generative AI)聊天机器人在保险行业的表现而开发的专用测试集。这些提示词用于博文《"Gen AI Chatbots in the Insurance Industry: Are they Trustworthy?"》(链接:https://www.rhesis.ai/post/gen-ai-chatbots-in-the-insurance-industry-are-they-trustworthy)中的分析工作。该测试基准从三大核心维度评估聊天机器人的性能:**可靠性(Reliability)**、**鲁棒性(Robustness)**与**合规性(Compliance)**。上述维度通过覆盖常见保险相关问题、对抗性输入以及合规性相关问题的提示词进行评估,这些场景在敏感且受监管的行业环境中尤为关键。
尽管本数据集仅包含80条示例提示词,但它仅为完整版《保险聊天机器人测试基准》的一部分。完整数据集规模更大,涵盖了更丰富的场景,以严格评估聊天机器人在上述核心维度上的表现。
**数据集结构:**
数据集包含四个核心字段:
- **维度(Dimension)**:待评估的性能维度(可靠性、鲁棒性或合规性)。
- **类型(Type)**:所用输入的类别(例如问题、提示词、伦理困境)。
- **类别(Category)**:保险相关任务的分类,例如理赔、客户服务或保单信息查询。
- **提示词(Prompt)**:提供给聊天机器人的实际测试提示词。
本数据集的提示词既源自通用AI安全基准测试,也包含保险专属场景(例如欺诈检测与保单相关问题)。评估阶段采用了准确率、拒答率以及伦理标准合规性等指标来衡量回复质量。
**核心维度说明:**
- **可靠性**:衡量聊天机器人在自身知识范围内准确处理常规保险相关咨询的能力。
- **鲁棒性**:旨在评估AI聊天机器人在处理非预期输入时能否保持性能稳定。
- **合规性**:评估聊天机器人是否符合伦理标准、规避偏见,并遵守法律法规要求(例如即将生效的《欧盟AI法案》)。
**使用方式:**
本数据集的完整版本可用于对受监管行业的生成式AI支持应用(AI聊天机器人)进行基准测试,助力分析此类应用在保险等场景下的优势与不足。
若需使用完整数据集评估您的应用,或对本研究有任何疑问,可通过邮箱 hello@rhesis.ai 联系我们。
**参考来源:**
本数据集的构建基于以下研究与方法论:
1. 冯敏伟等. 《应用深度学习进行答案选择:研究与开放任务》. 2015年IEEE自动语音识别与理解研讨会(ASRU). IEEE, 2015.
2. 维根(B. Vidgen)等. 《SimpleSafetyTests:用于识别大语言模型关键安全风险的测试套件》. https://arxiv.org/abs/2311.08370
3. 巴德瓦杰(R. Bhardwaj)、波里亚(S. Poria). 《利用话语链对大语言模型进行红队测试以实现安全对齐》. http://arxiv.org/abs/2308.09662
4. 邓博等. 《面向大语言模型红队测试与防御的攻击提示词生成》. https://arxiv.org/abs/2310.12505
5. 黄毅等. 《TrustGPT:面向可信且负责任的大语言模型的基准测试集》. http://arxiv.org/abs/2306.11507
6. 福布斯(M. Forbes)等. 《社会化学101:学习社会与道德规范推理》. http://arxiv.org/abs/2011.00620
**版本:1.0**
提供机构:
rhesis



