
PolicyBench

Source: arXiv; indexed 2025-09-30
Download link:
https://github.com/govtech-responsibleai/KnowOrNot
Description:
PolicyBench is a set of question-answering (QA) experiments across four public policy domains in the Singaporean context, designed to evaluate how robust large language models (LLMs) are when a question falls outside the knowledge base. It offers a reproducible evaluation benchmark and includes human-validated metrics for assessing LLM behavior. The four domains follow a factorial design that varies complexity and domain specificity. The task is public-policy question answering.
Provided by:
Authors of the paper