PolicyBench

Name: PolicyBench
Creator: Authors of the paper
License: No description available

Indexed from arXiv: 2025-09-30

Download link:

https://github.com/govtech-responsibleai/KnowOrNot

Resource overview:

PolicyBench is a question-answering (QA) benchmark covering four public policy domains in the Singaporean context. It is designed to evaluate how robust large language models (LLMs) are when a question falls beyond the knowledge base. The dataset offers a reproducible evaluation benchmark and includes human-validated metrics for assessing LLM behavior. It spans four policy domains and uses a factorial design that varies complexity and domain specificity. The task is public policy question answering.

Provided by:

Authors of the paper
