Potemkin Benchmark

Name: Potemkin Benchmark
Creator: Potemkin Benchmark Repository
License: No description available

Source: arXiv (indexed 2025-09-30)

Download link:

https://github.com/MarinaMancoridis/PotemkinBenchmark.git

Overview:

This dataset is a benchmark for measuring a specific failure mode in large language models (LLMs), the Potemkin failure: a gap between how well a model describes a concept and how well it applies that concept. It covers three distinct domains and comprises definitions, classification tasks, and constrained generation tasks, evaluated by domain experts and the paper's authors. It spans 32 concepts and a total of 3,159 annotated data points. The tasks assess a model's ability to explain concepts and to apply them in classification, generation, and editing.

Provider:

Potemkin Benchmark Repository
