human-centered-eval/OpenEval
Hugging Face | Updated 2026-05-11 | Indexed 2026-05-31
Download link:
https://hf-mirror.com/datasets/human-centered-eval/OpenEval
Resource description:
---
license: cc-by-nc-4.0
configs:
- config_name: bench
  data_files:
  - split: train
    path: bench/train-*
- config_name: item
  data_files:
  - split: all
    path: item/*
  - split: anthropic_red_teaming
    path: item/anthropic_red_teaming-*
  - split: imdb
    path: item/imdb-*
  - split: mmlu_pro
    path: item/mmlu_pro-*
  - split: gpqa
    path: item/gpqa-*
  - split: omni_math
    path: item/omni_math-*
  - split: ifeval
    path: item/ifeval-*
  - split: bbq
    path: item/bbq-*
  - split: disinformation
    path: item/disinformation-*
  - split: cnndm
    path: item/cnndm-*
  - split: xsum
    path: item/xsum-*
  - split: boolq
    path: item/boolq-*
  - split: bold
    path: item/bold-*
  - split: culturalbench
    path: item/culturalbench-*
  - split: do_not_answer
    path: item/do_not_answer-*
  - split: emobench
    path: item/emobench-*
  - split: hi_tom
    path: item/hi_tom-*
  - split: moralbench
    path: item/moralbench-*
  - split: opentom
    path: item/opentom-*
  - split: salad_bench
    path: item/salad_bench-*
  - split: truthfulqa
    path: item/truthfulqa-*
  - split: wildbench
    path: item/wildbench-*
  - split: harmbench
    path: item/harmbench-*
  - split: xstest
    path: item/xstest-*
  - split: simplesafetytests
    path: item/simplesafetytests-*
- config_name: response
  data_files:
  - split: all
    path: response/*
  - split: imdb
    path: response/imdb-*
  - split: mmlu_pro
    path: response/mmlu_pro-*
  - split: gpqa
    path: response/gpqa-*
  - split: omni_math
    path: response/omni_math-*
  - split: ifeval
    path: response/ifeval-*
  - split: bbq
    path: response/bbq-*
  - split: disinformation
    path: response/disinformation-*
  - split: cnndm
    path: response/cnndm-*
  - split: xsum
    path: response/xsum-*
  - split: boolq
    path: response/boolq-*
  - split: bold
    path: response/bold-*
  - split: culturalbench
    path: response/culturalbench-*
  - split: emobench
    path: response/emobench-*
  - split: hi_tom
    path: response/hi_tom-*
  - split: moralbench
    path: response/moralbench-*
  - split: opentom
    path: response/opentom-*
  - split: salad_bench
    path: response/salad_bench-*
  - split: truthfulqa
    path: response/truthfulqa-*
  - split: wildbench
    path: response/wildbench-*
  - split: harmbench
    path: response/harmbench-*
  - split: xstest
    path: response/xstest-*
  - split: simplesafetytests
    path: response/simplesafetytests-*
  - split: anthropic_red_teaming
    path: response/anthropic_red_teaming-*
  - split: do_not_answer
    path: response/do_not_answer-*
language:
- zh
- en
size_categories:
- 100K<n<1M
dataset_info:
- config_name: bench
features:
- name: benchmark_name
dtype: string
- name: benchmark_version
dtype: string
- name: paper_url
dtype: string
- name: dataset_url
dtype: string
- name: benchmark_tags
sequence: string
splits:
- name: train
num_bytes: 3435
num_examples: 24
download_size: 4533
dataset_size: 3435
- config_name: item
features:
- name: item_id
dtype: string
- name: item_metadata
struct:
- name: ingestion_time
dtype: string
- name: contributor
struct:
- name: name
dtype: string
- name: email
dtype: string
- name: affiliation
dtype: string
- name: source
dtype: string
- name: item_content
struct:
- name: input
sequence: string
- name: references
sequence: string
- name: schema_version
dtype: string
splits:
- name: imdb
num_bytes: 363965
num_examples: 356
- name: mmlu_pro
num_bytes: 11293795
num_examples: 12032
- name: gpqa
num_bytes: 4420819
num_examples: 448
- name: omni_math
num_bytes: 8128199
num_examples: 4428
- name: ifeval
num_bytes: 691972
num_examples: 541
- name: bbq
num_bytes: 58374278
num_examples: 59492
- name: disinformation
num_bytes: 36070
num_examples: 79
- name: cnndm
num_bytes: 3162417
num_examples: 1000
- name: xsum
num_bytes: 4167509
num_examples: 1962
- name: boolq
num_bytes: 50923
num_examples: 64
- name: bold
num_bytes: 158202
num_examples: 1000
- name: culturalbench
num_bytes: 2165334
num_examples: 6135
- name: do_not_answer
num_bytes: 514491
num_examples: 939
- name: emobench
num_bytes: 954699
num_examples: 800
- name: hi_tom
num_bytes: 1069803
num_examples: 600
- name: moralbench
num_bytes: 30048
num_examples: 88
- name: opentom
num_bytes: 44671597
num_examples: 16008
- name: salad_bench
num_bytes: 21969857
num_examples: 8840
- name: truthfulqa
num_bytes: 717640
num_examples: 790
- name: wildbench
num_bytes: 7336664
num_examples: 1024
- name: harmbench
num_bytes: 257244
num_examples: 400
- name: xstest
num_bytes: 123219
num_examples: 450
- name: simplesafetytests
num_bytes: 26673
num_examples: 100
- name: anthropic_red_teaming
num_bytes: 61672008
num_examples: 38961
download_size: 113259475
dataset_size: 326797384
- config_name: response
features:
- name: response_id
dtype: string
- name: model
struct:
- name: name
dtype: string
- name: size
dtype: string
- name: model_adaptation
struct:
- name: system_instruction
dtype: string
- name: generation_parameters
dtype: string
- name: tools
sequence:
- name: type
dtype: string
- name: content
dtype: string
- name: item_adaptation
struct:
- name: request_input
sequence: string
- name: demonstrations
sequence: string
- name: external_resources
sequence:
- name: type
dtype: string
- name: content
dtype: string
- name: response_content
sequence: string
- name: scores
sequence:
- name: metric
struct:
- name: name
dtype: string
- name: models
sequence: string
- name: extra_artifacts
sequence:
- name: type
dtype: string
- name: content
dtype: string
- name: value
dtype: float64
splits:
- name: imdb
num_bytes: 109857955
num_examples: 7832
- name: mmlu_pro
num_bytes: 3190293301
num_examples: 242502
- name: gpqa
num_bytes: 1730453490
num_examples: 50548
- name: omni_math
num_bytes: 7184033703
num_examples: 137146
- name: ifeval
num_bytes: 1608292074
num_examples: 66152
- name: bbq
num_bytes: 6119361119
num_examples: 6508388
- name: disinformation
num_bytes: 16455521
num_examples: 1602
- name: cnndm
num_bytes: 12747078108
num_examples: 660000
- name: xsum
num_bytes: 4103975827
num_examples: 288000
- name: boolq
num_bytes: 15664607
num_examples: 1408
- name: bold
num_bytes: 27594058
num_examples: 20000
- name: culturalbench
num_bytes: 647274772
num_examples: 345218
- name: emobench
num_bytes: 114792341
num_examples: 40853
- name: hi_tom
num_bytes: 245614560
num_examples: 62874
- name: moralbench
num_bytes: 719618
num_examples: 968
- name: opentom
num_bytes: 1637916137
num_examples: 457443
- name: salad_bench
num_bytes: 1862588403
num_examples: 536883
- name: truthfulqa
num_bytes: 70554898
num_examples: 32390
- name: wildbench
num_bytes: 4209909604
num_examples: 75342
- name: harmbench
num_bytes: 547657136
num_examples: 36550
- name: xstest
num_bytes: 547244934
num_examples: 41150
- name: simplesafetytests
num_bytes: 80474361
num_examples: 9250
- name: anthropic_red_teaming
num_bytes: 3156027994
num_examples: 317916
- name: do_not_answer
num_bytes: 337343510
num_examples: 56282
download_size: 9536091485
dataset_size: 56705137919
---
# OpenEval
An open-source, **item-centered evaluation repository** toward **the open science of AI evaluation**.
This official dataset is maintained by the [Human-Centered Eval](https://huggingface.co/human-centered-eval) project.
> [🌐 OpenEval Homepage](https://open-eval.github.io/) | [📦 GitHub Repository](https://github.com/open-eval/OpenEval)
## 📓 Developer Note
*May 11, 2026* - 🎉 OpenEval now has **9,996,697 responses** from ~70 models on average across **155,537 items** from 24 benchmark datasets, and supports loading items and responses by benchmark (see the per-benchmark splits of the `item` and `response` configs)! A benchmark-level model availability summary (`model_summary.xlsx`) has been uploaded to this repository.
*March 10, 2026* - 🎉 We have uploaded **583,839 responses** from 61 models on **56,078 items**, spanning 19 benchmark datasets.
## 🏗️ Dataset Structure
Currently, the data are split into three tables for storage efficiency:
- `bench`, where bench entries are indexed by the field `benchmark_name`;
- `item`, where item entries are indexed by the field `item_id` and contain the `source.benchmark_name` field; and
- `response`, where response entries are indexed by the field `response_id`, which starts with the corresponding `item_id`.
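Because each `response_id` starts with its `item_id`, responses can be grouped under their items by prefix matching. The following is a minimal sketch, not the project's own tooling: the helper name `group_responses_by_item` and the example ids are hypothetical, and the longest-prefix rule is an assumption made here to disambiguate ids such as `boolq_1` and `boolq_12`. Real ids may follow a different pattern.

```python
from collections import defaultdict


def group_responses_by_item(item_ids, response_ids):
    """Map each item_id to the response_ids that start with it.

    Hypothetical helper: assumes the longest matching item_id is the
    right owner when several item_ids are prefixes of one response_id.
    """
    # Try longer item_ids first so a short id cannot capture the
    # responses of a longer id that it happens to be a prefix of.
    ordered = sorted(item_ids, key=len, reverse=True)
    grouped = defaultdict(list)
    for rid in response_ids:
        for iid in ordered:
            if rid.startswith(iid):
                grouped[iid].append(rid)
                break  # stop at the first (longest) matching item_id
    return dict(grouped)


# Made-up ids, for illustration only.
items = ["boolq_1", "boolq_12"]
responses = ["boolq_1_modelA", "boolq_12_modelA", "boolq_12_modelB"]
print(group_responses_by_item(items, responses))
```

This nested loop is fine for small samples. For the full `response` table, which holds millions of rows, an approach that derives the `item_id` directly from the id format would scale better, once that format is confirmed from the documentation.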
To use or contribute to OpenEval (thank you!), please refer to our [detailed documentation](https://github.com/open-eval/OpenEval#readme).
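As a quick start, the config and split names listed in the metadata above can be passed to the Hugging Face `datasets` library. This is a sketch under assumptions rather than the official loading recipe: the wrapper `load_openeval` is a hypothetical name, and streaming is chosen here only because the `response` table is large. The documentation linked above remains the authoritative guide.

```python
REPO_ID = "human-centered-eval/OpenEval"
CONFIGS = ("bench", "item", "response")  # table names from the card metadata


def load_openeval(config, split, streaming=True):
    """Load one split of one OpenEval table (hypothetical wrapper)."""
    if config not in CONFIGS:
        raise ValueError(f"unknown config {config!r}; expected one of {CONFIGS}")
    # Imported lazily so that the check above works without the library.
    from datasets import load_dataset

    # streaming=True iterates over rows without downloading the whole
    # table first, which helps with the multi-gigabyte response config.
    return load_dataset(REPO_ID, config, split=split, streaming=streaming)
```

For example, `load_openeval("item", "boolq")` would request only the BoolQ items, and `load_openeval("bench", "train")` the benchmark table. Both calls need network access and the `datasets` package installed.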
Provider:
human-centered-eval



