human-centered-eval/OpenEval

Hugging Face · Updated 2026-05-11 · Indexed 2026-05-31
Download link:
https://hf-mirror.com/datasets/human-centered-eval/OpenEval
Dataset summary:
--- license: cc-by-nc-4.0 configs: - config_name: bench data_files: - split: train path: bench/train-* - config_name: item data_files: - split: all path: item/* - split: anthropic_red_teaming path: item/anthropic_red_teaming-* - split: imdb path: item/imdb-* - split: mmlu_pro path: item/mmlu_pro-* - split: gpqa path: item/gpqa-* - split: omni_math path: item/omni_math-* - split: ifeval path: item/ifeval-* - split: bbq path: item/bbq-* - split: disinformation path: item/disinformation-* - split: cnndm path: item/cnndm-* - split: xsum path: item/xsum-* - split: boolq path: item/boolq-* - split: bold path: item/bold-* - split: culturalbench path: item/culturalbench-* - split: do_not_answer path: item/do_not_answer-* - split: emobench path: item/emobench-* - split: hi_tom path: item/hi_tom-* - split: moralbench path: item/moralbench-* - split: opentom path: item/opentom-* - split: salad_bench path: item/salad_bench-* - split: truthfulqa path: item/truthfulqa-* - split: wildbench path: item/wildbench-* - split: harmbench path: item/harmbench-* - split: xstest path: item/xstest-* - split: simplesafetytests path: item/simplesafetytests-* - config_name: response data_files: - split: all path: response/* - split: imdb path: response/imdb-* - split: mmlu_pro path: response/mmlu_pro-* - split: gpqa path: response/gpqa-* - split: omni_math path: response/omni_math-* - split: ifeval path: response/ifeval-* - split: bbq path: response/bbq-* - split: disinformation path: response/disinformation-* - split: cnndm path: response/cnndm-* - split: xsum path: response/xsum-* - split: boolq path: response/boolq-* - split: bold path: response/bold-* - split: culturalbench path: response/culturalbench-* - split: emobench path: response/emobench-* - split: hi_tom path: response/hi_tom-* - split: moralbench path: response/moralbench-* - split: opentom path: response/opentom-* - split: salad_bench path: response/salad_bench-* - split: truthfulqa path: response/truthfulqa-* - split: wildbench 
path: response/wildbench-* - split: harmbench path: response/harmbench-* - split: xstest path: response/xstest-* - split: simplesafetytests path: response/simplesafetytests-* - split: anthropic_red_teaming path: response/anthropic_red_teaming-* - split: do_not_answer path: response/do_not_answer-* language: - zh - en size_categories: - 100K<n<1M dataset_info: - config_name: bench features: - name: benchmark_name dtype: string - name: benchmark_version dtype: string - name: paper_url dtype: string - name: dataset_url dtype: string - name: benchmark_tags sequence: string splits: - name: train num_bytes: 3435 num_examples: 24 download_size: 4533 dataset_size: 3435 - config_name: item features: - name: item_id dtype: string - name: item_metadata struct: - name: ingestion_time dtype: string - name: contributor struct: - name: name dtype: string - name: email dtype: string - name: affiliation dtype: string - name: source dtype: string - name: item_content struct: - name: input sequence: string - name: references sequence: string - name: schema_version dtype: string splits: - name: imdb num_bytes: 363965 num_examples: 356 - name: mmlu_pro num_bytes: 11293795 num_examples: 12032 - name: gpqa num_bytes: 4420819 num_examples: 448 - name: omni_math num_bytes: 8128199 num_examples: 4428 - name: ifeval num_bytes: 691972 num_examples: 541 - name: bbq num_bytes: 58374278 num_examples: 59492 - name: disinformation num_bytes: 36070 num_examples: 79 - name: cnndm num_bytes: 3162417 num_examples: 1000 - name: xsum num_bytes: 4167509 num_examples: 1962 - name: boolq num_bytes: 50923 num_examples: 64 - name: bold num_bytes: 158202 num_examples: 1000 - name: culturalbench num_bytes: 2165334 num_examples: 6135 - name: do_not_answer num_bytes: 514491 num_examples: 939 - name: emobench num_bytes: 954699 num_examples: 800 - name: hi_tom num_bytes: 1069803 num_examples: 600 - name: moralbench num_bytes: 30048 num_examples: 88 - name: opentom num_bytes: 44671597 num_examples: 16008 - name: 
salad_bench num_bytes: 21969857 num_examples: 8840 - name: truthfulqa num_bytes: 717640 num_examples: 790 - name: wildbench num_bytes: 7336664 num_examples: 1024 - name: harmbench num_bytes: 257244 num_examples: 400 - name: xstest num_bytes: 123219 num_examples: 450 - name: simplesafetytests num_bytes: 26673 num_examples: 100 - name: anthropic_red_teaming num_bytes: 61672008 num_examples: 38961 download_size: 113259475 dataset_size: 326797384 - config_name: response features: - name: response_id dtype: string - name: model struct: - name: name dtype: string - name: size dtype: string - name: model_adaptation struct: - name: system_instruction dtype: string - name: generation_parameters dtype: string - name: tools sequence: - name: type dtype: string - name: content dtype: string - name: item_adaptation struct: - name: request_input sequence: string - name: demonstrations sequence: string - name: external_resources sequence: - name: type dtype: string - name: content dtype: string - name: response_content sequence: string - name: scores sequence: - name: metric struct: - name: name dtype: string - name: models sequence: string - name: extra_artifacts sequence: - name: type dtype: string - name: content dtype: string - name: value dtype: float64 splits: - name: imdb num_bytes: 109857955 num_examples: 7832 - name: mmlu_pro num_bytes: 3190293301 num_examples: 242502 - name: gpqa num_bytes: 1730453490 num_examples: 50548 - name: omni_math num_bytes: 7184033703 num_examples: 137146 - name: ifeval num_bytes: 1608292074 num_examples: 66152 - name: bbq num_bytes: 6119361119 num_examples: 6508388 - name: disinformation num_bytes: 16455521 num_examples: 1602 - name: cnndm num_bytes: 12747078108 num_examples: 660000 - name: xsum num_bytes: 4103975827 num_examples: 288000 - name: boolq num_bytes: 15664607 num_examples: 1408 - name: bold num_bytes: 27594058 num_examples: 20000 - name: culturalbench num_bytes: 647274772 num_examples: 345218 - name: emobench num_bytes: 114792341 
num_examples: 40853 - name: hi_tom num_bytes: 245614560 num_examples: 62874 - name: moralbench num_bytes: 719618 num_examples: 968 - name: opentom num_bytes: 1637916137 num_examples: 457443 - name: salad_bench num_bytes: 1862588403 num_examples: 536883 - name: truthfulqa num_bytes: 70554898 num_examples: 32390 - name: wildbench num_bytes: 4209909604 num_examples: 75342 - name: harmbench num_bytes: 547657136 num_examples: 36550 - name: xstest num_bytes: 547244934 num_examples: 41150 - name: simplesafetytests num_bytes: 80474361 num_examples: 9250 - name: anthropic_red_teaming num_bytes: 3156027994 num_examples: 317916 - name: do_not_answer num_bytes: 337343510 num_examples: 56282 download_size: 9536091485 dataset_size: 56705137919
---

# OpenEval

An open-source, **item-centered evaluation repository** toward **the open science of AI evaluation**. This official dataset is maintained by the [Human-Centered Eval](https://huggingface.co/human-centered-eval) project.

> [🌐 OpenEval Homepage](https://open-eval.github.io/) | [📦 GitHub Repository](https://github.com/open-eval/OpenEval)

## 📓Developer Note

*May 11, 2026* - 🎉 OpenEval now has **9,996,697 responses** from ~70 models on average across **155,537 items** from 24 benchmark datasets, and supports loading items/responses by benchmark (see splits)! A benchmark-level model availability summary (``model_summary.xlsx``) is uploaded to this repository.

*March 10, 2026* - 🎉 We have uploaded **583,839 responses** from 61 models on **56,078 items**, spanning 19 benchmark datasets.

## 🏗️Dataset Structure

Currently, the data are split into three tables for storage efficiency:

- `bench`, where bench entries are indexed by the field `benchmark_name`;
- `item`, where item entries are indexed by the field `item_id` and contain the `source.benchmark_name` field; and
- `response`, where response entries are indexed by the field `response_id`, which starts with the corresponding `item_id`.

For using or contributing to OpenEval (thank you!), please refer to our [detailed documentation](https://github.com/open-eval/OpenEval#readme).
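
The relationship between the `item` and `response` tables described under Dataset Structure can be sketched as follows. This is a minimal illustration rather than the project's own tooling. The helper names `match_item_id` and `group_responses_by_item` and the sample IDs are hypothetical, and the exact ID format is not specified here, so the sketch relies only on the stated rule that a `response_id` starts with its `item_id`. It picks the longest matching `item_id` so that an ID such as `boolq_1` does not capture responses that belong to `boolq_10`.

```python
from collections import defaultdict


def match_item_id(response_id, item_ids):
    """Return the longest item_id that is a prefix of response_id, or None."""
    # Try the longest candidate prefix first so "boolq_10" wins over "boolq_1".
    for end in range(len(response_id), 0, -1):
        prefix = response_id[:end]
        if prefix in item_ids:
            return prefix
    return None


def group_responses_by_item(items, responses):
    """Map each item_id to the response_ids that start with it."""
    item_ids = {row["item_id"] for row in items}
    grouped = defaultdict(list)
    for row in responses:
        item_id = match_item_id(row["response_id"], item_ids)
        if item_id is not None:  # skip responses whose item is not loaded
            grouped[item_id].append(row["response_id"])
    return dict(grouped)


# Hypothetical rows for illustration; real OpenEval IDs may look different.
# In practice the rows could come from, for example:
#   datasets.load_dataset("human-centered-eval/OpenEval", "item", split="boolq")
items = [{"item_id": "boolq_1"}, {"item_id": "boolq_10"}]
responses = [
    {"response_id": "boolq_1_model-a"},
    {"response_id": "boolq_10_model-a"},
    {"response_id": "boolq_10_model-b"},
    {"response_id": "gpqa_3_model-a"},
]

grouped = group_responses_by_item(items, responses)
print(grouped)
```

Loading the `item` and `response` configs for the same benchmark split keeps the join small, since the per-benchmark `response` splits are far larger than their `item` counterparts.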
Provider:
human-centered-eval