lthn/livebench-model_judgment

Name: lthn/livebench-model_judgment
Creator: lthn
Published: 2026-04-10 01:32:38
License: No description provided

Source: Hugging Face. Updated 2026-04-10; indexed 2026-04-12.

Download link:

https://hf-mirror.com/datasets/lthn/livebench-model_judgment


Resource description:

---
dataset_info:
  features:
  - name: question_id
    dtype: string
  - name: task
    dtype: string
  - name: model
    dtype: string
  - name: score
    dtype: float64
  - name: turn
    dtype: int64
  - name: tstamp
    dtype: float64
  - name: category
    dtype: string
  splits:
  - name: leaderboard
    num_bytes: 8856866
    num_examples: 60372
  download_size: 737444
  dataset_size: 8856866
configs:
- config_name: default
  data_files:
  - split: leaderboard
    path: data/leaderboard-*
arxiv: 2406.19314
---

# Dataset Card for "livebench/model_judgment"

LiveBench is a benchmark for LLMs designed with test set contamination and objective evaluation in mind. It has the following properties:

- LiveBench is designed to limit potential contamination by releasing new questions monthly, as well as having questions based on recently-released datasets, arXiv papers, news articles, and IMDb movie synopses.
- Each question has verifiable, objective ground-truth answers, allowing hard questions to be scored accurately and automatically, without the use of an LLM judge.
- LiveBench currently contains a set of 18 diverse tasks across 6 categories, and we will release new, harder tasks over time.

This dataset contains all model judgments (scores) currently used to create the [leaderboard](https://livebench.ai/). Our GitHub README contains instructions for downloading the model judgments (specifically see the section for download_leaderboard.py).

For more information, see our [paper](https://arxiv.org/abs/2406.19314).
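To make the schema above concrete, here is a minimal sketch of how rows with these fields could be grouped into per-model, per-category averages. It is an illustration only: the sample rows, model names, category names, and the helper `mean_score_by_model_and_category` are hypothetical, and the card does not say that the leaderboard is computed this way.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample rows that mirror the card's schema.
# All values below are invented for illustration, not taken from the dataset.
rows = [
    {"question_id": "q1", "task": "task_a", "model": "model-x",
     "score": 1.0, "turn": 1, "tstamp": 0.0, "category": "cat_1"},
    {"question_id": "q2", "task": "task_a", "model": "model-x",
     "score": 0.0, "turn": 1, "tstamp": 0.0, "category": "cat_1"},
    {"question_id": "q3", "task": "task_b", "model": "model-x",
     "score": 0.5, "turn": 1, "tstamp": 0.0, "category": "cat_2"},
    {"question_id": "q1", "task": "task_a", "model": "model-y",
     "score": 1.0, "turn": 1, "tstamp": 0.0, "category": "cat_1"},
    {"question_id": "q2", "task": "task_a", "model": "model-y",
     "score": 1.0, "turn": 1, "tstamp": 0.0, "category": "cat_1"},
]


def mean_score_by_model_and_category(rows):
    """Average the `score` field for each (model, category) pair.

    This is an assumed aggregation for illustration; the actual
    leaderboard computation is not described in the card.
    """
    buckets = defaultdict(list)
    for row in rows:
        buckets[(row["model"], row["category"])].append(row["score"])
    return {key: mean(scores) for key, scores in buckets.items()}


summary = mean_score_by_model_and_category(rows)
for (model, category), avg in sorted(summary.items()):
    print(f"{model:10s} {category:8s} {avg:.3f}")
```

With the Hugging Face `datasets` library installed, real rows could presumably be obtained with `load_dataset("livebench/model_judgment", split="leaderboard")`, using the dataset name and split given in the card, and passed to the same helper.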

Provider:

lthn
