
lthn/livebench-model_judgment

Source: Hugging Face | Updated: 2026-04-10 | Indexed: 2026-04-12
Download link:
https://hf-mirror.com/datasets/lthn/livebench-model_judgment
Resource description:
---
dataset_info:
  features:
  - name: question_id
    dtype: string
  - name: task
    dtype: string
  - name: model
    dtype: string
  - name: score
    dtype: float64
  - name: turn
    dtype: int64
  - name: tstamp
    dtype: float64
  - name: category
    dtype: string
  splits:
  - name: leaderboard
    num_bytes: 8856866
    num_examples: 60372
  download_size: 737444
  dataset_size: 8856866
configs:
- config_name: default
  data_files:
  - split: leaderboard
    path: data/leaderboard-*
arxiv: 2406.19314
---

# Dataset Card for "livebench/model_judgment"

LiveBench is a benchmark for LLMs designed with test set contamination and objective evaluation in mind. It has the following properties:

- LiveBench is designed to limit potential contamination by releasing new questions monthly, as well as having questions based on recently-released datasets, arXiv papers, news articles, and IMDb movie synopses.
- Each question has verifiable, objective ground-truth answers, allowing hard questions to be scored accurately and automatically, without the use of an LLM judge.
- LiveBench currently contains a set of 18 diverse tasks across 6 categories, and we will release new, harder tasks over time.

This dataset contains all model judgments (scores) currently used to create the [leaderboard](https://livebench.ai/). Our github readme contains instructions for downloading the model judgments (specifically see the section for download_leaderboard.py).

For more information, see our [paper](https://arxiv.org/abs/2406.19314).
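The schema in the card (one row per `question_id`, `model`, `task`, and `category`, with a float `score`) lends itself to simple per-model aggregation. The following is a minimal sketch of one way to summarize such rows, using only the Python standard library. The sample rows, the model and category names, and the helper functions `category_means` and `overall_means` are hypothetical and exist only for illustration. Averaging per category first and then across categories is an assumption here; this page does not confirm that it is how the leaderboard itself is computed.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample rows that follow the feature schema in the dataset card.
# Real rows would come from the "leaderboard" split; these values are made up.
rows = [
    {"question_id": "q1", "task": "task_a", "model": "model-x",
     "score": 1.0, "turn": 1, "tstamp": 0.0, "category": "cat_1"},
    {"question_id": "q2", "task": "task_a", "model": "model-x",
     "score": 0.0, "turn": 1, "tstamp": 0.0, "category": "cat_1"},
    {"question_id": "q3", "task": "task_b", "model": "model-x",
     "score": 1.0, "turn": 1, "tstamp": 0.0, "category": "cat_2"},
    {"question_id": "q1", "task": "task_a", "model": "model-y",
     "score": 0.5, "turn": 1, "tstamp": 0.0, "category": "cat_1"},
]


def category_means(rows):
    """Return {(model, category): mean score} over all matching rows."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[(row["model"], row["category"])].append(row["score"])
    return {key: mean(scores) for key, scores in buckets.items()}


def overall_means(rows):
    """Return {model: mean of that model's per-category means}.

    Assumption: categories are weighted equally, regardless of how many
    questions each one contains.
    """
    by_model = defaultdict(list)
    for (model, _category), value in category_means(rows).items():
        by_model[model].append(value)
    return {model: mean(values) for model, values in by_model.items()}


if __name__ == "__main__":
    for model, value in sorted(overall_means(rows).items()):
        print(f"{model}: {value:.3f}")
```

Assuming the Hugging Face `datasets` library is installed and the repository is reachable, rows of this shape could be obtained with `load_dataset("lthn/livebench-model_judgment", split="leaderboard")` and passed to the same functions, since each record is a mapping with the column names listed in the card.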
Provider:
lthn