livebench/model_answer
Source: Hugging Face. Updated 2024-10-22; indexed 2024-06-29.
Download link:
https://hf-mirror.com/datasets/livebench/model_answer
Overview:
LiveBench is a benchmark for large language models (LLMs) designed with test set contamination and objective evaluation in mind. It has the following properties:
- LiveBench limits potential contamination by releasing new questions monthly and by basing questions on recently released datasets, arXiv papers, news articles, and IMDb movie synopses.
- Each question has a verifiable, objective ground-truth answer, so hard questions can be scored accurately and automatically without an LLM judge.
- LiveBench currently contains 18 diverse tasks across 6 categories, and new, harder tasks will be released over time.
This dataset contains all model answers currently used to create the leaderboard.
Provided by:
livebench
Summary of original information
Dataset overview
Dataset information
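- Features:
  - question_id: question ID, a string.
  - answer_id: answer ID, a string.
  - model_id: model ID, a string.
  - choices: list of choices, each with the following sub-features:
    - index: index, an integer.
    - turns: sequence of strings.
  - tstamp: timestamp, a float.
  - category: category, a string.
  - task: task, a string.
- Splits:
  - leaderboard: leaderboard data, 51,870 examples, 86,726,046 bytes in total.
- Download size: 31,126,528 bytes
- Dataset size: 86,726,046 bytes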
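As a minimal sketch of how the feature list above could be used, the following Python snippet checks that a single record has the listed fields with the listed types. The names `EXPECTED_FIELDS` and `validate_record` are hypothetical, and the snippet assumes each record arrives as a plain dict in which `choices` is a list of dicts; the actual in-memory layout may differ depending on the loader.

```python
from typing import Any

# Top-level fields and their expected Python types, taken from the feature list.
# (Hypothetical helper; not part of the dataset or any library.)
EXPECTED_FIELDS = {
    "question_id": str,
    "answer_id": str,
    "model_id": str,
    "choices": list,
    "tstamp": float,
    "category": str,
    "task": str,
}


def validate_record(record: dict[str, Any]) -> list[str]:
    """Return a list of schema problems found in one record (empty if none)."""
    problems = []
    for name, expected in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            actual = type(record[name]).__name__
            problems.append(f"{name}: expected {expected.__name__}, got {actual}")

    # Assumption: each choice is a dict with an integer index and a list of string turns.
    choices = record.get("choices")
    if isinstance(choices, list):
        for i, choice in enumerate(choices):
            if not isinstance(choice, dict):
                problems.append(f"choices[{i}]: expected dict")
                continue
            if not isinstance(choice.get("index"), int):
                problems.append(f"choices[{i}].index: expected int")
            turns = choice.get("turns")
            if not (isinstance(turns, list) and all(isinstance(t, str) for t in turns)):
                problems.append(f"choices[{i}].turns: expected list of str")
    return problems
```

Calling `validate_record` on a well-formed record returns an empty list; each missing or mistyped field adds one message, which makes it easy to spot rows that do not match the documented schema.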
Configuration
- Default configuration:
  - data_files:
    - split: leaderboard
    - path: data/leaderboard-*
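The default configuration exposes a single `leaderboard` split, so one plausible way to read the data is through the Hugging Face `datasets` library. The sketch below assumes `datasets` is installed and that the repository is reachable from the Hugging Face Hub under the ID `livebench/model_answer`; the helper names `load_leaderboard` and `answers_per_model` are hypothetical.

```python
from collections import Counter

REPO_ID = "livebench/model_answer"  # dataset ID as given in the page title
SPLIT = "leaderboard"               # the only split listed in the default configuration


def load_leaderboard():
    """Load the leaderboard split (requires network access and `datasets`)."""
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split=SPLIT)


def answers_per_model(rows):
    """Count how many answer records each model_id contributes."""
    return dict(Counter(row["model_id"] for row in rows))
```

A typical use would be `answers_per_model(load_leaderboard())`, which gives a quick view of how the 51,870 records are distributed across models. `answers_per_model` accepts any iterable of dict-like rows, so it can also be tried on a small hand-written sample.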
Related literature
- arXiv: 2406.19314



