Evaluation Dataset for Auto-J
Source: arXiv. Indexed: 2025-09-30
Download link:
https://github.com/GAIR-NLP/auto-j
Description:
This dataset contains evaluation judgments on model-generated responses to real user queries, and is intended for assessing the alignment of large language models. The data is organized into 58 scenarios and is guided by 332 specific evaluation criteria. It comprises 3,436 pairwise training samples and 960 single-response samples.
Provided by:
Authors of the paper



