ContextEval

Name: ContextEval
Creator: maas
Published: 2025-08-15 16:33:02
License: No description provided

ModelScope Community; updated 2025-08-15; indexed 2025-05-31

Download link:

https://modelscope.cn/datasets/allenai/ContextEval

Resource description:

# Contextualized Evaluations: Taking the Guesswork Out of Language Model Evaluations

## Dataset Description

- **Repository:** https://github.com/allenai/ContextEval
- **Paper:** https://arxiv.org/abs/2411.07237
- **Point of Contact:** chaitanyamalaviya@gmail.com

### Dataset Summary

We provide here the data accompanying the paper: [*Contextualized Evaluations*: Taking the Guesswork Out of Language Model Evaluations](https://arxiv.org/abs/2411.07237).

## Dataset Structure

### Data Instances

We release the set of queries, as well as the autorater and human evaluation judgements collected for our experiments.

### Data overview

### List of queries: Data Structure

The list of queries used in our experiments is provided as a jsonlines file where each line contains the following fields:

* `query`: Query sampled from an existing dataset.
* `source`: Name of the dataset (HuggingFace identifier) from which the query is sampled.
* `example_id`: Unique ID given to the example.

### Autorater Judgements: Data Structure

The autorater judgements are provided as a jsonlines file where each line contains the following fields:

* `query`: Query sampled from an existing dataset.
* `candidate_one`: Name of model one.
* `candidate_two`: Name of model two.
* `candidate_one_response`: Response from candidate one.
* `candidate_two_response`: Response from candidate two.
* `rand_choice`: Integer indicating the order of responses (1 if response 1 comes from candidate 1, 2 if response 1 comes from candidate 2).
* `eval_judgement`: Eval judgement formatted as **output: {"judgement": EVAL_JUDGEMENT}**, where `EVAL_JUDGEMENT` is one of `Response 1`, `Response 2` or `Tie`, followed by a free-text justification.
* `context`: Context for the query, formatted as follow-up QA pairs.
* `setting`: Setting for this instance (one of `gen_wo_ctx_eval_wo_ctx`, `gen_wo_ctx_eval_w_ctx` or `gen_w_ctx_eval_w_ctx`).
* `eval_model`: Model used for generating the evaluation judgement.
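
The `eval_judgement` and `rand_choice` fields described above have to be combined to find out which candidate the autorater actually preferred. The following is a minimal sketch of that step, not code from the ContextEval repository. It assumes the judgement string contains the literal prefix `output:` followed by a flat JSON object; the helper names `parse_judgement` and `preferred_candidate` are hypothetical.

```python
import json
import re

# Assumption: the verdict appears as `output: {"judgement": ...}` and the
# free-text justification follows it. The regex grabs only the first {...}.
_JUDGEMENT_RE = re.compile(r"output:\s*(\{.*?\})", re.DOTALL)
_VALID_VERDICTS = {"Response 1", "Response 2", "Tie"}


def parse_judgement(eval_judgement):
    """Return 'Response 1', 'Response 2' or 'Tie', or None if unparseable."""
    match = _JUDGEMENT_RE.search(eval_judgement)
    if match is None:
        return None
    try:
        payload = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    verdict = payload.get("judgement")
    return verdict if verdict in _VALID_VERDICTS else None


def preferred_candidate(verdict, rand_choice):
    """Map a verdict on the displayed responses to candidate 1 or 2 (0 = tie)."""
    if verdict not in _VALID_VERDICTS:
        raise ValueError(f"unexpected verdict: {verdict!r}")
    if verdict == "Tie":
        return 0
    picked_first = verdict == "Response 1"
    if rand_choice == 1:
        # Response 1 was produced by candidate 1.
        return 1 if picked_first else 2
    # rand_choice == 2: response 1 was produced by candidate 2.
    return 2 if picked_first else 1


# Hypothetical judgement string, for illustration only.
sample = 'output: {"judgement": "Response 2"}\nResponse 2 fits the context better.'
verdict = parse_judgement(sample)
winner = preferred_candidate(verdict, rand_choice=2)
```

Undoing the randomized order this way lets win rates be reported per model rather than per display position. Records whose judgement cannot be parsed come back as `None` and are probably best counted separately rather than treated as ties.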

### Human Judgements: Data Structure

The human judgements are provided as a jsonlines file where each line contains the following fields: `['query', 'response1', 'response2', 'model_1', 'model_2', 'example_id', 'time_spent', 'overall_preference', 'justification', 'follow_up_qas', 'mode', 'setting']`

* `query`: Query sampled from an existing dataset.
* `response1`: Response from candidate one.
* `response2`: Response from candidate two.
* `model_1`: Name of model one.
* `model_2`: Name of model two.
* `example_id`: Unique ID for the example.
* `time_spent`: Time spent providing the evaluation judgement.
* `overall_preference`: Overall preference judgement (one of `Response 1`, `Response 2` or `Tie`).
* `justification`: Free-text justification provided by the annotator.
* `follow_up_qas`: List of QAs, where each element corresponds to a question-answer pair (`qa`) and records whether response 1 and response 2 satisfy this QA pair (`satisfied_1` and `satisfied_2`).
* `mode`: Mode for evaluation (always `pairwise`).
* `setting`: Setting for this instance (one of `gen_wo_ctx_eval_wo_ctx`, `gen_wo_ctx_eval_w_ctx` or `gen_w_ctx_eval_w_ctx`).

## Citation Information

```
@inproceedings{malaviya2024contexteval,
  author = {Malaviya, Chaitanya and Chee Chang, Joseph and Roth, Dan and Iyyer, Mohit and Yatskar, Mark and Lo, Kyle},
  title = {Contextualized Evaluations: Taking the Guesswork Out of Language Model Evaluations},
  journal = {arXiv preprint arXiv:2411.07237},
  month = {November},
  year = {2024},
  url = "https://arxiv.org/abs/2411.07237"
}
```
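
For the human judgement records described above, a common first step is to count `overall_preference` values within each `setting`. The sketch below shows one way to do that over jsonlines input. It is an illustration under stated assumptions, not the authors' analysis code: the function name `tally_preferences` is hypothetical, and the sample records are invented and carry only the two fields being counted.

```python
import json
from collections import Counter, defaultdict


def tally_preferences(lines):
    """Count overall_preference values per setting from jsonlines text lines."""
    counts = defaultdict(Counter)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        counts[record["setting"]][record["overall_preference"]] += 1
    return counts


# Invented records for illustration; real records carry all documented fields.
sample_lines = [
    json.dumps({"setting": "gen_wo_ctx_eval_wo_ctx", "overall_preference": "Tie"}),
    json.dumps({"setting": "gen_w_ctx_eval_w_ctx", "overall_preference": "Response 1"}),
    json.dumps({"setting": "gen_w_ctx_eval_w_ctx", "overall_preference": "Response 1"}),
    json.dumps({"setting": "gen_w_ctx_eval_w_ctx", "overall_preference": "Response 2"}),
]

preference_counts = tally_preferences(sample_lines)
for setting, counter in sorted(preference_counts.items()):
    print(setting, dict(counter))
```

Because the function accepts any iterable of lines, an open file handle for the downloaded human judgements file can be passed in directly in place of `sample_lines`.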

Provider:

maas

Created:

2025-05-27