ContextEval
ModelScope community. Updated 2025-08-15; indexed 2025-05-31.
Download link:
https://modelscope.cn/datasets/allenai/ContextEval
Contextualized Evaluations: Taking the Guesswork Out of Language Model Evaluations
## Dataset Description
- **Repository:** https://github.com/allenai/ContextEval
- **Paper:** https://arxiv.org/abs/2411.07237
- **Point of Contact:** chaitanyamalaviya@gmail.com
### Dataset Summary
We provide here the data accompanying the paper: [*Contextualized Evaluations*: Taking the Guesswork Out of Language Model Evaluations](https://arxiv.org/abs/2411.07237).
## Dataset Structure
### Data Instances
We release the set of queries, as well as the autorater & human evaluation judgements collected for our experiments.
### Data overview
### List of queries: Data Structure
The list of queries used in our experiments is provided as a jsonlines file where each line contains the following fields:
* `query`: Query sampled from an existing dataset.
* `source`: Name of the dataset (HuggingFace identifier) from which the query is sampled.
* `example_id`: Unique ID given to the example.
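
As an illustration of the layout above, here is a minimal sketch of reading such a file with the Python standard library. The two sample records are made up for the example (they are not rows from the dataset), and the string-typed `example_id` values are an assumption; the card does not state the type.

```python
import json
from collections import Counter
from typing import Iterable


def read_jsonl(lines: Iterable[str]) -> list[dict]:
    """Parse JSON Lines input: one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


# Made-up records that follow the documented fields (not real dataset rows).
sample_lines = [
    '{"query": "What is a good laptop?", "source": "hypothetical/dataset_a", "example_id": "ex-1"}',
    '{"query": "How do I learn piano?", "source": "hypothetical/dataset_b", "example_id": "ex-2"}',
]

queries = read_jsonl(sample_lines)

# Count how many queries come from each source dataset.
queries_per_source = Counter(record["source"] for record in queries)

# Index queries by example_id so other records can be matched to them.
by_id = {record["example_id"]: record for record in queries}
```

Because an open file handle is also an iterable of lines, the same function could be called as `read_jsonl(f)` inside `with open(path, encoding="utf-8") as f:`, where `path` points at wherever the queries file was downloaded.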
### Autorater Judgements: Data Structure
The autorater judgements are provided as a jsonlines file where each line contains the following fields:
* `query`: Query sampled from an existing dataset.
* `candidate_one`: Name of model one.
* `candidate_two`: Name of model two.
* `candidate_one_response`: Response from candidate one.
* `candidate_two_response`: Response from candidate two.
* `rand_choice`: Integer indicating order of responses (1 if response 1 comes from candidate 1 and 2 if response 1 comes from candidate 2).
* `eval_judgement`: Evaluation judgement, formatted as **output: {"judgement": EVAL_JUDGEMENT}**, where `EVAL_JUDGEMENT` is one of `Response 1`, `Response 2` or `Tie`, followed by a free-text justification.
* `context`: Context for the query formatted as follow-up QA pairs.
* `setting`: Setting for this instance (one of `gen_wo_ctx_eval_wo_ctx`, `gen_wo_ctx_eval_w_ctx` or `gen_w_ctx_eval_w_ctx`).
* `eval_model`: Model used for generating evaluation judgement.
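
Since `eval_judgement` names a response position while `rand_choice` records which candidate sits in that position, the two fields have to be combined to recover which candidate the autorater preferred. The following is a rough sketch of that mapping. The regular expression is an assumption about the exact string layout (it tolerates the label being quoted or unquoted), and the return labels `"candidate_one"`, `"candidate_two"` and `"tie"` are names chosen for this example only.

```python
import re
from typing import Optional

# Assumed layout: ... {"judgement": "Response 1"} ... free-text justification.
# The optional quotes cover both a quoted and an unquoted label.
_JUDGEMENT_RE = re.compile(r'"judgement"\s*:\s*"?(Response 1|Response 2|Tie)"?')


def parse_judgement(eval_judgement: str) -> Optional[str]:
    """Extract 'Response 1', 'Response 2' or 'Tie'; None if nothing matches."""
    match = _JUDGEMENT_RE.search(eval_judgement)
    return match.group(1) if match else None


def preferred_candidate(record: dict) -> Optional[str]:
    """Map the position-based judgement back to a candidate using rand_choice."""
    label = parse_judgement(record["eval_judgement"])
    if label is None:
        return None
    if label == "Tie":
        return "tie"
    # rand_choice == 1: response 1 comes from candidate 1.
    # rand_choice == 2: response 1 comes from candidate 2.
    response_one_is_candidate_one = record["rand_choice"] == 1
    picked_response_one = label == "Response 1"
    if picked_response_one == response_one_is_candidate_one:
        return "candidate_one"
    return "candidate_two"


# Made-up record for illustration (not a real dataset row).
example = {
    "rand_choice": 2,
    "eval_judgement": 'output: {"judgement": "Response 1"} Response 1 is more specific.',
}
winner = preferred_candidate(example)
```

Returning `None` for unparseable strings, rather than raising, lets a caller count and inspect malformed judgements separately from ties.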
### Human Judgements: Data Structure
The human judgements are provided as a jsonlines file where each line contains the following fields:
`['query', 'response1', 'response2', 'model_1', 'model_2', 'example_id', 'time_spent', 'overall_preference', 'justification', 'follow_up_qas', 'mode', 'setting']`
* `query`: Query sampled from an existing dataset.
* `response1`: Response from candidate one.
* `response2`: Response from candidate two.
* `model_1`: Name of model one.
* `model_2`: Name of model two.
* `example_id`: Unique ID for example.
* `time_spent`: Time spent for providing evaluation judgement.
* `overall_preference`: Overall preference judgement (one of `Response 1`, `Response 2` or `Tie`).
* `justification`: Free-text justification provided by annotator.
* `follow_up_qas`: List of QAs, where each element corresponds to a question-answer pair (`qa`), and whether response 1 and response 2 satisfy this QA pair (`satisfied_1` and `satisfied_2`).
* `mode`: Mode for evaluation (always `pairwise`).
* `setting`: Setting for this instance (one of `gen_wo_ctx_eval_wo_ctx`, `gen_wo_ctx_eval_w_ctx` or `gen_w_ctx_eval_w_ctx`).
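
To show how these fields might be used together, the sketch below tallies `overall_preference` per `setting` and computes the share of follow-up QA pairs that a response satisfies. It assumes `satisfied_1` and `satisfied_2` are stored as booleans, which the card does not confirm; if they are strings, the truthiness check would need adjusting. The sample records are made up and show only the fields the sketch reads.

```python
from collections import Counter, defaultdict
from typing import Optional


def preference_counts_by_setting(records: list[dict]) -> dict:
    """Tally overall_preference labels separately for each setting."""
    counts = defaultdict(Counter)
    for record in records:
        counts[record["setting"]][record["overall_preference"]] += 1
    return counts


def satisfaction_rate(record: dict, which: int) -> Optional[float]:
    """Fraction of follow-up QA pairs satisfied by response `which` (1 or 2).

    Assumes satisfied_1 / satisfied_2 are booleans (an assumption, see above).
    Returns None when the record has no follow-up QAs.
    """
    qas = record["follow_up_qas"]
    if not qas:
        return None
    key = f"satisfied_{which}"
    return sum(1 for qa in qas if qa[key]) / len(qas)


# Made-up records for illustration (not real dataset rows).
human_records = [
    {
        "setting": "gen_wo_ctx_eval_w_ctx",
        "overall_preference": "Response 1",
        "follow_up_qas": [
            {"qa": "Q: Budget? A: Under $1000", "satisfied_1": True, "satisfied_2": False},
            {"qa": "Q: Use case? A: Travel", "satisfied_1": False, "satisfied_2": False},
        ],
    },
    {
        "setting": "gen_wo_ctx_eval_wo_ctx",
        "overall_preference": "Tie",
        "follow_up_qas": [],
    },
]

counts = preference_counts_by_setting(human_records)
rate_response_1 = satisfaction_rate(human_records[0], 1)
```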
## Citation Information
```
@inproceedings{malaviya2024contexteval,
author = {Malaviya, Chaitanya and Chee Chang, Joseph and Roth, Dan and Iyyer, Mohit and Yatskar, Mark and Lo, Kyle},
title = {Contextualized Evaluations: Taking the Guesswork Out of Language Model Evaluations},
journal = {arXiv preprint arXiv:2411.07237},
month = {November},
year = {2024},
url = "https://arxiv.org/abs/2411.07237"
}
```
Provider: maas
Created: 2025-05-27



