
ContextEval

ModelScope Community. Updated 2025-08-15. Indexed 2025-05-31.
Download link:
https://modelscope.cn/datasets/allenai/ContextEval
Description:
# Contextualized Evaluations: Taking the Guesswork Out of Language Model Evaluations

## Dataset Description

- **Repository:** https://github.com/allenai/ContextEval
- **Paper:** https://arxiv.org/abs/2411.07237
- **Point of Contact:** chaitanyamalaviya@gmail.com

### Dataset Summary

We provide here the data accompanying the paper: [*Contextualized Evaluations*: Taking the Guesswork Out of Language Model Evaluations](https://arxiv.org/abs/2411.07237).

## Dataset Structure

### Data Instances

We release the set of queries, as well as the autorater and human evaluation judgements collected for our experiments.

### Data overview

### List of queries: Data Structure

The list of queries used in our experiments is provided as a jsonlines file where each line contains the following fields:

* `query`: Query sampled from an existing dataset.
* `source`: Name of the dataset (HuggingFace identifier) from which the query is sampled.
* `example_id`: Unique ID given to the example.

### Autorater Judgements: Data Structure

The autorater judgements are provided as a jsonlines file where each line contains the following fields:

* `query`: Query sampled from an existing dataset.
* `candidate_one`: Name of model one.
* `candidate_two`: Name of model two.
* `candidate_one_response`: Response from candidate one.
* `candidate_two_response`: Response from candidate two.
* `rand_choice`: Integer indicating the order of responses (1 if response 1 comes from candidate 1, and 2 if response 1 comes from candidate 2).
* `eval_judgement`: Evaluation judgement formatted as **output: {"judgement": EVAL_JUDGEMENT}**, where `EVAL_JUDGEMENT` is one of `Response 1`, `Response 2` or `Tie`, followed by a free-text justification.
* `context`: Context for the query, formatted as follow-up QA pairs.
* `setting`: Setting for this instance (one of `gen_wo_ctx_eval_wo_ctx`, `gen_wo_ctx_eval_w_ctx` or `gen_w_ctx_eval_w_ctx`).
* `eval_model`: Model used for generating the evaluation judgement.
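Because `rand_choice` shuffles which candidate appears as "Response 1", an autorater verdict has to be mapped back to a candidate before results can be aggregated per model. The following is a minimal sketch of that step, not the authors' code. The helper names (`read_jsonl`, `parse_verdict`, `winning_candidate`), the returned labels, and the sample strings are illustrative assumptions; the regular expression also assumes the verdict may appear with or without quotes inside the `eval_judgement` string.

```python
import json
import re
from typing import Optional

# Assumption: the verdict appears after the "judgement" key, quoted or unquoted.
_VERDICT_RE = re.compile(r'"judgement"\s*:\s*"?(Response 1|Response 2|Tie)"?')


def read_jsonl(path: str) -> list:
    """Read a jsonlines file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def parse_verdict(eval_judgement: str) -> Optional[str]:
    """Extract 'Response 1', 'Response 2' or 'Tie'; return None if absent."""
    match = _VERDICT_RE.search(eval_judgement)
    return match.group(1) if match else None


def winning_candidate(verdict: str, rand_choice: int) -> str:
    """Undo the response shuffling described by `rand_choice`.

    rand_choice == 1: response 1 comes from candidate one.
    rand_choice == 2: response 1 comes from candidate two.
    The returned strings are labels chosen for this sketch.
    """
    if verdict == "Tie":
        return "tie"
    first_is_candidate_one = int(rand_choice) == 1
    if verdict == "Response 1":
        return "candidate_one" if first_is_candidate_one else "candidate_two"
    return "candidate_two" if first_is_candidate_one else "candidate_one"


# Made-up record for illustration only; it is not taken from the dataset.
sample = {
    "rand_choice": 2,
    "eval_judgement": 'output: {"judgement": "Response 1"} Illustrative justification.',
}
verdict = parse_verdict(sample["eval_judgement"])
print(verdict, winning_candidate(verdict, sample["rand_choice"]))
```

With a real file, `read_jsonl` would be called on the local path of the autorater judgements file, and the two helpers applied to each record's `eval_judgement` and `rand_choice` fields.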
### Human Judgements: Data Structure

The human judgements are provided as a jsonlines file where each line contains the following fields:

`['query', 'response1', 'response2', 'model_1', 'model_2', 'example_id', 'time_spent', 'overall_preference', 'justification', 'follow_up_qas', 'mode', 'setting']`

* `query`: Query sampled from an existing dataset.
* `response1`: Response from candidate one.
* `response2`: Response from candidate two.
* `model_1`: Name of model one.
* `model_2`: Name of model two.
* `example_id`: Unique ID for example.
* `time_spent`: Time spent for providing evaluation judgement.
* `overall_preference`: Overall preference judgement (one of `Response 1`, `Response 2` or `Tie`).
* `justification`: Free-text justification provided by annotator.
* `follow_up_qas`: List of QAs, where each element corresponds to a question-answer pair (`qa`), and whether response 1 and response 2 satisfy this QA pair (`satisfied_1` and `satisfied_2`).
* `mode`: Mode for evaluation (always `pairwise`).
* `setting`: Setting for this instance (one of `gen_wo_ctx_eval_wo_ctx`, `gen_wo_ctx_eval_w_ctx` or `gen_w_ctx_eval_w_ctx`).

## Citation Information

```
@inproceedings{malaviya2024contexteval,
  author = {Malaviya, Chaitanya and Chee Chang, Joseph and Roth, Dan and Iyyer, Mohit and Yatskar, Mark and Lo, Kyle},
  title = {Contextualized Evaluations: Taking the Guesswork Out of Language Model Evaluations},
  journal = {arXiv preprint arXiv:2411.07237},
  month = {November},
  year = {2024},
  url = "https://arxiv.org/abs/2411.07237"
}
```
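Returning to the human judgement fields listed earlier, one way to summarize them is to tally `overall_preference` per `setting` and to compute, for each record, the share of follow-up QA pairs that a response satisfies. This is a rough sketch under stated assumptions rather than the authors' analysis code: the function names are hypothetical, the sample records are made up, and `satisfied_1` / `satisfied_2` are assumed to hold boolean-like values, which the field descriptions do not confirm.

```python
from collections import Counter, defaultdict
from typing import Optional


def preference_counts_by_setting(records: list) -> dict:
    """Count each `overall_preference` value separately for every `setting`."""
    counts = defaultdict(Counter)
    for record in records:
        counts[record["setting"]][record["overall_preference"]] += 1
    return counts


def satisfied_fraction(record: dict, which: int) -> Optional[float]:
    """Share of follow-up QA pairs satisfied by response `which` (1 or 2).

    Assumption: `satisfied_1` / `satisfied_2` are boolean-like values.
    Returns None when the record has no follow-up QA pairs.
    """
    qas = record.get("follow_up_qas") or []
    if not qas:
        return None
    key = f"satisfied_{which}"
    return sum(bool(item.get(key)) for item in qas) / len(qas)


# Made-up records for illustration only; they are not taken from the dataset.
records = [
    {"setting": "gen_wo_ctx_eval_wo_ctx", "overall_preference": "Tie",
     "follow_up_qas": []},
    {"setting": "gen_wo_ctx_eval_w_ctx", "overall_preference": "Response 1",
     "follow_up_qas": [
         {"qa": "placeholder QA pair A", "satisfied_1": True, "satisfied_2": False},
         {"qa": "placeholder QA pair B", "satisfied_1": True, "satisfied_2": True},
     ]},
    {"setting": "gen_wo_ctx_eval_w_ctx", "overall_preference": "Response 2",
     "follow_up_qas": [
         {"qa": "placeholder QA pair C", "satisfied_1": False, "satisfied_2": True},
     ]},
]

counts = preference_counts_by_setting(records)
for setting, tally in counts.items():
    print(setting, dict(tally))
print(satisfied_fraction(records[1], 1), satisfied_fraction(records[1], 2))
```

If the satisfaction flags turn out to be stored as strings rather than booleans, the `bool(...)` conversion would need to be replaced with an explicit comparison.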

Provided by:
maas
Created:
2025-05-27