TIGER-Lab/MetricInstruct

Name: TIGER-Lab/MetricInstruct
Creator: TIGER-Lab
Published: 2023-12-03 18:32:34
License: no description available

Source: Hugging Face. Updated 2023-12-03; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/TIGER-Lab/MetricInstruct

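Resource overview:

The MetricInstruct dataset contains 44K quadruples of the form (instruction, input, system output, error analysis), covering 6 text generation tasks and 22 text generation datasets. It is used to fine-tune TIGERScore, an instruction-guided, explainable, reference-free evaluation model for text generation tasks. The dataset was built according to three criteria: dataset diversity, error coverage, and quality assurance. System outputs come from both real-world systems and synthetic generation, so that the error distribution matches what occurs in practice. Example code for formatting the fields into a single prompt is also provided.

Provider: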

TIGER-Lab

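Summary of original information

MetricInstruct Dataset Overview

Dataset description

MetricInstruct contains 44K quadruples of the form (instruction, input, system output, error analysis), drawn from 6 text generation tasks and 22 text generation datasets. It is used to fine-tune TIGERScore, a trained metric that follows instruction guidance to perform explainable, reference-free evaluation over a wide range of text generation tasks.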

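Dataset construction criteria

Dataset diversity: 22 distinct datasets were chosen as source contexts to cover a sufficient range of generation tasks.
Error coverage: outputs from more than 50 text generation systems were used to cover all types of errors and keep their distribution balanced.
Quality assurance: to make sure MetricInstruct gathers in-depth error analyses, the analyses were obtained by prompting OpenAI GPT models and then filtered with several heuristics to remove low-quality ones.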
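The heuristics used for filtering are not listed here. As a rough illustration only, the sketch below checks a parsed error analysis against the constraints stated in the evaluation prompt (severity is "Major" or "Minor", score reduction is between 0.5 and 5). The dictionary keys and both function names are hypothetical, and these checks are not claimed to be the ones the authors applied.

```python
from typing import Dict, List

# Constraints taken from the evaluation prompt; everything else is an assumption.
VALID_SEVERITIES = {"Major", "Minor"}
REQUIRED_FIELDS = ("location", "aspect", "explanation", "severity", "score_reduction")


def is_well_formed(error: Dict[str, object]) -> bool:
    """Return True if one parsed error entry passes the illustrative checks."""
    # Every requested field must be present and non-empty.
    if any(error.get(field) in (None, "") for field in REQUIRED_FIELDS):
        return False
    if error["severity"] not in VALID_SEVERITIES:
        return False
    try:
        reduction = float(error["score_reduction"])
    except (TypeError, ValueError):
        return False
    # The prompt asks for a reduction between 0.5 and 5.
    return 0.5 <= reduction <= 5.0


def keep_analysis(errors: List[Dict[str, object]]) -> bool:
    """Keep an analysis only if all of its error entries are well formed.

    An empty list (no errors reported) is kept.
    """
    return all(is_well_formed(error) for error in errors)


# Placeholder entries, made up purely to exercise the checks.
good_entry = {
    "location": "example span",
    "aspect": "example aspect",
    "explanation": "example explanation",
    "severity": "Minor",
    "score_reduction": 1.0,
}
bad_entry = dict(good_entry, score_reduction=7.0)  # outside the 0.5 to 5 range
```

With these placeholders, `keep_analysis([good_entry])` accepts the analysis and `keep_analysis([good_entry, bad_entry])` rejects it.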

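Data sources

System outputs come from two channels: real-world system outputs and synthetic outputs. The real-world outputs are taken from actual systems, which keeps the error distribution aligned with what occurs in practice. The synthetic outputs were generated with GPT-4.

Task	Real-world datasets	Real-world output source	Synthetic datasets	Synthetic output source
Summarization	SummEval, XSum, Newsroom, SAMSum	27 systems	CNN/DM, XSum, Gigaword, SAMSum	GPT-4
Translation	WMT	18 systems	WMT	GPT-4
Data-to-text	WebNLG-2020, WikiTableText, ToTTo	17 systems	WikiTableText, Dart, ToTTo	GPT-4
Long-form QA	ASQA, FeTaQA, Cosmos QA, ELI5	5 systems	ASQA, FeTaQA, Cosmos QA, ELI5	GPT-4
Math QA	GSM8K	5 systems	N/A	N/A
Instruction following	MixInstruct	11 systems	AlpacaFarm, OASST1, Guanaco, Dolly	GPT-4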

Data format

Each of the 44K entries is a quadruple (instruction, input, system output, error analysis): the task instruction, the input source, the generated output, and an error analysis written by ChatGPT or GPT-4.
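To make the quadruple concrete, here is a minimal sketch of one entry as a Python dataclass. The class name and field names are illustrative assumptions and may not match the column names in the published dataset; the error analysis is left as a placeholder rather than invented.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class MetricInstructExample:
    """One (instruction, input, system output, error analysis) quadruple.

    Field names are hypothetical; check the dataset itself for the real columns.
    """

    instruction: str     # task instruction given to the generation system
    input: str           # source input the system was asked to process
    system_output: str   # text produced by the system under evaluation
    error_analysis: str  # analysis of the output written by ChatGPT or GPT-4


example = MetricInstructExample(
    instruction="Translate the following text from German to English.",
    input=(
        "Der künftige EM-Cheforganisator Philipp Lahm soll laut Grindel "
        "im DFB-Präsidium mitarbeiten."
    ),
    system_output=(
        "According to Grindel, the future head of the European Championships, "
        "Philipp Lahm, is to participate in the DFB Presidency."
    ),
    error_analysis="<error analysis text goes here>",  # placeholder, not real data
)

# Convert to a plain dict, e.g. for writing one JSON line per entry.
record = asdict(example)
```

A frozen dataclass is used so that an entry cannot be modified by accident once it has been loaded.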

Formatting

To format the data fields into a single prompt for fine-tuning or testing, the following code is provided for reference:

```python
from string import Template

# `instruction`, `input_context`, and `hypo_output` hold the fields of one example.
# `tigerscore_tokenizer` and `tigerscore_model` are assumed to be loaded already.

FINETUNE_INST = "You are evaluating errors in a model-generated output for a given instruction."
FINETUNE_INPUT = """\
Instruction: ${generation_instruction}
${input_context}

Model-generated Output:
${hypothesis_output}

For each error you give in the response, please also elaborate the following information:
- error location (the words that are wrong in the output)
- error aspect it belongs to.
- explanation why its an error, and the correction suggestions.
- severity of the error ("Major" or "Minor").
- reduction of score (between 0.5 and 5 given the severity of the error)

Your evaluation output:
"""

# Fill in the two template parts.
inst_part = Template(FINETUNE_INST)
inst_part = inst_part.substitute()
input_part = Template(FINETUNE_INPUT)
input_part = input_part.substitute(
    generation_instruction=instruction,
    input_context=input_context,
    hypothesis_output=hypo_output,
)

# Join the parts, trim surrounding whitespace, and end with a single newline.
prompt = (inst_part + "\n" + input_part).strip("\n ") + "\n"

# Tokenize and move the tensors to the model's device.
encodings = tigerscore_tokenizer(prompt, return_tensors="pt")
input_ids = encodings["input_ids"].to(tigerscore_model.device)
attention_mask = encodings["attention_mask"].to(tigerscore_model.device)
```

Example formatted prompt

```txt
You are evaluating errors in a model-generated output for a given instruction.
Instruction: Translate the following text from German to English.
Der künftige EM-Cheforganisator Philipp Lahm soll laut Grindel im DFB-Präsidium mitarbeiten.

Model-generated Output:
According to Grindel, the future head of the European Championships, Philipp Lahm, is to participate in the DFB Presidency.

For each error you give in the response, please also elaborate the following information:
- error location (the words that are wrong in the output)
- error aspect it belongs to.
- explanation why its an error, and the correction suggestions.
- severity of the error ("Major" or "Minor").
- reduction of score (between 0.5 and 5 given the severity of the error)

Your evaluation output:
```
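The prompt asks for a severity and a score reduction per error, which suggests the errors can be combined into a single number. The sketch below assumes the model's response has already been parsed into a list of dictionaries (the keys are hypothetical) and sums the reductions, negating the total so that an error-free output scores 0. This aggregation is an assumption made for illustration, not something specified above.

```python
from typing import Dict, List


def overall_score(errors: List[Dict[str, object]]) -> float:
    """Sum the per-error reductions and negate the total (assumed aggregation)."""
    total = sum(float(error["score_reduction"]) for error in errors)
    # Return a plain 0.0 when no reduction was reported.
    return -total if total else 0.0


def count_by_severity(errors: List[Dict[str, object]]) -> Dict[str, int]:
    """Count how many errors were labeled "Major" and "Minor"."""
    counts = {"Major": 0, "Minor": 0}
    for error in errors:
        severity = str(error["severity"])
        counts[severity] = counts.get(severity, 0) + 1
    return counts


# Placeholder errors, made up purely to show the arithmetic.
parsed_errors = [
    {"severity": "Major", "score_reduction": 4.0},
    {"severity": "Minor", "score_reduction": 0.5},
]
score = overall_score(parsed_errors)             # -(4.0 + 0.5)
severity_counts = count_by_severity(parsed_errors)
```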
