TIGER-Lab/MetricInstruct
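MetricInstruct Dataset Overview
Dataset Description
The MetricInstruct dataset contains 44K quadruples of the form (instruction, input, system output, error analysis), covering 6 text generation tasks and 22 text generation datasets. It is used to fine-tune TIGERScore, a trained metric that follows instruction guidance to perform explainable, reference-free evaluation across a wide range of text generation tasks.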
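Dataset Construction Criteria
- Dataset diversity: 22 distinct datasets are chosen as source context so that enough generation tasks are covered.
- Error coverage: outputs from more than 50 text generation systems are used to cover all types of errors and keep their distribution balanced.
- Quality assurance: to make MetricInstruct suitable for in-depth error analysis, the analyses are obtained by prompting OpenAI GPT models and then filtered with several heuristics to remove low-quality error analyses.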
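Data Sources
System outputs come from two channels: real-world system outputs and synthetic outputs. The real-world outputs are taken from actual systems, which keeps the error distribution consistent with what occurs in practice. The synthetic outputs are produced by GPT-4, as listed in the table.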
| Task | Real-world datasets | Output source | Synthetic datasets | Output source |
|---|---|---|---|---|
| Summarization | SummEval, XSum, Newsroom, SAMSum | 27 systems | CNN/DM, XSum, Gigaword, SAMSum | GPT-4 |
| Translation | WMT | 18 systems | WMT | GPT-4 |
| Data-to-text | WebNLG-2020, WikiTableText, ToTTo | 17 systems | WikiTableText, DART, ToTTo | GPT-4 |
| Long-form QA | ASQA, FeTaQA, Cosmos QA, ELI5 | 5 systems | ASQA, FeTaQA, Cosmos QA, ELI5 | GPT-4 |
| Math QA | GSM8K | 5 systems | N/A | N/A |
| Instruction following | MixInstruct | 11 systems | AlpacaFarm, OASST1, Guanaco, Dolly | GPT-4 |
Data Format
Each of the 44K entries is a quadruple (instruction, input, system output, error analysis): the task instruction, the input source, the generated output, and an error analysis written by ChatGPT or GPT-4.
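As a rough illustration of the quadruple structure, one entry could be modeled as a small record type. This is only a sketch: the class name `MetricInstructRecord` and its field names are assumptions chosen for readability, not the dataset's actual column names, and the error analysis text is a placeholder.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class MetricInstructRecord:
    """One (instruction, input, system output, error analysis) quadruple.

    Field names are illustrative assumptions, not the dataset's column names.
    """
    instruction: str     # task instruction given to the generation system
    input_context: str   # input source the system had to work from
    system_output: str   # output produced by the system being evaluated
    error_analysis: str  # error analysis written by ChatGPT or GPT-4


record = MetricInstructRecord(
    instruction="Translate the following text from German to English.",
    input_context=(
        "Der künftige EM-Cheforganisator Philipp Lahm soll laut Grindel "
        "im DFB-Präsidium mitarbeiten."
    ),
    system_output=(
        "According to Grindel, the future head of the European Championships, "
        "Philipp Lahm, is to participate in the DFB Presidency."
    ),
    error_analysis="(placeholder: the error analysis text would go here)",
)

# Every entry carries exactly four parts.
print([f.name for f in fields(record)])
```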
Formatting
To format the data fields into a single prompt for fine-tuning or testing, the following code is provided for reference:

```python
from string import Template

# Assumed to be defined by the caller: instruction, input_context and
# hypo_output (strings), plus a loaded tigerscore_tokenizer and tigerscore_model.
FINETUNE_INST = "You are evaluating errors in a model-generated output for a given instruction."
FINETUNE_INPUT = """\
Instruction: ${generation_instruction}
${input_context}

Model-generated Output:
${hypothesis_output}

For each error you give in the response, please also elaborate the following information:
- error location (the words that are wrong in the output)
- error aspect it belongs to.
- explanation why its an error, and the correction suggestions.
- severity of the error ("Major" or "Minor").
- reduction of score (between 0.5 and 5 given the severity of the error)

Your evaluation output:
"""

# Fill in the two templates.
inst_part = Template(FINETUNE_INST)
inst_part = inst_part.substitute()
input_part = Template(FINETUNE_INPUT)
input_part = input_part.substitute(
    generation_instruction=instruction,
    input_context=input_context,
    hypothesis_output=hypo_output,
)

# Join the parts and normalize surrounding whitespace.
prompt = (inst_part + "\n" + input_part).strip("\n ") + "\n"

# Tokenize and move the tensors to the model's device.
encodings = tigerscore_tokenizer(prompt, return_tensors="pt")
input_ids = encodings["input_ids"].to(tigerscore_model.device)
attention_mask = encodings["attention_mask"].to(tigerscore_model.device)
```
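To inspect the prompt text without loading a tokenizer or model, the same template can be wrapped in a small function. This is a minimal sketch: the helper name `build_prompt` is hypothetical, the line breaks inside the template are an assumption about the original layout, and the toy inputs are made up for illustration.

```python
from string import Template

FINETUNE_INST = "You are evaluating errors in a model-generated output for a given instruction."
FINETUNE_INPUT = """\
Instruction: ${generation_instruction}
${input_context}

Model-generated Output:
${hypothesis_output}

For each error you give in the response, please also elaborate the following information:
- error location (the words that are wrong in the output)
- error aspect it belongs to.
- explanation why its an error, and the correction suggestions.
- severity of the error ("Major" or "Minor").
- reduction of score (between 0.5 and 5 given the severity of the error)

Your evaluation output:
"""


def build_prompt(instruction: str, input_context: str, hypo_output: str) -> str:
    """Fill the template and return the prompt as plain text (hypothetical helper)."""
    input_part = Template(FINETUNE_INPUT).substitute(
        generation_instruction=instruction,
        input_context=input_context,
        hypothesis_output=hypo_output,
    )
    # Join the fixed instruction and the filled template, then normalize
    # leading/trailing whitespace so the prompt ends with one newline.
    return (FINETUNE_INST + "\n" + input_part).strip("\n ") + "\n"


# Toy inputs, invented for illustration only.
prompt = build_prompt(
    "Summarize the text.",
    "The cat sat on the mat all day.",
    "A cat sat.",
)
print(prompt)
```

Keeping prompt construction separate from tokenization makes it easy to check the exact text the model will see before any tensors are created.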
Example Formatted Prompt

```txt
You are evaluating errors in a model-generated output for a given instruction.
Instruction: Translate the following text from German to English.
Der künftige EM-Cheforganisator Philipp Lahm soll laut Grindel im DFB-Präsidium mitarbeiten.

Model-generated Output:
According to Grindel, the future head of the European Championships, Philipp Lahm, is to participate in the DFB Presidency.

For each error you give in the response, please also elaborate the following information:
- error location (the words that are wrong in the output)
- error aspect it belongs to.
- explanation why its an error, and the correction suggestions.
- severity of the error ("Major" or "Minor").
- reduction of score (between 0.5 and 5 given the severity of the error)

Your evaluation output:
```
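The prompt asks for five pieces of information per error, with severity limited to "Major" or "Minor" and the score reduction limited to the range 0.5 to 5. A sketch of how such entries might be held and sanity-checked after parsing is shown here. The `ErrorItem` class, the `total_penalty` helper, the aspect names, and the sample values are all hypothetical, and summing the reductions into one penalty is an assumption rather than something this page specifies.

```python
from dataclasses import dataclass
from typing import Iterable

VALID_SEVERITIES = {"Major", "Minor"}


@dataclass
class ErrorItem:
    """One error entry with the five fields requested in the prompt."""
    location: str           # the words that are wrong in the output
    aspect: str             # error aspect the error belongs to
    explanation: str        # why it is an error, plus a suggested correction
    severity: str           # "Major" or "Minor"
    score_reduction: float  # between 0.5 and 5

    def is_valid(self) -> bool:
        # Check the two constraints that the prompt states explicitly.
        return (
            self.severity in VALID_SEVERITIES
            and 0.5 <= self.score_reduction <= 5.0
        )


def total_penalty(errors: Iterable[ErrorItem]) -> float:
    """Sum the reductions of well-formed entries (assumed aggregation)."""
    return sum(e.score_reduction for e in errors if e.is_valid())


# Made-up entries for illustration; not taken from the dataset.
errors = [
    ErrorItem("head of the European Championships", "Accuracy", "placeholder", "Major", 4.0),
    ErrorItem("DFB Presidency", "Terminology", "placeholder", "Minor", 1.0),
    ErrorItem("is to participate", "Fluency", "placeholder", "Severe", 2.0),  # invalid severity
]
print(total_penalty(errors))
```

Entries that break either constraint are ignored here; a stricter pipeline could reject the whole analysis instead.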



