MetricInstruct

Name: MetricInstruct
Creator: maas
Published: 2025-12-05 16:15:23
License: No description available

ModelScope community, updated 2025-12-05, indexed 2024-06-01

Download link:

https://modelscope.cn/datasets/TIGER-Lab/MetricInstruct


Description:

## MetricInstruct

The MetricInstruct dataset consists of 44K quadruples in the form of (instruction, input, system output, error analysis), covering 6 text generation tasks and 22 text generation datasets. The dataset is used to fine-tune [TIGERScore](https://huggingface.co/TIGER-Lab/TIGERScore-7B-V1.2), a **T**rained metric that follows **I**nstruction **G**uidance to perform **E**xplainable and **R**eference-free evaluation over a wide spectrum of text generation tasks.

[Project Page](https://tiger-ai-lab.github.io/TIGERScore/) | [Paper](https://arxiv.org/abs/2310.00752) | [Code](https://github.com/TIGER-AI-Lab/TIGERScore) | [Demo](https://huggingface.co/spaces/TIGER-Lab/TIGERScore) | [TIGERScore-7B](https://huggingface.co/TIGER-Lab/TIGERScore-7B-V1.2) | [TIGERScore-13B](https://huggingface.co/TIGER-Lab/TIGERScore-13B-V1.2)

We present the MetricInstruct dataset, which is employed to fine-tune TIGERScore. The three underlying criteria for dataset construction are:

1. Dataset diversity: we choose 22 distinctive datasets as the source context to cover enough generation tasks.
2. Error coverage: we take system outputs generated from 50+ text generation systems to cover all types of errors and guarantee a balanced distribution.
3. Quality assurance: to ensure MetricInstruct contains in-depth error analysis, we sourced the analyses by prompting OpenAI GPT models and then filtered them with several heuristics to eliminate low-quality ones.

## Data Source

Our system outputs come from two channels, namely real-world system outputs and synthetic outputs. The real-world system outputs are obtained from real systems, which ensures that the error distribution is aligned with the one seen in practice. Check out our paper for more details.

| Task | Real-World Dataset | Output Source | Synthetic Dataset | Output Source |
|:--------:|:-----------------------------------------:|:--------------:|:-----------------------------------:|:--------------:|
| Summarization | SummEval, XSum, Newsroom, SAMSum | 27 Systems | CNN/DM, XSum, Gigaword, SAMSum | GPT-4 |
| Translation | WMT | 18 Systems | WMT | GPT-4 |
| Data-to-Text | WebNLG-2020, WikiTableText, ToTTo | 17 Systems | WikiTableText, DART, ToTTo | GPT-4 |
| Long-Form QA | ASQA, FeTaQA, Cosmos QA, ELI5 | 5 Systems | ASQA, FeTaQA, Cosmos QA, ELI5 | GPT-4 |
| MathQA | GSM8K | 5 Systems | N/A | N/A |
| Instruct | MixInstruct | 11 Systems | AlpacaFarm, OASST1, Guanaco, Dolly | GPT-4 |

## Data Format

The dataset consists of 44K quadruples in the form of (instruction, input, system output, error analysis). For each item in the dataset, `instruction` is its task instruction, `input_context` is its input source, `hypo_output` is the generated output, and `errors` is the error analysis given by ChatGPT or GPT-4.

## Formatting

To format the data fields into a single prompt for fine-tuning or testing, we provide the following code for reference:

```python
from string import Template

FINETUNE_INST = "You are evaluating errors in a model-generated output for a given instruction."
FINETUNE_INPUT = """\
Instruction: ${generation_instruction}
${input_context}

Model-generated Output:
${hypothesis_output}

For each error you give in the response, please also elaborate the following information:
- error location (the words that are wrong in the output)
- error aspect it belongs to.
- explanation why it's an error, and the correction suggestions.
- severity of the error ("Major" or "Minor").
- reduction of score (between 0.5 and 5 given the severity of the error)

Your evaluation output:
"""

# `instruction`, `input_context`, and `hypo_output` are the fields of one dataset item;
# `tigerscore_tokenizer` and `tigerscore_model` are assumed to be loaded already.
inst_part = Template(FINETUNE_INST)
inst_part = inst_part.substitute()
input_part = Template(FINETUNE_INPUT)
input_part = input_part.substitute(
    generation_instruction=instruction,
    input_context=input_context,
    hypothesis_output=hypo_output
)
# Join the two parts, trim surrounding whitespace, and end with a single newline.
prompt = (inst_part + "\n" + input_part).strip("\n ") + "\n"
encodings = tigerscore_tokenizer(prompt, return_tensors="pt")
input_ids = encodings["input_ids"].to(tigerscore_model.device)
attention_mask = encodings["attention_mask"].to(tigerscore_model.device)
```

Example of formatted prompt:

```txt
You are evaluating errors in a model-generated output for a given instruction.
Instruction: Translate the following text from German to English.
Der künftige EM-Cheforganisator Philipp Lahm soll laut Grindel im DFB-Präsidium mitarbeiten.

Model-generated Output:
According to Grindel, the future head of the European Championships, Philipp Lahm, is to participate in the DFB Presidency.

For each error you give in the response, please also elaborate the following information:
- error location (the words that are wrong in the output)
- error aspect it belongs to.
- explanation why it's an error, and the correction suggestions.
- severity of the error ("Major" or "Minor").
- reduction of score (between 0.5 and 5 given the severity of the error)

Your evaluation output:
```

## Citation

```
@article{jiang2023TIGERScore,
  title={TIGERScore: Towards Building Explainable Metric for All Text Generation Tasks},
  author={Dongfu Jiang and Yishan Li and Ge Zhang and Wenhao Huang and Bill Yuchen Lin and Wenhu Chen},
  journal={arXiv preprint arXiv:2310.00752},
  year={2023}
}
```
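The formatting snippet above depends on an already loaded tokenizer and model. As a minimal sketch that runs without either, the same template can be wrapped in a helper that turns one record into a prompt/target pair for fine-tuning. The helper name `build_example`, the output keys `prompt` and `completion`, and the assumption that `errors` is stored as plain text are all illustrative and not part of the dataset card. The `errors` value in the sample record is a placeholder, not real data.

```python
from string import Template

FINETUNE_INST = "You are evaluating errors in a model-generated output for a given instruction."
FINETUNE_INPUT = """\
Instruction: ${generation_instruction}
${input_context}

Model-generated Output:
${hypothesis_output}

For each error you give in the response, please also elaborate the following information:
- error location (the words that are wrong in the output)
- error aspect it belongs to.
- explanation why it's an error, and the correction suggestions.
- severity of the error ("Major" or "Minor").
- reduction of score (between 0.5 and 5 given the severity of the error)

Your evaluation output:
"""

# Field names as described in the Data Format section.
REQUIRED_FIELDS = ("instruction", "input_context", "hypo_output", "errors")


def build_example(record: dict) -> dict:
    """Turn one MetricInstruct-style record into a prompt/target pair.

    Hypothetical helper: the returned key names are an assumption,
    and `errors` is assumed to be a plain string.
    """
    missing = [name for name in REQUIRED_FIELDS if name not in record]
    if missing:
        raise KeyError(f"record is missing fields: {missing}")
    input_part = Template(FINETUNE_INPUT).substitute(
        generation_instruction=record["instruction"],
        input_context=record["input_context"],
        hypothesis_output=record["hypo_output"],
    )
    # Same joining rule as the card's snippet: trim, then end with one newline.
    prompt = (FINETUNE_INST + "\n" + input_part).strip("\n ") + "\n"
    return {"prompt": prompt, "completion": record["errors"]}


# Sample record reusing the translation example from this card.
record = {
    "instruction": "Translate the following text from German to English.",
    "input_context": "Der künftige EM-Cheforganisator Philipp Lahm soll laut "
                     "Grindel im DFB-Präsidium mitarbeiten.",
    "hypo_output": "According to Grindel, the future head of the European "
                   "Championships, Philipp Lahm, is to participate in the DFB Presidency.",
    "errors": "<error analysis text goes here>",  # placeholder, not real data
}

example = build_example(record)
print(example["prompt"])
```

Failing early on a missing field keeps a malformed record from silently producing a prompt that contains a literal `${...}` placeholder.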


Provider: maas

Created: 2024-05-29
