MDCure

Source: GitHub. Updated 2024-11-02; indexed 2024-11-28.

Download link:

https://github.com/yale-nlp/MDCure

Resource overview:

MDCure is a scalable approach for synthetic data generation and management for multi-document instruction tuning. This dataset consists of multi-document instruction-answer pairs, designed to simulate complex real-world information synthesis tasks. Each entry contains a comprehensive instruction, a context composed of multiple documents, and a structured answer.

Created:

2024-10-30

Summary of original information

MDCure: A Scalable Pipeline for Multi-Document Instruction-Following

🗂 Datasets

MDCure datasets are available on HuggingFace and consist of multi-document instruction-answer pairs.

Dataset	HuggingFace Repo	Description
MDCure-12k	🤗 HF Repo	Multi-document instruction dataset of size 12K, filtered using MDCureRM
MDCure-36k	🤗 HF Repo	Multi-document instruction dataset of size 36K, filtered using MDCureRM
MDCure-72k	🤗 HF Repo	Multi-document instruction dataset of size 72K, filtered using MDCureRM

Sample Dataset Entries

Each dataset entry includes an instruction, context, and answer.

Instruction	Context	Answer
How did the combination of weather conditions, game dynamics, and viewership trends contribute to the overall perception of the Falcons-Eagles season opener, and what implications might this have for the NFL's ratings going forward? Respond with 3-4 sentences.	In what had to be a disappointing night for both NBC and the NFL, the Falcons and Eagles played a fairly ugly game after a lengthy weather delay. That's not how the league wanted to kick off the regular season, and while it got sort of exciting at the end, the overall recipe was a bad one for ratings. That was confirmed today, as the overnights came in...	The season opener between the Atlanta Falcons and Philadelphia Eagles faced a significant weather delay that postponed the game's start until 9:05 p.m. ET, leading to a lackluster atmosphere and frustration among fans eager for the NFL to return. The match itself was characterized by a dismal performance, marked by a total of 26 penalties compared to only 30 points scored, which caused offensive excitement to dwindle, especially...
What happened in CHAMPAIGN regarding Lovie Smith and the 2019 defense improvements? Respond with 1-2 sentences.	CHAMPAIGN - Lovie Smith knows his defense has to take a significant step forward in 2019. After defensive coordinator Hardy Nickerson left during the fall due to health concerns, Smith took over as the primary play-caller. This offseason, he didn't hire a defensive coordinator...	Lovie Smith took over as the defensive play-caller for 2019 after the previous coordinator, Hardy Nickerson, stepped down due to health concerns. Smith decided not to hire a replacement and instead took on the responsibilities himself. The defense has shown significant improvement in the spring practices.
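
To make the three-field layout concrete, the following is a minimal Python sketch of one entry. The field names mirror the table headers; the class name `MDCureEntry`, the helper `to_prompt`, and the way context and instruction are concatenated are assumptions for illustration, not something the README specifies. The context and answer strings are shortened from the second sample row.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MDCureEntry:
    """One example with the three fields from the table: instruction, context, answer."""
    instruction: str
    context: str
    answer: str


def to_prompt(entry: MDCureEntry) -> str:
    """Build a model input. Placing the context before the instruction is an assumption."""
    return f"{entry.context}\n\n{entry.instruction}"


# Strings shortened from the second sample row above.
entry = MDCureEntry(
    instruction=(
        "What happened in CHAMPAIGN regarding Lovie Smith and the 2019 "
        "defense improvements? Respond with 1-2 sentences."
    ),
    context="CHAMPAIGN - Lovie Smith knows his defense has to take a significant step forward in 2019.",
    answer="Lovie Smith took over as the defensive play-caller for 2019.",
)
prompt = to_prompt(entry)
```

In a fine-tuning setup, `prompt` would be the model input and `entry.answer` the training target.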

⚙️ Models

MDCured models are instruction-tuned from various base models to exhibit superior performance on multi-document tasks.

Model	HuggingFace Repo	Description
MDCureRM	🤗 HF Repo	Multi-objective reward model to filter MD instruction data more cheaply and effectively than GPT-3.5-Turbo.
MDCure-FlanT5-Base	🤗 HF Repo	FlanT5-Base fine-tuned with MDCure-72k, optimized for MD tasks.
MDCure-FlanT5-Large	🤗 HF Repo	FlanT5-Large fine-tuned with MDCure-72k, optimized for MD tasks.
MDCure-Qwen2-1.5B-Instruct	🤗 HF Repo	Qwen2-1.5B-Instruct fine-tuned with MDCure-72k, optimized for MD tasks.
MDCure-Qwen2-7B-Instruct	🤗 HF Repo	Qwen2-7B-Instruct fine-tuned with MDCure-72k, optimized for MD tasks.
MDCure-LLAMA3.1-8B-Instruct	🤗 HF Repo	LLAMA3.1-8B-Instruct fine-tuned with MDCure-72k, optimized for MD tasks.
MDCure-LLAMA3.1-70B-Instruct	🤗 HF Repo	LLAMA3.1-70B-Instruct fine-tuned with MDCure-72k, optimized for MD tasks.

🛠 MDCure Dataset Construction

The MDCure dataset construction involves two phases:

Generation Phase: Zero-shot prompt templates are used to generate complex, cross-text instructions from related documents.
Filtering Phase: The generated instructions are filtered by MDCureRM, a multi-objective reward model, to ensure quality and diversity.

📑 0. Source Data Preparation

NewSHead Dataset: Used as the source for sets of related context documents.
Snippet Pairs: Pairs of snippets ranging from 1-3 sentences selected from different documents within each cluster.
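
The snippet-pair step can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the regex sentence splitter, the uniform random choice of snippet position and length, and the choice to pair every two documents in a cluster are all hypothetical, and the example cluster is made up. Only the constraints "1-3 sentences" and "different documents within each cluster" come from the description above.

```python
import itertools
import random
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after ., ! or ? followed by whitespace (an assumption)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def sample_snippet(doc: str, rng: random.Random, max_len: int = 3) -> str:
    """Pick a contiguous run of 1 to max_len sentences from a non-empty document."""
    sentences = split_sentences(doc)
    length = rng.randint(1, min(max_len, len(sentences)))
    start = rng.randint(0, len(sentences) - length)
    return " ".join(sentences[start:start + length])


def snippet_pairs(cluster: list[str], rng: random.Random) -> list[tuple[str, str]]:
    """One snippet pair per pair of distinct documents (pairing scheme is hypothetical)."""
    docs = [d for d in cluster if split_sentences(d)]  # skip empty documents
    return [
        (sample_snippet(a, rng), sample_snippet(b, rng))
        for a, b in itertools.combinations(docs, 2)
    ]


# Made-up cluster of three related documents, for illustration only.
cluster = [
    "The game started late. Rain delayed kickoff. Fans waited for hours. The stadium stayed full.",
    "Ratings fell overnight. The network blamed the delay.",
    "Penalties slowed the game.",
]
pairs = snippet_pairs(cluster, random.Random(0))
```

Because each pair is built from two different documents, any instruction generated from it has to draw on more than one source.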

✏️ 1. Generation Phase

Prompt Templates: Two types of prompt templates (General & Style-Specific) are used to generate instruction data.
Generator Model: GPT-3.5-Turbo is used as the generator model.
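
As a rough illustration of how the two template types might be filled in, the sketch below builds a zero-shot prompt from a snippet pair. The template wording is hypothetical and is not taken from the MDCure repository; the length directive in the style-specific variant is modelled on the "Respond with 1-2 sentences." suffix seen in the sample entries. The call to the generator model is deliberately left out.

```python
from typing import Optional

# Hypothetical template text; the actual MDCure templates are not reproduced here.
GENERAL_TEMPLATE = (
    "Snippet 1: {snippet_a}\n"
    "Snippet 2: {snippet_b}\n\n"
    "Write one instruction that can only be answered by combining both snippets, "
    "then write its answer."
)

# Style-specific variant: adds a length directive like the one in the sample entries.
STYLE_TEMPLATE = GENERAL_TEMPLATE + " The instruction must end with: Respond with {length}."


def build_prompt(snippet_a: str, snippet_b: str, length: Optional[str] = None) -> str:
    """Fill the general template, or the style-specific one when a length is given."""
    if length is None:
        return GENERAL_TEMPLATE.format(snippet_a=snippet_a, snippet_b=snippet_b)
    return STYLE_TEMPLATE.format(snippet_a=snippet_a, snippet_b=snippet_b, length=length)


general = build_prompt("Rain delayed kickoff.", "Ratings fell overnight.")
styled = build_prompt("Rain delayed kickoff.", "Ratings fell overnight.", length="1-2 sentences")
```

The resulting string would then be sent to the generator model (GPT-3.5-Turbo, per the description above) with no in-context examples, which is what makes the prompt zero-shot.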

🔍 2. Filtering Phase

MDCureRM: A fine-grained, MD-specific reward model used to evaluate instruction-answer pairs based on six criteria:
- Context Integration
- Inter-Document Relationships
- Complexity
- Relevance
- Coherence & Factuality
- Creativity
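
One way to picture the filtering step is as scoring each candidate on the six criteria and keeping the highest-scoring ones. The sketch below assumes that per-criterion scores are already available as numbers and combines them with a weighted average; the aggregation rule, the equal default weights, the score scale, and the dictionary layout are assumptions, since the summary above does not say how MDCureRM combines its criteria.

```python
from typing import Optional

# The six criteria listed above, as dictionary keys.
CRITERIA = (
    "context_integration",
    "inter_document_relationships",
    "complexity",
    "relevance",
    "coherence_factuality",
    "creativity",
)


def overall_score(scores: dict[str, float], weights: Optional[dict[str, float]] = None) -> float:
    """Weighted average over the six criteria (equal weights are an assumption)."""
    weights = weights or {c: 1.0 for c in CRITERIA}
    total_weight = sum(weights[c] for c in CRITERIA)
    return sum(scores[c] * weights[c] for c in CRITERIA) / total_weight


def keep_top(candidates: list[dict], k: int) -> list[dict]:
    """Keep the k candidates with the highest overall score."""
    ranked = sorted(candidates, key=lambda c: overall_score(c["scores"]), reverse=True)
    return ranked[:k]


# Made-up scores standing in for reward-model output.
candidates = [
    {"id": "a", "scores": {c: 2.0 for c in CRITERIA}},
    {"id": "b", "scores": {c: 4.0 for c in CRITERIA}},
    {"id": "c", "scores": {c: 3.0 for c in CRITERIA}},
]
kept = keep_top(candidates, k=2)
```

Under this scheme, the 12K, 36K, and 72K dataset sizes would correspond to different values of `k`, though whether the released datasets were selected by top-k or by a score threshold is not stated here.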

🖥️ MDCure Instruction Tuning

Details on the instruction tuning process are not provided in the README.

📊 Evaluation

Details on the evaluation process are not provided in the README.

📝 Citation

Details on how to cite the dataset are not provided in the README.

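Collected summary

Dataset introduction

Construction

The MDCure dataset is built in two main phases: generation and filtering. In the generation phase, zero-shot prompt templates are used to produce complex, cross-text instructions from related documents. In the filtering phase, the generated instructions are screened by MDCureRM, a multi-objective reward model, to ensure quality and diversity. The process is intended to produce a high-quality multi-document instruction dataset in an automated, efficient way.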

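Features

The main feature of the MDCure dataset is that it is designed specifically for multi-document instruction tuning. The dataset is scalable in size, and fine-grained filtering by MDCureRM ensures that the instructions are high quality and diverse. Each entry contains a comprehensive instruction, a context composed of multiple documents, and a structured answer, designed to simulate complex real-world information synthesis tasks.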

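Usage

The MDCure datasets can be downloaded and used through the HuggingFace platform. Users can choose among versions of different sizes, such as MDCure-12k, MDCure-36k, and MDCure-72k. The sample entries provided illustrate the structure and content of the data. The dataset also comes with detailed construction and usage instructions to help users understand it and apply it to multi-document instruction tuning tasks.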
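
A minimal loading sketch with the Hugging Face `datasets` library is shown below. The repository IDs are an assumption: this page lists the dataset names but not their HuggingFace paths, so the `yale-nlp/MDCure-<size>` pattern is inferred from the GitHub organization name and should be checked against the actual dataset pages. The helper names `repo_id_for` and `load_mdcure` are hypothetical.

```python
SIZES = ("12k", "36k", "72k")


def repo_id_for(size: str, org: str = "yale-nlp") -> str:
    """Build a HuggingFace repo ID. The '<org>/MDCure-<size>' pattern is an assumption."""
    if size not in SIZES:
        raise ValueError(f"unknown size {size!r}; expected one of {SIZES}")
    return f"{org}/MDCure-{size}"


def load_mdcure(size: str = "12k"):
    """Download one dataset version; needs `pip install datasets` and network access."""
    from datasets import load_dataset  # imported lazily so the helper above works without it

    return load_dataset(repo_id_for(size))
```

Calling `load_mdcure("12k")` would download the smallest version first, which is a cheap way to inspect the instruction, context, and answer fields before committing to the larger ones.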

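Background and challenges

Background overview

The MDCure dataset was developed by the Yale University natural language processing (NLP) team to address synthetic data generation and filtering for multi-document instruction tuning. It is built with a two-phase generation and filtering pipeline: zero-shot prompt templates generate complex cross-text instructions, and the multi-objective reward model MDCureRM filters them for quality. MDCure improves model performance on multi-document tasks and provides new data resources and methodological support for research in related areas.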

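Current challenges

Building the MDCure dataset involves several challenges. First, generating complex, cross-text instructions requires designing effective zero-shot prompt templates, which in turn requires a good understanding of the generator model's capabilities. Second, the filtering phase depends on MDCureRM, which must assess instruction quality along multiple dimensions (context integration, inter-document relationships, complexity, relevance, coherence and factuality, and creativity), placing high demands on the model's training and tuning. Finally, construction involves a large number of API calls and substantial compute, so improving efficiency while maintaining quality is also an important challenge.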

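Common scenarios

Typical use cases

The typical use case for the MDCure dataset is multi-document instruction tuning. By generating and filtering high-quality multi-document instruction-answer pairs, the dataset provides rich training material for models. These pairs simulate complex real-world information synthesis tasks, so that models handle multi-document input with stronger synthesis ability and deeper understanding. With MDCure, researchers and developers can improve model performance in multi-document settings, especially where information from several documents must be combined to produce an accurate answer.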

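Derived and related work

The release of the MDCure dataset has prompted a range of related research. First, the multi-objective reward model MDCureRM has become an active research topic, with researchers building on it to refine and extend the functionality and application scope of reward models. Second, use of the MDCure dataset has advanced multi-document instruction tuning techniques and improved the performance of related models on multi-document tasks. In addition, the dataset has encouraged further research on data generation and filtering methods, offering new ideas and approaches for data engineering in natural language processing.

Recent research on the dataset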