MDCure-12k
收藏魔搭社区2025-11-12 更新2025-02-01 收录
下载链接:
https://modelscope.cn/datasets/yale-nlp/MDCure-12k
下载链接
链接失效反馈官方服务:
资源简介:
# MDCure-12k
[📄 Paper](https://arxiv.org/pdf/2410.23463) | [🤗 HF Collection](https://huggingface.co/collections/yale-nlp/mdcure-6724914875e87f41e5445395) | [⚙️ GitHub Repo](https://github.com/yale-nlp/MDCure)
## Introduction
**MDCure** is an effective and scalable procedure for generating high-quality multi-document (MD) instruction tuning data to improve MD capabilities of LLMs. Using MDCure, we construct a suite of MD instruction datasets complementary to collections such as [FLAN](https://github.com/google-research/FLAN) and fine-tune a variety of already instruction-tuned LLMs from the FlanT5, Qwen2, and LLAMA3.1 model families, up to 70B parameters in size. We additionally introduce **MDCureRM**, an evaluator model specifically designed for the MD setting to filter and select high-quality MD instruction data in a cost-effective, RM-as-a-judge fashion. Extensive evaluations on a wide range of MD and long-context benchmarks spanning various tasks show MDCure consistently improves performance over pre-trained baselines and over corresponding base models by up to 75.5%.
We release MDCure datasets of size 12k, 36k, and 72k. We also release MDCureRM and the best MDCure'd model for each architecture/size combination. To access all our models and datasets, please visit our [HF Collection](https://huggingface.co/collections/yale-nlp/mdcure-6724914875e87f41e5445395). For further details regarding dataset construction, please see our [paper](https://arxiv.org/pdf/2410.23463) and [Github repo](https://github.com/yale-nlp/MDCure). For additional details regarding how to use **yale-nlp/MDCure-FlanT5-Base**, please see below.
<p align="center">
<img src="fig1.png" width="75%">
</p>
<p align="center" style="margin-top: 0; padding-top: 0;">
<em>The MDCure pipeline generates diverse multi-document instructions, filters them via fine-grained scoring by MDCureRM, and tunes a base LLM to enhance its multi-document capabilities.</em>
</p>
## Dataset Details
**MDCure-12k** is an open-sourced dataset aimed at improving the multi-document instruction-following ability of LLMs. It consists of 12,000 multi-document instruction-answer pairs, where each instruction input contains 2 or more related documents from the [NewSHead](https://github.com/google-research-datasets/NewSHead) dataset followed by a multi-document question or prompt concerning the context documents. Each question or prompt additionally includes a brief sentence or phrase indicating the expected length of the answer, and each answer is a text of adhering to the specified length that provides a suitable response to the question or prompt.
The dataset is provided in parquet format and contains only training data. Each data sample contains the following attributes:
```
{
"instruction": [string] The input source documents and associated question or prompt,
followed by a brief direction regarding expected output length,
"answer": [string] The response to the instruction input,
"score": [float] The score issued by MDCureRM to the instruction-answer pair,
}
```
Following the MDCure pipeline, all questions/prompts and answers were generated using GPT-3.5-Turbo and subsequently scored and filtered using [**MDCureRM**](https://huggingface.co/yale-nlp/MDCureRM) to obtain the final high-quality instruction set culminating in MDCure-12k.
## Quickstart
You can download and use the **MDCure-12k** dataset via HF Datasets as follows:
```python
from datasets import load_dataset
dataset = load_dataset("yale-nlp/MDCure-12k")
# print the first training example
print(dataset["train"][0])
```
## All MDCure Models
We open-source our custom multi-document instruction scoring model, MDCureRM, as well as our best MDCure'd models at the following links:
| Model | Huggingface Repo | Description |
|---------------------------|---------------------|------------------------------|
| **MDCureRM** | [🤗 HF Repo](https://huggingface.co/yale-nlp/MDCureRM) | Multi-objective reward model to score and filter MD instruction data more cheaply and effectively than GPT-3.5-Turbo |
| **MDCure-FlanT5-Base** | [🤗 HF Repo](https://huggingface.co/yale-nlp/MDCure-FlanT5-Base) | **FlanT5-Base** fine-tuned with MDCure-72k |
| **MDCure-FlanT5-Large** | [🤗 HF Repo](https://huggingface.co/yale-nlp/MDCure-FlanT5-Large) | **FlanT5-Large** fine-tuned with MDCure-72k |
| **MDCure-Qwen2-1.5B-Instruct** | [🤗 HF Repo](https://huggingface.co/yale-nlp/MDCure-Qwen2-1.5B-Instruct) | **Qwen2-1.5B-Instruct** fine-tuned with MDCure-72k |
| **MDCure-Qwen2-7B-Instruct** | [🤗 HF Repo](https://huggingface.co/yale-nlp/MDCure-Qwen2-7B-Instruct) | **Qwen2-7B-Instruct** fine-tuned with MDCure-72k |
| **MDCure-LLAMA3.1-8B-Instruct** | [🤗 HF Repo](https://huggingface.co/yale-nlp/MDCure-LLAMA3.1-8B-Instruct) | **LLAMA3.1-8B-Instruct** fine-tuned with MDCure-72k |
| **MDCure-LLAMA3.1-70B-Instruct** | [🤗 HF Repo](https://huggingface.co/yale-nlp/MDCure-LLAMA3.1-70B-Instruct) | **LLAMA3.1-70B-Instruct** fine-tuned with MDCure-72 |
## Citation
If you find our work useful, please cite our paper as:
```bibtex
@article{liu2024mdcure,
title={MDCure: A Scalable Pipeline for Multi-Document Instruction-Following},
author={Gabrielle Kaili-May Liu and Bowen Shi and Avi Caciularu and Idan Szpektor and Arman Cohan},
journal={arXiv preprint arXiv:2410.23463},
year={2024},
url={https://arxiv.org/abs/2410.23463}
}
```
# MDCure-12k
[📄 论文](https://arxiv.org/pdf/2410.23463) | [🤗 Hugging Face 数据集合集](https://huggingface.co/collections/yale-nlp/mdcure-6724914875e87f41e5445395) | [⚙️ GitHub 仓库](https://github.com/yale-nlp/MDCure)
## 简介
**MDCure** 是一套高效且可扩展的流程,用于生成高质量多文档(Multi-Document, MD)指令微调数据,以提升大语言模型(Large Language Model, LLM)的多文档处理能力。我们借助MDCure构建了一套多文档指令数据集,作为[FLAN](https://github.com/google-research/FLAN)等现有数据集合集的补充,并针对FlanT5、Qwen2、LLAMA3.1模型家族中已完成指令微调的各类大语言模型(参数规模最高可达70B)进行微调。此外,我们还推出了**MDCureRM**——一款专为多文档场景设计的评估模型,能够以「奖励模型即评判者(RM-as-a-judge)」的低成本范式筛选并选取优质多文档指令数据。在覆盖各类任务的多文档与长上下文基准测试集上开展的大量评估表明,经MDCure处理后的模型性能相较预训练基线模型及对应基础模型提升最高可达75.5%,且性能表现始终更优。
我们发布了规模分别为12k、36k与72k的MDCure数据集,同时发布了MDCureRM以及针对每种架构/尺寸组合的最优MDCure微调模型。如需获取全部模型与数据集,请访问我们的[Hugging Face 数据集合集](https://huggingface.co/collections/yale-nlp/mdcure-6724914875e87f41e5445395)。有关数据集构建的更多细节,请参阅我们的[论文](https://arxiv.org/pdf/2410.23463)与[GitHub仓库](https://github.com/yale-nlp/MDCure)。若需了解如何使用**yale-nlp/MDCure-FlanT5-Base**,请参见下文。
<p align="center">
<img src="fig1.png" width="75%">
</p>
<p align="center" style="margin-top: 0; padding-top: 0;">
<em>MDCure流程可生成多样化的多文档指令,并通过MDCureRM的细粒度评分对其进行筛选,随后微调基础大语言模型以增强其多文档处理能力。</em>
</p>
## 数据集详情
**MDCure-12k** 是一款开源数据集,旨在提升大语言模型的多文档指令跟随能力。它包含12000条多文档指令-答案对,每条指令输入均包含2份及以上源自[NewSHead](https://github.com/google-research-datasets/NewSHead)数据集的相关文档,随后附带针对上述上下文文档的多文档问题或提示指令。每条问题或提示指令还会附带一句简短语句或短语,用以指明答案的预期长度;每条答案均为符合指定长度的文本,可对该问题或提示作出恰当回应。
该数据集以Parquet格式提供,仅包含训练数据。每条数据样本包含以下属性:
{
"instruction": [string] 输入源文档与关联的问题或提示,随后附带有关预期输出长度的简短说明,
"answer": [string] 针对指令输入的回应,
"score": [float] MDCureRM为该指令-答案对给出的评分
}
遵循MDCure流程,所有问题/提示与答案均通过GPT-3.5-Turbo生成,随后通过[**MDCureRM**](https://huggingface.co/yale-nlp/MDCureRM)进行评分与筛选,最终得到高质量的指令集,即MDCure-12k。
## 快速入门
你可以通过Hugging Face Datasets库下载并使用**MDCure-12k**数据集,示例代码如下:
python
from datasets import load_dataset
dataset = load_dataset("yale-nlp/MDCure-12k")
# 打印第一条训练样本
print(dataset["train"][0])
## 全部MDCure模型
我们开源了自研的多文档指令评分模型MDCureRM,以及最优的MDCure微调模型,相关链接如下:
| 模型名称 | Hugging Face 仓库地址 | 模型说明 |
|---------------------------|---------------------|------------------------------|
| **MDCureRM** | [🤗 Hugging Face 仓库](https://huggingface.co/yale-nlp/MDCureRM) | 多目标奖励模型,可相较于GPT-3.5-Turbo更廉价高效地对多文档指令数据进行评分与筛选 |
| **MDCure-FlanT5-Base** | [🤗 Hugging Face 仓库](https://huggingface.co/yale-nlp/MDCure-FlanT5-Base) | 基于MDCure-72k数据集微调的**FlanT5-Base**模型 |
| **MDCure-FlanT5-Large** | [🤗 Hugging Face 仓库](https://huggingface.co/yale-nlp/MDCure-FlanT5-Large) | 基于MDCure-72k数据集微调的**FlanT5-Large**模型 |
| **MDCure-Qwen2-1.5B-Instruct** | [🤗 Hugging Face 仓库](https://huggingface.co/yale-nlp/MDCure-Qwen2-1.5B-Instruct) | 基于MDCure-72k数据集微调的**Qwen2-1.5B-Instruct**模型 |
| **MDCure-Qwen2-7B-Instruct** | [🤗 Hugging Face 仓库](https://huggingface.co/yale-nlp/MDCure-Qwen2-7B-Instruct) | 基于MDCure-72k数据集微调的**Qwen2-7B-Instruct**模型 |
| **MDCure-LLAMA3.1-8B-Instruct** | [🤗 Hugging Face 仓库](https://huggingface.co/yale-nlp/MDCure-LLAMA3.1-8B-Instruct) | 基于MDCure-72k数据集微调的**LLAMA3.1-8B-Instruct**模型 |
| **MDCure-LLAMA3.1-70B-Instruct** | [🤗 Hugging Face 仓库](https://huggingface.co/yale-nlp/MDCure-LLAMA3.1-70B-Instruct) | 基于MDCure-72k数据集微调的**LLAMA3.1-70B-Instruct**模型 |
## 引用
若您认为我们的工作对您有所帮助,请按以下方式引用我们的论文:
bibtex
@article{liu2024mdcure,
title={MDCure: A Scalable Pipeline for Multi-Document Instruction-Following},
author={Gabrielle Kaili-May Liu and Bowen Shi and Avi Caciularu and Idan Szpektor and Arman Cohan},
journal={arXiv preprint arXiv:2410.23463},
year={2024},
url={https://arxiv.org/abs/2410.23463}
}
提供机构:
maas
创建时间:
2025-01-29



