yale-nlp/MDCure-72k

Hugging Face, updated 2024-11-01, indexed 2025-04-12
Download link:
https://hf-mirror.com/datasets/yale-nlp/MDCure-72k
Resource description:
---
language:
- en
license: apache-2.0
dataset_info:
  features:
  - name: instruction
    dtype: string
  - name: answer
    dtype: string
  - name: score
    dtype: float64
  splits:
  - name: train
    num_bytes: 797782722
    num_examples: 72000
  download_size: 461012455
  dataset_size: 797782722
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
size_categories:
- 10K<n<100K
task_categories:
- question-answering
- summarization
- text2text-generation
- text-generation
tags:
- multi-document
- instruction-tuning
- instruction-following
---

# MDCure-72k

[📄 Paper](https://arxiv.org/pdf/2410.23463) | [🤗 HF Collection](https://huggingface.co/collections/yale-nlp/mdcure-6724914875e87f41e5445395) | [⚙️ GitHub Repo](https://github.com/yale-nlp/MDCure)

## Introduction

**MDCure** is an effective and scalable procedure for generating high-quality multi-document (MD) instruction tuning data to improve MD capabilities of LLMs. Using MDCure, we construct a suite of MD instruction datasets complementary to collections such as [FLAN](https://github.com/google-research/FLAN) and fine-tune a variety of already instruction-tuned LLMs from the FlanT5, Qwen2, and LLAMA3.1 model families, up to 70B parameters in size. We additionally introduce **MDCureRM**, an evaluator model specifically designed for the MD setting to filter and select high-quality MD instruction data in a cost-effective, RM-as-a-judge fashion. Extensive evaluations on a wide range of MD and long-context benchmarks spanning various tasks show MDCure consistently improves performance over pre-trained baselines and over corresponding base models by up to 75.5%.

We release MDCure datasets of size 12k, 36k, and 72k. We also release MDCureRM and the best MDCure'd model for each architecture/size combination. To access all our models and datasets, please visit our [HF Collection](https://huggingface.co/collections/yale-nlp/mdcure-6724914875e87f41e5445395).

For further details regarding dataset construction, please see our [paper](https://arxiv.org/pdf/2410.23463) and [GitHub repo](https://github.com/yale-nlp/MDCure). For additional details regarding how to use **yale-nlp/MDCure-72k**, please see below.

<p align="center">
<img src="fig1.png" width="75%">
</p>
<p align="center" style="margin-top: 0; padding-top: 0;">
<em>The MDCure pipeline generates diverse multi-document instructions, filters them via fine-grained scoring by MDCureRM, and tunes a base LLM to enhance its multi-document capabilities.</em>
</p>

## Dataset Details

**MDCure-72k** is an open-sourced dataset aimed at improving the multi-document instruction-following ability of LLMs. It consists of 72,000 multi-document instruction-answer pairs, where each instruction input contains 2 or more related documents from the [NewSHead](https://github.com/google-research-datasets/NewSHead) dataset followed by a multi-document question or prompt concerning the context documents. Each question or prompt additionally includes a brief sentence or phrase indicating the expected length of the answer, and each answer is a text adhering to the specified length that provides a suitable response to the question or prompt.

The dataset is provided in parquet format and contains only training data. Each data sample contains the following attributes:

```
{
    "instruction": [string] The input source documents and associated question or prompt, followed by a brief direction regarding expected output length,
    "answer": [string] The response to the instruction input,
    "score": [float] The score issued by MDCureRM to the instruction-answer pair,
}
```

Following the MDCure pipeline, all questions/prompts and answers were generated using GPT-3.5-Turbo and subsequently scored and filtered using [**MDCureRM**](https://huggingface.co/yale-nlp/MDCureRM) to obtain the final high-quality instruction set culminating in MDCure-72k.

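As a rough illustration of how the `score` attribute might be used, the sketch below picks the highest-scoring instruction-answer pairs from a list of records. It assumes that a higher MDCureRM score means a better pair, which this card does not state explicitly. The helper `top_k_by_score`, the `MDCureSample` type, and the toy records (including their score values) are hypothetical and exist only for this example; they are not part of the MDCure codebase.

```python
from typing import Iterable, TypedDict


class MDCureSample(TypedDict):
    """One record with the three attributes listed in this card."""
    instruction: str
    answer: str
    score: float


def top_k_by_score(samples: Iterable[MDCureSample], k: int) -> list[MDCureSample]:
    """Return the k samples with the highest score, best first.

    Assumption: a higher MDCureRM score indicates a better pair.
    """
    return sorted(samples, key=lambda s: s["score"], reverse=True)[:k]


# Toy records invented for illustration; real instructions contain full
# source documents, and real scores may lie on a different scale.
toy_samples: list[MDCureSample] = [
    {"instruction": "Doc 1 ... Doc 2 ... Question A", "answer": "Answer A", "score": 3.2},
    {"instruction": "Doc 3 ... Doc 4 ... Question B", "answer": "Answer B", "score": 4.7},
    {"instruction": "Doc 5 ... Doc 6 ... Question C", "answer": "Answer C", "score": 4.1},
]

best_two = top_k_by_score(toy_samples, k=2)
print([s["score"] for s in best_two])  # [4.7, 4.1]
```

Rows of a dataset loaded with HF Datasets are dictionaries with the same three keys, so the helper should also accept `dataset["train"]`. For the full 72,000 rows, `dataset["train"].sort("score", reverse=True).select(range(k))` is likely the more economical route, since it avoids building a Python list of every record.
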
## Quickstart

You can download and use the **MDCure-72k** dataset via HF Datasets as follows:

```python
from datasets import load_dataset

dataset = load_dataset("yale-nlp/MDCure-72k")

# print the first training example
print(dataset["train"][0])
```

## All MDCure Models

We open-source our custom multi-document instruction scoring model, MDCureRM, as well as our best MDCure'd models at the following links:

| Model | Hugging Face Repo | Description |
|---------------------------|---------------------|------------------------------|
| **MDCureRM** | [🤗 HF Repo](https://huggingface.co/yale-nlp/MDCureRM) | Multi-objective reward model to score and filter MD instruction data more cheaply and effectively than GPT-3.5-Turbo |
| **MDCure-FlanT5-Base** | [🤗 HF Repo](https://huggingface.co/yale-nlp/MDCure-FlanT5-Base) | **FlanT5-Base** fine-tuned with MDCure-72k |
| **MDCure-FlanT5-Large** | [🤗 HF Repo](https://huggingface.co/yale-nlp/MDCure-FlanT5-Large) | **FlanT5-Large** fine-tuned with MDCure-72k |
| **MDCure-Qwen2-1.5B-Instruct** | [🤗 HF Repo](https://huggingface.co/yale-nlp/MDCure-Qwen2-1.5B-Instruct) | **Qwen2-1.5B-Instruct** fine-tuned with MDCure-72k |
| **MDCure-Qwen2-7B-Instruct** | [🤗 HF Repo](https://huggingface.co/yale-nlp/MDCure-Qwen2-7B-Instruct) | **Qwen2-7B-Instruct** fine-tuned with MDCure-72k |
| **MDCure-LLAMA3.1-8B-Instruct** | [🤗 HF Repo](https://huggingface.co/yale-nlp/MDCure-LLAMA3.1-8B-Instruct) | **LLAMA3.1-8B-Instruct** fine-tuned with MDCure-72k |
| **MDCure-LLAMA3.1-70B-Instruct** | [🤗 HF Repo](https://huggingface.co/yale-nlp/MDCure-LLAMA3.1-70B-Instruct) | **LLAMA3.1-70B-Instruct** fine-tuned with MDCure-72k |

## Citation

If you find our work useful, please cite our paper as:

```bibtex
@article{liu2024mdcure,
    title={MDCure: A Scalable Pipeline for Multi-Document Instruction-Following},
    author={Gabrielle Kaili-May Liu and Bowen Shi and Avi Caciularu and Idan Szpektor and Arman Cohan},
    journal={arXiv preprint arXiv:2410.23463},
    year={2024},
    url={https://arxiv.org/abs/2410.23463}
}
```
Provider:
yale-nlp