MVPCorpus | Natural Language Generation Dataset | Multi-task Learning Dataset

Indexed from GitHub on 2025-01-14

Natural language generation

Multi-task learning

Download link:

https://github.com/RUCAIBox/MVP

Description:

MVPCorpus is a large-scale natural language generation (NLG) corpus released by Renmin University of China in June 2022. It collects 77 datasets across 11 NLG tasks: commonsense generation, data-to-text generation, open-domain dialogue, paraphrase generation, question answering, question generation, story generation, task-oriented dialogue, text simplification, text style transfer, and text summarization. MVPCorpus is used to pre-train the MVP (multi-task supervised pre-training) model: the inputs of every task are converted into a unified text-to-text format, and a text generation model is then pre-trained on them in a supervised fashion.
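The text-to-text conversion described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the instruction prefixes, the `[SEP]` and `|` separators, and the names `TASK_PREFIXES`, `TextPair`, `linearize_triples`, and `to_text_to_text` are all assumptions introduced here, since the page only states that inputs are unified into a text-to-text format.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

# Hypothetical instruction prefixes; the source does not give the exact wording.
TASK_PREFIXES = {
    "summarization": "Summarize: ",
    "data-to-text": "Describe the following data: ",
}

Triple = Tuple[str, str, str]


@dataclass
class TextPair:
    """One training example after conversion: plain text in, plain text out."""
    source: str
    target: str


def linearize_triples(triples: List[Triple]) -> str:
    """Flatten structured (subject, relation, object) triples into one string."""
    return " [SEP] ".join(" | ".join(triple) for triple in triples)


def to_text_to_text(task: str, raw_input: Union[str, List[Triple]], target: str) -> TextPair:
    """Convert a task-specific input into a prefixed source string."""
    prefix = TASK_PREFIXES[task]
    source = raw_input if isinstance(raw_input, str) else linearize_triples(raw_input)
    return TextPair(source=prefix + source, target=target)
```

With this shape, a summarization article and a data-to-text record both become an ordinary (source, target) string pair, which is what lets one encoder-decoder model be trained on all tasks.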

Provider:

Renmin University of China

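Summary of original information

MVP dataset overview

Basic information

Name: MVP (Multi-task Supervised Pre-training for Natural Language Generation)
Architecture: standard Transformer encoder-decoder
Type: supervised pre-trained natural language generation model
Highlight: includes task-specific soft prompts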

Supported tasks and corresponding datasets

Text summarization

CNN/Daily Mail (cnndm)
XSum (xsum)
SAMSum (samsum)
WLE (wle)

Open-domain dialogue

PersonaChat (pc)
DailyDialog (dd)
DSTC7-AVSD (da)
SGD (sgd)

Data-to-text generation

WebNLG v2.1 (webnlg)
WebNLG v3.0 (webnlg2)
WikiBio (wikibio)
E2E (e2e)
DART (dart)
ToTTo (totto)

Question generation

SQuAD (squadqg)
CoQA (coqaqg)

Story generation

ROCStories (roc)
WritingPrompts (wp)

Question answering

SQuAD (squad)
CoQA (coqa)

Task-oriented dialogue

MultiWOZ 2.0 (multiwoz)

Commonsense generation

CommonGen (cg)

Text simplification

WikiAuto + Turk/ASSET (wia)

Paraphrase generation

Quora (quora)

Text style transfer

GYAFC-E&M (gyafc_em)
GYAFC-F&R (gyafc_fr)
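For scripting over these tasks, the listing above can be captured as a small lookup table. The short dataset identifiers are copied from the list; the dictionary name `MVP_TASK_DATASETS`, the task keys, and the helper `task_of` are illustrative names introduced here, not part of any released code.

```python
from typing import Dict, List

# Task -> dataset identifiers, transcribed from the listing on this page.
MVP_TASK_DATASETS: Dict[str, List[str]] = {
    "text summarization": ["cnndm", "xsum", "samsum", "wle"],
    "open-domain dialogue": ["pc", "dd", "da", "sgd"],
    "data-to-text generation": ["webnlg", "webnlg2", "wikibio", "e2e", "dart", "totto"],
    "question generation": ["squadqg", "coqaqg"],
    "story generation": ["roc", "wp"],
    "question answering": ["squad", "coqa"],
    "task-oriented dialogue": ["multiwoz"],
    "commonsense generation": ["cg"],
    "text simplification": ["wia"],
    "paraphrase generation": ["quora"],
    "text style transfer": ["gyafc_em", "gyafc_fr"],
}


def task_of(dataset: str) -> str:
    """Return the task that a dataset identifier belongs to."""
    for task, names in MVP_TASK_DATASETS.items():
        if dataset in names:
            return task
    raise KeyError(f"unknown dataset identifier: {dataset}")
```

Note that this page lists only a subset of the 77 datasets mentioned in the description.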

Model access

Base model: RUCAIBox/mvp
Task-specific prompt models: RUCAIBox/mvp-[task_name]
Multi-task pre-trained variant: RUCAIBox/mvp-multi-task
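The naming scheme above maps directly onto checkpoint identifiers. The sketch below builds those identifiers and, optionally, loads a checkpoint with the Hugging Face `transformers` library, assuming an installed version that ships `MvpTokenizer` and `MvpForConditionalGeneration`. The helper names `mvp_checkpoint` and `load_mvp` are hypothetical, and the exact `task_name` strings should be checked against the model repository linked on this page.

```python
from typing import Optional

HUB_NAMESPACE = "RUCAIBox"


def mvp_checkpoint(task_name: Optional[str] = None) -> str:
    """Build a checkpoint identifier following the RUCAIBox/mvp-[task_name] scheme."""
    if task_name is None:
        return f"{HUB_NAMESPACE}/mvp"
    return f"{HUB_NAMESPACE}/mvp-{task_name}"


def load_mvp(task_name: Optional[str] = None):
    """Download and return (tokenizer, model) for the chosen checkpoint.

    transformers is imported lazily so the naming helper above works
    even when the library is not installed.
    """
    from transformers import MvpForConditionalGeneration, MvpTokenizer

    name = mvp_checkpoint(task_name)
    tokenizer = MvpTokenizer.from_pretrained(name)
    model = MvpForConditionalGeneration.from_pretrained(name)
    return tokenizer, model


# Example use (requires network access and the transformers library):
#   tokenizer, model = load_mvp()
#   inputs = tokenizer("Summarize: <article text>", return_tensors="pt")
#   output_ids = model.generate(**inputs)
#   print(tokenizer.batch_decode(output_ids, skip_special_tokens=True))
```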

Related resources

Paper: https://arxiv.org/abs/2206.12131
Model repository: https://huggingface.co/models?filter=mvp
Dataset download: https://huggingface.co/RUCAIBox

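AI-collected summary

Dataset introduction

Construction

MVPCorpus is assembled from labeled datasets and is used for supervised pre-training of MVP, a model with a standard Transformer encoder-decoder architecture. MVP additionally introduces task-specific soft prompts that stimulate the model's capacity for the corresponding task.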

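Features

MVPCorpus is designed specifically for natural language generation and covers a wide range of generation tasks; models pre-trained on it can also be adapted to natural language understanding tasks. It supports 11 generation tasks: text summarization, open-domain dialogue, data-to-text generation, question generation, story generation, question answering, task-oriented dialogue, commonsense generation, text simplification, paraphrase generation, and text style transfer.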

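Usage

To fine-tune, run inference, or evaluate with MVPCorpus, first download the corresponding dataset. The provided code then runs fine-tuning, inference, and evaluation as a single pipeline. Users can choose among fine-tuning methods and models such as MVP, MVP+S/M, Single, or BART. Lightweight prompt tuning is also supported to improve performance on a specific task.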
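As a rough sketch of how such a pipeline run might be launched, the helper below assembles a command line for the TextBox entry script. The script name `run_textbox.py` and the flags `--model`, `--dataset`, and `--model_path` are assumptions based on the TextBox 2.0 conventions and are not confirmed by this page; verify them against the repository README before use. The function name `textbox_command` is hypothetical.

```python
import shlex


def textbox_command(dataset: str, model: str = "MVP", model_path: str = "RUCAIBox/mvp") -> str:
    """Assemble a shell command for one fine-tune/inference/evaluation run.

    Script and flag names are assumed, not taken from this page.
    """
    args = [
        "python",
        "run_textbox.py",
        f"--model={model}",
        f"--dataset={dataset}",
        f"--model_path={model_path}",
    ]
    # Quote each argument so the string is safe to paste into a shell.
    return " ".join(shlex.quote(arg) for arg in args)


print(textbox_command("samsum"))
```

Switching to another baseline would then only mean changing `model` and `model_path`, while `dataset` takes one of the short identifiers listed earlier.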

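Background and challenges

Background overview

MVPCorpus originates from the work of Tang et al. (2022), which aims to improve performance on natural language generation tasks through multi-task supervised pre-training. The implementation is based on TextBox 2.0, RUCAIBox's text generation library. The MVP model uses a standard Transformer encoder-decoder architecture and is pre-trained in a supervised manner on labeled datasets. MVPCorpus focuses on natural language generation and covers a variety of generation tasks, providing data and model resources for related research.

Current challenges

The main challenges in building MVPCorpus are how to combine multi-task learning effectively and how to make full use of labeled data during pre-training. The underlying research problem is improving the generalization and accuracy of models on natural language generation tasks. The diversity and scale of the collected datasets must also be considered, since both matter for model training and evaluation.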

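Common scenarios

Typical use cases

In natural language generation, MVPCorpus is used for supervised pre-training of models with a standard Transformer encoder-decoder architecture. Typical use cases include text summarization, open-domain dialogue, data-to-text generation, question generation, story generation, question answering, task-oriented dialogue, commonsense generation, text simplification, and text style transfer, which makes the corpus useful for building models that handle many tasks.

Academic problems addressed

Through multi-task supervised pre-training, MVPCorpus targets problems such as limited task data and poor model adaptability in natural language generation. It is intended to help pre-trained models perform well on specific tasks, and it supports lightweight prompt tuning, giving researchers an efficient route to task adaptation and fine-tuning.

Derived work

Work derived from MVPCorpus includes improvements to the MVP model, refinements of multi-task learning strategies, and further study of prompt tuning. According to this summary, these efforts have broadened research on natural language generation and helped move related technology toward commercial and industrial use.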

The content above was collected and summarized by AI.