McGill-NLP/mlquestions

Name: McGill-NLP/mlquestions
Creator: McGill-NLP
Published: 2021-11-11 10:01:14
License: No description provided

Source: Hugging Face. Updated: 2021-11-11. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/McGill-NLP/mlquestions

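Summary:

The MLQuestions dataset contains questions drawn from Google search queries and passages from Wikipedia articles, all related to the machine learning domain. It was created to support research on domain adaptation of question generation and passage retrieval models. The text in the dataset is in English.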

Provider:

McGill-NLP

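Summary of original information

Dataset overview

Dataset name

Name: MLQuestions

Dataset description

Abstract: The MLQuestions dataset contains Google search query questions and Wikipedia page passages related to the machine learning domain. It is intended to support research on domain adaptation of question generation and passage retrieval models.
Language: The text in the dataset is in English.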

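Dataset structure

Data instances: The dataset is released with a development set and a test set. Each data point contains a passage labelled input_text and a question labelled target_text.
Example:

{"input_text": "Bayesian learning uses Bayes theorem to determine the conditional probability of a hypotheses given some evidence or observations.", "target_text": "What is Bayesian learning in machine learning"}
Additional files: Two separate files, passages_unaligned.csv and questions_unaligned.csv, contain the unaligned passages and questions, labelled input_text and target_text respectively.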
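
The record layout described above can be read with Python's standard csv module. The following is a minimal sketch, not an official loader: it assumes the files are ordinary comma-separated CSVs with a header row naming the input_text and target_text columns, and it uses an in-memory sample built from the example record instead of a real file. The helper names read_pairs and read_column are hypothetical.

```python
import csv
import io


def read_pairs(handle):
    """Return (passage, question) tuples from an aligned CSV file object."""
    reader = csv.DictReader(handle)
    return [(row["input_text"], row["target_text"]) for row in reader]


def read_column(handle, column):
    """Return the non-empty values of one column, e.g. from an unaligned file."""
    reader = csv.DictReader(handle)
    return [row[column] for row in reader if row.get(column)]


# In-memory stand-in for an aligned file (assumed layout: header row, two columns).
sample = io.StringIO(
    "input_text,target_text\n"
    '"Bayesian learning uses Bayes theorem to determine the conditional '
    'probability of a hypotheses given some evidence or observations.",'
    "What is Bayesian learning in machine learning\n"
)
pairs = read_pairs(sample)
passage, question = pairs[0]
print(question)

# With the real files on disk (assuming the passages file has an input_text
# column and the questions file has a target_text column):
#   with open("passages_unaligned.csv", newline="", encoding="utf-8") as f:
#       passages = read_column(f, "input_text")
#   with open("questions_unaligned.csv", newline="", encoding="utf-8") as f:
#       questions = read_column(f, "target_text")
```

Because the unaligned passages and questions are not paired, they are read as two independent lists here; only the development and test sets yield (passage, question) tuples.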

Additional information

License information: see LICENSE.md for details
Citation:

@inproceedings{kulshreshtha-etal-2021-back,
  title = "Back-Training excels Self-Training at Unsupervised Domain Adaptation of Question Generation and Passage Retrieval",
  author = "Kulshreshtha, Devang and Belfer, Robert and Serban, Iulian Vlad and Reddy, Siva",
  booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
  month = nov,
  year = "2021",
  address = "Online and Punta Cana, Dominican Republic",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2021.emnlp-main.566",
  pages = "7064--7078",
  abstract = "In this work, we introduce back-training, an alternative to self-training for unsupervised domain adaptation (UDA). While self-training generates synthetic training data where natural inputs are aligned with noisy outputs, back-training results in natural outputs aligned with noisy inputs. This significantly reduces the gap between target domain and synthetic data distribution, and reduces model overfitting to source domain. We run UDA experiments on question generation and passage retrieval from the Natural Questions domain to machine learning and biomedical domains. We find that back-training vastly outperforms self-training by a mean improvement of 7.8 BLEU-4 points on generation, and 17.6{%} top-20 retrieval accuracy across both domains. We further propose consistency filters to remove low-quality synthetic data before training. We also release a new domain-adaptation dataset - MLQuestions containing 35K unaligned questions, 50K unaligned passages, and 3K aligned question-passage pairs.",
}
