
McGill-NLP/mlquestions

Source: Hugging Face | Updated: 2021-11-11 | Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/McGill-NLP/mlquestions
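Resource overview:
The MLQuestions dataset contains questions drawn from Google search queries and passages drawn from Wikipedia articles, all related to the machine learning domain. It was created to support research on domain adaptation of question generation and passage retrieval models. The text in the dataset is in English.
Provider:
McGill-NLP
Summary of original information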

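Dataset overview

Dataset name

  • Name: MLQuestions

Dataset description

  • Summary: The MLQuestions dataset contains Google search query questions and Wikipedia page passages related to the machine learning domain. It is intended to support research on domain adaptation of question generation and passage retrieval models.
  • Language: The text in the dataset is in English.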

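Dataset structure

  • Data instances: The dataset is released as a development set and a test set. Each data point consists of a passage labeled input_text and a question labeled target_text.

  • Example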

    { input_text: Bayesian learning uses Bayes theorem to determine the conditional probability of a hypotheses given some evidence or observations. target_text: What is Bayesian learning in machine learning }

  • Additional files: Two separate files, passages_unaligned.csv and questions_unaligned.csv, contain the unaligned passages and the unaligned questions, labeled input_text and target_text respectively.
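
As a rough illustration of the structure described above, the following Python sketch reads records keyed by input_text and target_text. The source only names the labels, so the exact file layout (a header row and comma-separated values) is an assumption, and the helper names `AlignedPair`, `read_aligned_pairs`, and `read_column` are hypothetical. The sample row is the example shown above, held in memory rather than read from the real files.

```python
import csv
import io
from typing import Iterator, List, NamedTuple


class AlignedPair(NamedTuple):
    passage: str   # value of the input_text column
    question: str  # value of the target_text column


def read_aligned_pairs(handle) -> Iterator[AlignedPair]:
    """Yield passage/question pairs from a CSV with input_text and target_text columns.

    Assumes a header row and comma-separated values; the real files may differ.
    """
    for row in csv.DictReader(handle):
        yield AlignedPair(row["input_text"], row["target_text"])


def read_column(handle, column: str) -> List[str]:
    """Return every non-empty value of one column, e.g. from an unaligned file."""
    return [row[column] for row in csv.DictReader(handle) if row.get(column)]


# In-memory stand-in for a development or test file, using the example record.
sample = io.StringIO(
    "input_text,target_text\n"
    '"Bayesian learning uses Bayes theorem to determine the conditional probability '
    'of a hypotheses given some evidence or observations.",'
    "What is Bayesian learning in machine learning\n"
)
pairs = list(read_aligned_pairs(sample))
print(pairs[0].question)
```

Under the same assumptions, the unaligned files could be read one column at a time, for instance by passing an open handle for passages_unaligned.csv with the column name input_text to `read_column`, and likewise questions_unaligned.csv with target_text.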

Additional information

  • License information: See LICENSE.md for details.

  • Citation information

    @inproceedings{kulshreshtha-etal-2021-back,
        title = "Back-Training excels Self-Training at Unsupervised Domain Adaptation of Question Generation and Passage Retrieval",
        author = "Kulshreshtha, Devang and Belfer, Robert and Serban, Iulian Vlad and Reddy, Siva",
        booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
        month = nov,
        year = "2021",
        address = "Online and Punta Cana, Dominican Republic",
        publisher = "Association for Computational Linguistics",
        url = "https://aclanthology.org/2021.emnlp-main.566",
        pages = "7064--7078",
        abstract = "In this work, we introduce back-training, an alternative to self-training for unsupervised domain adaptation (UDA). While self-training generates synthetic training data where natural inputs are aligned with noisy outputs, back-training results in natural outputs aligned with noisy inputs. This significantly reduces the gap between target domain and synthetic data distribution, and reduces model overfitting to source domain. We run UDA experiments on question generation and passage retrieval from the Natural Questions domain to machine learning and biomedical domains. We find that back-training vastly outperforms self-training by a mean improvement of 7.8 BLEU-4 points on generation, and 17.6{%} top-20 retrieval accuracy across both domains. We further propose consistency filters to remove low-quality synthetic data before training. We also release a new domain-adaptation dataset - MLQuestions containing 35K unaligned questions, 50K unaligned passages, and 3K aligned question-passage pairs.",
    }
