McGill-NLP/mlquestions
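Dataset Overview
Dataset Name
- Name: MLQuestions
Dataset Description
- Summary: The MLQuestions dataset contains machine-learning questions drawn from Google search queries, together with passages from Wikipedia pages. It is intended to support research on domain adaptation of question generation and passage retrieval models.
- Language: The text in the dataset is in English.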
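Dataset Structure
- Data instances: The dataset is released with a development set and a test set. Each data point contains a passage labeled input_text and a question labeled target_text.
- Example:
{ input_text: Bayesian learning uses Bayes theorem to determine the conditional probability of a hypotheses given some evidence or observations. target_text: What is Bayesian learning in machine learning }
- Additional files: Two separate files, passages_unaligned.csv and questions_unaligned.csv, contain the unaligned passages and the unaligned questions, labeled input_text and target_text respectively.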
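The structure described above can be read with Python's standard csv module. The following is a minimal sketch, not the dataset's official loader. It assumes each file is a CSV with a header row naming the input_text and/or target_text columns. The helper names read_aligned_pairs and read_column are hypothetical, and the in-memory sample stands in for a real split because the file names of the development and test sets are not given here.

```python
import csv
import io


def read_aligned_pairs(fileobj):
    """Return (passage, question) tuples from a CSV that is assumed to
    have a header row with input_text and target_text columns."""
    reader = csv.DictReader(fileobj)
    return [(row["input_text"], row["target_text"]) for row in reader]


def read_column(fileobj, column):
    """Return the non-empty values of one column, e.g. input_text from
    passages_unaligned.csv or target_text from questions_unaligned.csv."""
    reader = csv.DictReader(fileobj)
    return [row[column] for row in reader if row.get(column)]


# In-memory stand-in for an aligned split (assumed layout, one data point).
sample = io.StringIO(
    "input_text,target_text\n"
    '"Bayesian learning uses Bayes theorem to determine the conditional '
    'probability of a hypotheses given some evidence or observations.",'
    "What is Bayesian learning in machine learning\n"
)

pairs = read_aligned_pairs(sample)
passage, question = pairs[0]
print(question)

# With the real files on disk, the unaligned data could be read like this:
# with open("passages_unaligned.csv", newline="", encoding="utf-8") as f:
#     passages = read_column(f, "input_text")
# with open("questions_unaligned.csv", newline="", encoding="utf-8") as f:
#     questions = read_column(f, "target_text")
```

Keeping the aligned and unaligned readers separate mirrors the dataset layout: aligned pairs feed supervised evaluation, while the unaligned passages and questions are the raw material for domain adaptation.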
Additional Information
- License information: See LICENSE.md for details.
- Citation information:
@inproceedings{kulshreshtha-etal-2021-back,
    title = "Back-Training excels Self-Training at Unsupervised Domain Adaptation of Question Generation and Passage Retrieval",
    author = "Kulshreshtha, Devang and Belfer, Robert and Serban, Iulian Vlad and Reddy, Siva",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.566",
    pages = "7064--7078",
    abstract = "In this work, we introduce back-training, an alternative to self-training for unsupervised domain adaptation (UDA). While self-training generates synthetic training data where natural inputs are aligned with noisy outputs, back-training results in natural outputs aligned with noisy inputs. This significantly reduces the gap between target domain and synthetic data distribution, and reduces model overfitting to source domain. We run UDA experiments on question generation and passage retrieval from the Natural Questions domain to machine learning and biomedical domains. We find that back-training vastly outperforms self-training by a mean improvement of 7.8 BLEU-4 points on generation, and 17.6{%} top-20 retrieval accuracy across both domains. We further propose consistency filters to remove low-quality synthetic data before training. We also release a new domain-adaptation dataset - MLQuestions containing 35K unaligned questions, 50K unaligned passages, and 3K aligned question-passage pairs.",
}



