
aitetic/WikiDialog-OQ

Hugging Face, updated 2026-03-31, indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/aitetic/WikiDialog-OQ
Resource description:
---
configs:
- config_name: default
  data_files:
  - split: train
    path: "WikiDialog-OQ.jsonl.gz"
  - split: validation
    path: "WikiDialog-OQ-validation.jsonl"
---

# WikiDialog-OQ

Dataset containing 11M information-seeking conversations from passages in English Wikipedia, publicly available. Each conversation was generated using the dialog inpainting method detailed in the paper using the Inpaint-OQ inpainter model, a T5-XXL model that was fine-tuned on OR-QuAC and QReCC using a dialog reconstruction loss.

### Abstract

Many important questions (e.g. "How to eat healthier?") require conversation to establish context and explore in depth. However, conversational question answering (ConvQA) systems have long been stymied by scarce training data that is expensive to collect. To address this problem, we propose a new technique for synthetically generating diverse and high-quality dialog data: dialog inpainting. Our approach takes the text of any document and transforms it into a two-person dialog between the writer and an imagined reader: we treat sentences from the article as utterances spoken by the writer, and then use a dialog inpainter to predict what the imagined reader asked or said in between each of the writer's utterances. By applying this approach to passages from Wikipedia and the web, we produce WikiDialog and WebDialog, two datasets totalling 19 million diverse information-seeking dialogs---1,000x larger than the largest existing ConvQA dataset. Furthermore, human raters judge the answer adequacy and conversationality of WikiDialog to be as good or better than existing manually-collected datasets. Using our inpainted data to pre-train ConvQA retrieval systems, we significantly advance state-of-the-art across three benchmarks (QReCC, OR-QuAC, TREC CaST) yielding up to 40% relative gains on standard evaluation metrics.

**Version**:

* 1.0.0 (default): Initial release.

**Examples**:

* Train: 11,264,129
* Validation: 113,822

```
@inproceedings{dai2022dialoginpainting,
  title={Dialog Inpainting: Turning Documents to Dialogs},
  author={Dai, Zhuyun and Chaganty, Arun Tejasvi and Zhao, Vincent and Amini, Aida and Green, Mike and Rashid, Qazi and Guu, Kelvin},
  booktitle={International Conference on Machine Learning (ICML)},
  year={2022},
  organization={PMLR}
}
```
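The card's config lists two data files, a gzip-compressed JSON Lines file for the train split and a plain JSON Lines file for validation. The following is a minimal sketch of how such files might be streamed once downloaded locally, using only the Python standard library. The helper names `iter_jsonl` and `peek` are hypothetical, and the record schema is not documented in this card, so the sketch makes no assumption about field names.

```python
from __future__ import annotations

import gzip
import json
from itertools import islice
from pathlib import Path
from typing import Any, Iterator


def iter_jsonl(path: str | Path) -> Iterator[dict[str, Any]]:
    """Yield one parsed JSON object per non-empty line.

    Files ending in ".gz" are decompressed on the fly, so the 11M-dialog
    train file never has to be fully loaded into memory.
    """
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def peek(path: str | Path, n: int = 3) -> list[dict[str, Any]]:
    """Return the first n records, handy for inspecting the unknown schema."""
    return list(islice(iter_jsonl(path), n))
```

Assuming the validation file has been downloaded to the working directory, `peek("WikiDialog-OQ-validation.jsonl", 1)` would return its first record, whose keys show the actual field names.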
Provider:
aitetic