ludwigschmidt/squadshifts

Source: Hugging Face. Updated: 2024-01-18. Indexed: 2024-05-25.
Download link:
https://hf-mirror.com/datasets/ludwigschmidt/squadshifts
Resource summary:
The SQuAD-shifts dataset is a question answering (QA) dataset consisting of four test sets from distinct domains: Wikipedia articles, New York Times articles, Reddit comments, and Amazon product reviews. These datasets were generated using the same data generation pipeline, Amazon Mechanical Turk interface, and data cleaning code as the original SQuAD v1.1 dataset. The new-wikipedia dataset is used to measure overfitting to the original SQuAD v1.1 dataset, while the New York Times, Reddit, and Amazon datasets are employed to evaluate robustness against natural distribution shifts. The dataset structure includes fields such as id, title, context, question, and answers, and all datasets are provided as test sets.
Provider:
ludwigschmidt
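Summary of original information

Dataset overview

Dataset name: SQuAD-shifts

Dataset size: 10K<n<100K

Language: English (en)

License: CC-BY-4.0

Multilinguality: monolingual

Task category: question answering (question-answering)

Task ID: extractive-qa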

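Dataset configurations:

  • new_wiki
    • Download size: 16.50 MB
    • Dataset size: 7.86 MB
    • Test examples: 7938
  • nyt
    • Download size: 16.50 MB
    • Dataset size: 10.79 MB
    • Test examples: 10065
  • reddit
    • Download size: 16.50 MB
    • Dataset size: 9.47 MB
    • Test examples: 9803
  • amazon
    • Download size: 16.50 MB
    • Dataset size: 9.44 MB
    • Test examples: 9885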
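
As a minimal sketch of how one configuration might be selected, the snippet below records the test-set sizes from the listing above and wraps a call to `load_dataset` from the Hugging Face `datasets` library. The helper names `load_shift` and `total_examples` are hypothetical, and whether the repository loads without extra arguments is an assumption that may depend on the installed library version.

```python
# Test-set sizes per configuration, copied from the listing above.
TEST_SET_SIZES = {
    "new_wiki": 7938,
    "nyt": 10065,
    "reddit": 9803,
    "amazon": 9885,
}

# Repository identifier as shown in the page title.
DATASET_ID = "ludwigschmidt/squadshifts"


def load_shift(config_name: str):
    """Load the test split of one configuration (hypothetical helper).

    Requires network access and the `datasets` package.
    """
    if config_name not in TEST_SET_SIZES:
        raise ValueError(f"unknown configuration: {config_name!r}")
    # Imported lazily so the size table stays usable without the library.
    from datasets import load_dataset

    # Every configuration is published as a test split only.
    return load_dataset(DATASET_ID, config_name, split="test")


def total_examples() -> int:
    """Total number of test examples across the four configurations."""
    return sum(TEST_SET_SIZES.values())
```

Calling `load_shift("reddit")` would request only the Reddit test set. The combined count from `total_examples()` falls inside the 10K<n<100K size category stated in the overview.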

Dataset features:

  • id: string
  • title: string
  • context: string
  • question: string
  • answers: dictionary containing
    • text: string
    • answer_start: integer (int32)
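
The field list above can be sketched as Python types together with a simple consistency check. This is an illustration under assumptions: it assumes that `text` and `answer_start` are stored as parallel lists (one entry per annotated answer), as is common for SQuAD-style data, and that `answer_start` is a character offset into `context`. The record `HYPOTHETICAL_EXAMPLE` is invented for illustration and is not taken from the dataset.

```python
from typing import List, TypedDict


class Answers(TypedDict):
    text: List[str]           # answer strings (assumed: one per annotator)
    answer_start: List[int]   # assumed: character offsets into `context`


class Example(TypedDict):
    id: str
    title: str
    context: str
    question: str
    answers: Answers


def answers_match_context(example: Example) -> bool:
    """Return True if every answer text occurs in the context at its offset."""
    context = example["context"]
    texts = example["answers"]["text"]
    starts = example["answers"]["answer_start"]
    if len(texts) != len(starts):
        return False
    return all(
        start >= 0 and context[start:start + len(text)] == text
        for text, start in zip(texts, starts)
    )


# Invented record, for illustration only (not real dataset content).
HYPOTHETICAL_EXAMPLE: Example = {
    "id": "example-0",
    "title": "Example",
    "context": "The cat sat on the mat.",
    "question": "Where did the cat sit?",
    "answers": {"text": ["on the mat"], "answer_start": [12]},
}
```

A check like this is useful for extractive QA because the answer is expected to be a span of the context, so a mismatch between offset and text usually points to a preprocessing problem.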

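Dataset creation:

  • Annotation creators: crowdsourced
  • Language creators: crowdsourced and found

License information:

  • All datasets are distributed under the CC BY 4.0 license.

Citation information: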

@InProceedings{pmlr-v119-miller20a,
  title = {The Effect of Natural Distribution Shift on Question Answering Models},
  author = {Miller, John and Krauth, Karl and Recht, Benjamin and Schmidt, Ludwig},
  booktitle = {Proceedings of the 37th International Conference on Machine Learning},
  pages = {6905--6916},
  year = {2020},
  editor = {III, Hal Daumé and Singh, Aarti},
  volume = {119},
  series = {Proceedings of Machine Learning Research},
  month = {13--18 Jul},
  publisher = {PMLR},
  pdf = {http://proceedings.mlr.press/v119/miller20a/miller20a.pdf},
  url = {https://proceedings.mlr.press/v119/miller20a.html},
}
