
usc-isi/WikiConvert

Source: Hugging Face | Updated: 2022-10-24 | Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/usc-isi/WikiConvert
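Resource summary:
Wiki-Convert is a dataset of more than 900,000 sentences drawn from English Wikipedia, each carrying a precise number annotation. It is intended mainly for language modeling, in particular masked language modeling. The annotations come from Wikipedia contributors, specifically their use of the {{Convert}} template. Each record holds metadata for one sentence plus the annotation of a single number, located by a length and an offset. The data is split into training, development, and test sets.
Provider:
usc-isi
Summary of original information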

Dataset overview

  • Name: Wiki-Convert
  • Language: English (en-US)
  • License: MIT
  • Multilinguality: monolingual
  • Size: 100K<n<1M
  • Source: extended from Wikipedia
  • Task categories:
    • fill-mask
    • other
    • text-generation
  • Task IDs:
    • language-modeling
    • masked-language-modeling
  • Pretty name: Wiki-Convert
  • Tags:
    • numeracy
    • natural-language-understanding
    • tokenization

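Dataset description

  • Overview: Wiki-Convert is a dataset of more than 900,000 sentences from English Wikipedia with precise number annotations. It relies on annotations made by Wikipedia contributors in the form of {{Convert}} templates.
  • Supported tasks:
    • sequence-modeling: used to train language models; success is typically measured by low perplexity.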
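
Since perplexity is the success measure named above, a minimal sketch of how it is usually computed may help. It assumes you already have the probability a model assigned to each reference token; the function name `perplexity` and the sample probabilities are illustrative and not part of the dataset or its tooling.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(mean negative log-probability of the reference tokens)."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)

# Hypothetical per-token probabilities from two models on the same sentence.
confident_model = [0.9, 0.8, 0.7]
uncertain_model = [0.2, 0.1, 0.3]

# The model that assigns higher probability to the reference tokens
# gets the lower (better) perplexity.
print(perplexity(confident_model) < perplexity(uncertain_model))
```

A model that spreads its probability uniformly over four choices at every step has a perplexity of 4, which is why the value is often read as an effective branching factor.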

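Dataset structure

  • Data instances:
    • Each line of a JSON file contains metadata for the source Wikipedia sentence together with the annotation of a single number, e.g. number: 10. The annotated number is located in the sentence by its length and offset.

    • Example:

      { "id": 1080801, "UNIQUE_STORY_INDEX": 1080801, "offset": 83, "length": 2, "magnitude": 0, "comment": "Like all Type UB III submarines, UB-117 carried 10 torpedoes and was armed with a  10 cms deck gun. ", "number": 10 }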
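
The offset and length fields can be used to recover or mask the annotated number. The sketch below is illustrative: it assumes offset is a 0-based character index into comment (which is consistent with the example record), and the helper names `annotated_span` and `mask_number` as well as the BERT-style `[MASK]` token are assumptions, not part of the dataset.

```python
import json

# The example record, with keys quoted so it parses as JSON.
# Note the double space before the second "10"; the offset depends on it.
line = (
    '{"id": 1080801, "UNIQUE_STORY_INDEX": 1080801, "offset": 83, '
    '"length": 2, "magnitude": 0, '
    '"comment": "Like all Type UB III submarines, UB-117 carried 10 '
    'torpedoes and was armed with a  10 cms deck gun. ", "number": 10}'
)

def annotated_span(record):
    """Return the characters of the sentence covered by the annotation."""
    start = record["offset"]
    return record["comment"][start:start + record["length"]]

def mask_number(record, mask_token="[MASK]"):
    """Replace only the annotated number, leaving other numbers untouched."""
    start = record["offset"]
    end = start + record["length"]
    text = record["comment"]
    return text[:start] + mask_token + text[end:]

record = json.loads(line)
print(annotated_span(record))  # 10
print(mask_number(record))
```

The sentence contains "10" twice. The offset points at the second occurrence (the one with a unit, "10 cms"), so the count of torpedoes is left unmasked.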

Data splits

             Train     Dev      Test
Sentences    739,583   92,447   92,449
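
As a quick sanity check on the split sizes, the short sketch below sums the three counts and computes each split's share. The dictionary keys are illustrative labels, not necessarily the split names used in the released files.

```python
# Sentence counts per split, taken from the table above.
splits = {"train": 739_583, "dev": 92_447, "test": 92_449}

total = sum(splits.values())
shares = {name: count / total for name, count in splits.items()}

print(total)  # 924479
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The total of 924,479 sentences matches the "more than 900,000" figure in the description, and the shares work out to roughly 80/10/10.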

License

  • License: MIT License

Citation

@inproceedings{thawani-etal-2021-numeracy,
    title = "Numeracy enhances the Literacy of Language Models",
    author = "Thawani, Avijit and Pujara, Jay and Ilievski, Filip",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.557",
    pages = "6960--6967",
    abstract = "Specialized number representations in NLP have shown improvements on numerical reasoning tasks like arithmetic word problems and masked number prediction. But humans also use numeracy to make better sense of world concepts, e.g., you can seat 5 people in your {`}room{'} but not 500. Does a better grasp of numbers improve a model{'}s understanding of other concepts and words? This paper studies the effect of using six different number encoders on the task of masked word prediction (MWP), as a proxy for evaluating literacy. To support this investigation, we develop Wiki-Convert, a 900,000 sentence dataset annotated with numbers and units, to avoid conflating nominal and ordinal number occurrences. We find a significant improvement in MWP for sentences containing numbers, that exponent embeddings are the best number encoders, yielding over 2 points jump in prediction accuracy over a BERT baseline, and that these enhanced literacy skills also generalize to contexts without annotated numbers. We release all code at https://git.io/JuZXn.",
}
