usc-isi/WikiConvert
Dataset Overview
- Name: Wiki-Convert
- Language: English (en-US)
- License: MIT
- Multilinguality: monolingual
- Size: 100K<n<1M
- Source: extended from Wikipedia
- Task categories:
  - fill-mask
  - other
  - text-generation
- Task IDs:
  - language-modeling
  - masked-language-modeling
- Pretty name: Wiki-Convert
- Tags:
  - numeracy
  - natural-language-understanding
  - tokenization
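Dataset Description
- Summary: Wiki-Convert is a dataset of over 900,000 sentences from English Wikipedia with precise number annotations. It relies on annotations made by Wikipedia contributors in the form of {{Convert}} templates.
- Supported tasks:
  - sequence-modeling: the dataset can be used to train language models; success on this task is typically measured by achieving low perplexity.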
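Since the sequence-modeling task is scored by perplexity, a minimal sketch of that metric may help. It assumes you already have per-token natural-log probabilities from some language model; the function name `perplexity` and the toy inputs are illustrative and are not part of the dataset or its accompanying code.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp of the negative mean natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_log_prob = sum(token_log_probs) / len(token_log_probs)
    return math.exp(-mean_log_prob)


# Toy input (assumption): a model that assigns probability 0.25 to every token.
uniform_log_probs = [math.log(0.25)] * 8
uniform_ppl = perplexity(uniform_log_probs)
print(f"perplexity: {uniform_ppl:.3f}")
```

A model that spreads probability uniformly over four choices per token has a perplexity of about 4, so lower values mean the model is less "surprised" by the text.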
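Dataset Structure
- Data instances: each line of the JSON file contains metadata about the source Wikipedia sentence together with the annotation of a single number, e.g. number: 10. The annotation locates the number in the sentence by its length and offset.
- Example:
  { id: 1080801, UNIQUE_STORY_INDEX: 1080801, offset: 83, length: 2, magnitude: 0, comment: "Like all Type UB III submarines, UB-117 carried 10 torpedoes and was armed with a 10 cms deck gun. ", number: 10 }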
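The offset and length fields can be used to recover or mask the annotated number. The sketch below assumes that `offset` is a 0-based character index into `comment` and that each line of the file is valid JSON; the card does not state either convention, so check a few records against their `number` field before relying on it. The record here is hypothetical and only reuses the field names from the example above.

```python
import json


def extract_number_span(record):
    """Return the characters of `comment` addressed by `offset` and `length`."""
    start = record["offset"]
    return record["comment"][start:start + record["length"]]


def mask_number(record, mask_token="[MASK]"):
    """Replace the annotated number with a mask token (e.g. for fill-mask)."""
    start = record["offset"]
    end = start + record["length"]
    return record["comment"][:start] + mask_token + record["comment"][end:]


# Hypothetical JSON line, not taken from the dataset.
line = '{"id": 1, "offset": 13, "length": 2, "comment": "The hull was 10 cm thick.", "number": 10}'
record = json.loads(line)

span = extract_number_span(record)
masked = mask_number(record)
print(span)
print(masked)

# Sanity check: the extracted span should parse to the annotated number.
span_matches = float(span) == float(record["number"])
```

If `span_matches` is false for real records, the offset convention (or whitespace handling in the text) differs from the assumption made here.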
Data Splits
| | Train | Dev | Test |
|---|---|---|---|
| Sentences | 739,583 | 92,447 | 92,449 |
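As a quick check on the table, the split sizes can be summed and turned into proportions. This is plain arithmetic on the figures above; the dictionary keys are just labels chosen for this sketch.

```python
# Sentence counts copied from the data splits table.
split_sizes = {"train": 739_583, "dev": 92_447, "test": 92_449}

total = sum(split_sizes.values())
fractions = {name: count / total for name, count in split_sizes.items()}

print(f"total sentences: {total:,}")
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.1%}")
```

The total is consistent with the "over 900,000 sentences" stated in the description, and the proportions correspond to roughly an 80/10/10 split.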
License
- License: MIT License
Citation Information
@inproceedings{thawani-etal-2021-numeracy,
    title = "Numeracy enhances the Literacy of Language Models",
    author = "Thawani, Avijit and Pujara, Jay and Ilievski, Filip",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.557",
    pages = "6960--6967",
    abstract = "Specialized number representations in NLP have shown improvements on numerical reasoning tasks like arithmetic word problems and masked number prediction. But humans also use numeracy to make better sense of world concepts, e.g., you can seat 5 people in your {`}room{'} but not 500. Does a better grasp of numbers improve a model{'}s understanding of other concepts and words? This paper studies the effect of using six different number encoders on the task of masked word prediction (MWP), as a proxy for evaluating literacy. To support this investigation, we develop Wiki-Convert, a 900,000 sentence dataset annotated with numbers and units, to avoid conflating nominal and ordinal number occurrences. We find a significant improvement in MWP for sentences containing numbers, that exponent embeddings are the best number encoders, yielding over 2 points jump in prediction accuracy over a BERT baseline, and that these enhanced literacy skills also generalize to contexts without annotated numbers. We release all code at https://git.io/JuZXn.",
}



