usc-isi/WikiConvert

Name: usc-isi/WikiConvert
Creator: usc-isi
Published: 2022-10-24 17:40:43
License: MIT

Source: Hugging Face; updated 2022-10-24; indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/usc-isi/WikiConvert

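Summary:

Wiki-Convert is a dataset of more than 900,000 sentences drawn from English Wikipedia, each carrying a precise number annotation. It is intended mainly for language modeling, in particular masked language modeling. The annotations come from Wikipedia contributors, specifically their use of the {{Convert}} template. Each entry holds metadata about the source sentence plus the annotation of a single number, given as a length and an offset. The data is split into training, development, and test sets.

Provided by: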

usc-isi

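Summary of original information

Dataset overview

Name: Wiki-Convert
Language: English (en-US)
License: MIT
Multilinguality: monolingual
Size: 100K<n<1M
Source: extended from Wikipedia
Task categories:
- fill-mask
- other
- text-generation
Task IDs:
- language-modeling
- masked-language-modeling
Pretty name: Wiki-Convert
Tags: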
- numeracy
- natural-language-understanding
- tokenization

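Dataset description

Overview: Wiki-Convert is a dataset of more than 900,000 sentences from English Wikipedia with precise number annotations. It relies on annotations made by Wikipedia contributors in the form of {{Convert}} templates.
Supported tasks:
- sequence-modeling: used to train language models; success is typically measured by low perplexity.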
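Perplexity, the metric mentioned above, is the exponential of the average negative log-likelihood a model assigns to each token. The following is a minimal sketch of that calculation only; the function name `perplexity` and the toy input are illustrative assumptions and are not part of the dataset or its accompanying code.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities.

    Computed as exp(-mean(log p)); lower values mean the model found
    the text less surprising.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Toy input (assumption): a model that assigns probability 0.25 to
# every token has a perplexity of about 4.
uniform_log_probs = [math.log(0.25)] * 8
print(perplexity(uniform_log_probs))
```

A model that is uniformly unsure among four choices at every step scores about 4, which is why perplexity is often read as an effective branching factor.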

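Dataset structure

Data instances:
- Each line of the JSON files holds metadata about the source Wikipedia sentence along with the annotation of a single number, e.g. number: 10. The annotation is given as length and offset.
- Example: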
  
  {
    "id": 1080801,
    "UNIQUE_STORY_INDEX": 1080801,
    "offset": 83,
    "length": 2,
    "magnitude": 0,
    "comment": "Like all Type UB III submarines, UB-117 carried 10 torpedoes and was armed with a 10 cms deck gun. ",
    "number": 10
  }
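To illustrate how the length and offset fields could be used, here is a minimal sketch that recovers the annotated span and replaces it with a mask token for masked language modeling. It assumes that offset is a 0-based character index into comment and that length counts characters; neither is confirmed by this page. The helper names and the sample record are hypothetical, and whitespace in sentences copied from a web page may have been collapsed, so offsets should be checked against the raw files.

```python
import json


def annotated_span(record):
    """Return the characters of `comment` covered by offset/length.

    Assumption: `offset` is a 0-based character index.
    """
    start = record["offset"]
    return record["comment"][start:start + record["length"]]


def mask_number(record, mask_token="[MASK]"):
    """Replace the annotated number with a mask token."""
    start = record["offset"]
    end = start + record["length"]
    text = record["comment"]
    return text[:start] + mask_token + text[end:]


# Hypothetical record, not taken from the dataset.
line = ('{"offset": 14, "length": 3, '
        '"comment": "The bridge is 120 metres long.", "number": 120}')
record = json.loads(line)

print(annotated_span(record))
print(mask_number(record))
```

Comparing `annotated_span(record)` with the `number` field is a cheap sanity check before training on the masked text.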

Data splits

	Train	Dev	Test
Sentences	739,583	92,447	92,449
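As a quick check on the counts above, the sketch below sums the three splits and computes each split's share of the total. Only the counts come from the table; the variable names are illustrative.

```python
# Sentence counts copied from the split table.
splits = {"train": 739_583, "dev": 92_447, "test": 92_449}

total = sum(splits.values())
shares = {name: count / total for name, count in splits.items()}

print(f"total sentences: {total:,}")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The total comes to 924,479 sentences, consistent with the "more than 900,000" figure, and the shares work out to roughly 80% train, 10% dev, and 10% test.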

License

License: MIT License

Citation information

@inproceedings{thawani-etal-2021-numeracy,
    title = "Numeracy enhances the Literacy of Language Models",
    author = "Thawani, Avijit and Pujara, Jay and Ilievski, Filip",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.557",
    pages = "6960--6967",
    abstract = "Specialized number representations in NLP have shown improvements on numerical reasoning tasks like arithmetic word problems and masked number prediction. But humans also use numeracy to make better sense of world concepts, e.g., you can seat 5 people in your {`}room{'} but not 500. Does a better grasp of numbers improve a model{'}s understanding of other concepts and words? This paper studies the effect of using six different number encoders on the task of masked word prediction (MWP), as a proxy for evaluating literacy. To support this investigation, we develop Wiki-Convert, a 900,000 sentence dataset annotated with numbers and units, to avoid conflating nominal and ordinal number occurrences. We find a significant improvement in MWP for sentences containing numbers, that exponent embeddings are the best number encoders, yielding over 2 points jump in prediction accuracy over a BERT baseline, and that these enhanced literacy skills also generalize to contexts without annotated numbers. We release all code at https://git.io/JuZXn.",
}
