
RomanCast/WikiSpell_custom

Source: Hugging Face. Updated: 2023-07-25. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/RomanCast/WikiSpell_custom
Resource description:
---
license: mit
dataset_info:
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 129624
    num_examples: 10000
  - name: validation_top1
    num_bytes: 10754
    num_examples: 1000
  - name: test_top1
    num_bytes: 10948
    num_examples: 1000
  - name: validation_1_10
    num_bytes: 11618
    num_examples: 1000
  - name: test_1_10
    num_bytes: 11692
    num_examples: 1000
  - name: validation_10_20
    num_bytes: 13401
    num_examples: 1000
  - name: test_10_20
    num_bytes: 13450
    num_examples: 1000
  - name: validation_20_30
    num_bytes: 15112
    num_examples: 1000
  - name: test_20_30
    num_bytes: 15069
    num_examples: 1000
  - name: validation_bottom50
    num_bytes: 15204
    num_examples: 1000
  - name: test_bottom50
    num_bytes: 15076
    num_examples: 1000
  download_size: 241234
  dataset_size: 261948
language:
- en
viewer: true
task_categories:
- text-generation
size_categories:
- 1K<n<10K
---

# WikiSpell

## Description

This dataset is a **custom implementation** of the WikiSpell dataset introduced in [Character-Aware Models Improve Visual Text Rendering](https://arxiv.org/pdf/2212.10562.pdf) by Liu et al. (2022).

As in the original WikiSpell dataset, the training set is composed of 5000 words taken uniformly from the 50% least common Wiktionary words (taken from [this Wiktionary extraction](https://kaikki.org/dictionary/rawdata.html)), and 5000 words sampled according to their frequencies from the 50% most common Wiktionary words.

The validation and test data are each split into 5 sets, sampled according to word frequency in the corpus:

- 1% most common words
- 1 - 10% most common words
- 10 - 20% most common words
- 20 - 30% most common words
- 50% least common words

Unlike the original WikiSpell dataset, we compute word frequencies using the first 100k sentences from OpenWebText ([Skylion007/openwebtext](https://huggingface.co/datasets/Skylion007/openwebtext)) instead of mC4.

## Usage

This dataset is used for testing spelling in Large Language Models. To do so, the labels should be computed as in the following snippet:

```python
from datasets import load_dataset

ds = load_dataset("RomanCast/WikiSpell_custom")
sample = ds["train"][0]
label = " ".join(sample["text"])  # characters separated by spaces, e.g. "cat" -> "c a t"
```

**The labels are not included in the dataset files directly.**

## Citation

Please cite the original paper introducing WikiSpell if you're using this dataset:

```
@inproceedings{liu-etal-2023-character,
    title = "Character-Aware Models Improve Visual Text Rendering",
    author = "Liu, Rosanne and Garrette, Dan and Saharia, Chitwan and Chan, William and Roberts, Adam and Narang, Sharan and Blok, Irina and Mical, Rj and Norouzi, Mohammad and Constant, Noah",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-long.900",
    pages = "16270--16297",
}
```
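Provider:
RomanCast
Summary of original information

Dataset overview

Basic information

  • Name: WikiSpell
  • License: MIT
  • Language: English (en)
  • Viewer: enabled
  • Task category: text generation
  • Size category: 1K<n<10K

Data features

  • Features:
    • text (string)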

Data splits

  • Training set:
    • Number of examples: 10000
    • Bytes: 129624
  • Validation sets:
    • validation_top1: 1000 examples, 10754 bytes
    • validation_1_10: 1000 examples, 11618 bytes
    • validation_10_20: 1000 examples, 13401 bytes
    • validation_20_30: 1000 examples, 15112 bytes
    • validation_bottom50: 1000 examples, 15204 bytes
  • Test sets:
    • test_top1: 1000 examples, 10948 bytes
    • test_1_10: 1000 examples, 11692 bytes
    • test_10_20: 1000 examples, 13450 bytes
    • test_20_30: 1000 examples, 15069 bytes
    • test_bottom50: 1000 examples, 15076 bytes

Data size

  • Download size: 241234 bytes
  • Dataset size: 261948 bytes
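
As a quick consistency check on the figures above, the per-split byte counts can be summed and compared with the reported dataset size. The sketch below hard-codes the numbers from the dataset card metadata; the name `SPLITS` is only illustrative and is not part of the dataset.

```python
# Per-split (num_examples, num_bytes), copied from the dataset card metadata.
SPLITS = {
    "train": (10000, 129624),
    "validation_top1": (1000, 10754),
    "test_top1": (1000, 10948),
    "validation_1_10": (1000, 11618),
    "test_1_10": (1000, 11692),
    "validation_10_20": (1000, 13401),
    "test_10_20": (1000, 13450),
    "validation_20_30": (1000, 15112),
    "test_20_30": (1000, 15069),
    "validation_bottom50": (1000, 15204),
    "test_bottom50": (1000, 15076),
}

total_examples = sum(n for n, _ in SPLITS.values())
total_bytes = sum(b for _, b in SPLITS.values())

print(total_examples)  # 20000
print(total_bytes)     # 261948, equal to the reported dataset size
```

The byte counts add up exactly to the reported dataset size of 261948 bytes, across 11 splits and 20000 examples in total.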

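Data sources and processing

  • The training set contains 5000 words drawn uniformly from the 50% least common Wiktionary words and 5000 words sampled according to frequency from the 50% most common.
  • The validation and test sets are each divided into 5 parts according to word frequency in the corpus.
  • Word frequencies are computed from the first 100k sentences of OpenWebText rather than mC4.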
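
The frequency bands described above can be illustrated with a small sketch. This is not the author's construction code, which is not shown in the source. It assumes words are ranked by raw count in a tokenized corpus, that words absent from the corpus count as zero, and that band boundaries are cut by rank. The function names `frequency_buckets` and `sample_uniform` and the toy data are hypothetical.

```python
import random
from collections import Counter


def frequency_buckets(words, corpus_tokens):
    """Rank `words` by corpus frequency and slice them into the five bands."""
    counts = Counter(corpus_tokens)  # words missing from the corpus count as 0
    # Most frequent first; ties broken alphabetically so the result is deterministic.
    ranked = sorted(words, key=lambda w: (-counts[w], w))
    n = len(ranked)

    def band(lo_pct, hi_pct):
        # Words whose rank lies in [lo_pct, hi_pct) percent of the vocabulary.
        return ranked[n * lo_pct // 100 : n * hi_pct // 100]

    return {
        "top1": band(0, 1),
        "1_10": band(1, 10),
        "10_20": band(10, 20),
        "20_30": band(20, 30),
        "bottom50": band(50, 100),
    }


def sample_uniform(bucket, k, seed=0):
    """Draw up to k distinct words uniformly at random from one band."""
    rng = random.Random(seed)
    return rng.sample(bucket, min(k, len(bucket)))


# Toy data (assumption): 100 synthetic words, where w000 is the most frequent.
vocab = [f"w{i:03d}" for i in range(100)]
corpus = [w for i, w in enumerate(vocab) for _ in range(100 - i)]

buckets = frequency_buckets(vocab, corpus)
print({name: len(ws) for name, ws in buckets.items()})
# {'top1': 1, '1_10': 9, '10_20': 10, '20_30': 10, 'bottom50': 50}
```

With a real vocabulary, `sample_uniform(buckets["bottom50"], 1000)` would draw one evaluation split of the size listed above. The frequency-weighted sampling used for half of the training set is left out of this sketch.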

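Usage notes

  • The dataset is used to test the spelling ability of large language models.
  • Labels are not included directly in the dataset files; they must be generated with the snippet from the dataset card (`" ".join(sample["text"])`).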
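
A minimal sketch of how the labels might be generated and used for scoring. The label rule (characters joined by single spaces) comes from the dataset card. The exact-match metric and the helper names `spelling_label` and `exact_match_accuracy` are assumptions for illustration, not something the dataset prescribes.

```python
def spelling_label(word: str) -> str:
    """Return the target spelling: the word's characters separated by spaces."""
    return " ".join(word)


def exact_match_accuracy(words, predictions):
    """Fraction of predictions equal to the expected spelling (assumed metric)."""
    if len(words) != len(predictions):
        raise ValueError("words and predictions must have the same length")
    if not words:
        return 0.0
    # Surrounding whitespace in a prediction is ignored before comparing.
    hits = sum(spelling_label(w) == p.strip() for w, p in zip(words, predictions))
    return hits / len(words)


words = ["cat", "spell"]
predictions = ["c a t", "s p e l"]  # hypothetical model outputs

print(spelling_label("cat"))                     # c a t
print(exact_match_accuracy(words, predictions))  # 0.5
```

In practice `words` would be the `text` column of one of the evaluation splits, and `predictions` the model outputs for the same rows.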

Citation

  • When using this dataset, please cite the original WikiSpell paper.