RomanCast/WikiSpell_custom

Name: RomanCast/WikiSpell_custom
Creator: RomanCast
Published: 2023-07-25 12:59:58
License: MIT

Source: Hugging Face. Updated 2023-07-25. Indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/RomanCast/WikiSpell_custom

Resource description:

---
license: mit
dataset_info:
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 129624
    num_examples: 10000
  - name: validation_top1
    num_bytes: 10754
    num_examples: 1000
  - name: test_top1
    num_bytes: 10948
    num_examples: 1000
  - name: validation_1_10
    num_bytes: 11618
    num_examples: 1000
  - name: test_1_10
    num_bytes: 11692
    num_examples: 1000
  - name: validation_10_20
    num_bytes: 13401
    num_examples: 1000
  - name: test_10_20
    num_bytes: 13450
    num_examples: 1000
  - name: validation_20_30
    num_bytes: 15112
    num_examples: 1000
  - name: test_20_30
    num_bytes: 15069
    num_examples: 1000
  - name: validation_bottom50
    num_bytes: 15204
    num_examples: 1000
  - name: test_bottom50
    num_bytes: 15076
    num_examples: 1000
  download_size: 241234
  dataset_size: 261948
language:
- en
viewer: true
task_categories:
- text-generation
size_categories:
- 1K<n<10K
---

# WikiSpell

## Description

This dataset is a **custom implementation** of the WikiSpell dataset introduced in [Character-Aware Models Improve Visual Text Rendering](https://arxiv.org/pdf/2212.10562.pdf) by Liu et al. (2022).

As in the original WikiSpell dataset, the training set is composed of 5000 words sampled uniformly from the 50% least common Wiktionary words (taken from [this Wiktionary extraction](https://kaikki.org/dictionary/rawdata.html)), and 5000 words sampled according to their frequencies from the 50% most common Wiktionary words.

The validation and test sets are each split into 5 subsets, sampled according to word frequency in the corpus:

- 1% most common words
- 1 - 10% most common words
- 10 - 20% most common words
- 20 - 30% most common words
- 50% least common words

Unlike the original WikiSpell dataset, we compute word frequencies using the first 100k sentences from OpenWebText ([Skylion007/openwebtext](https://huggingface.co/datasets/Skylion007/openwebtext)) instead of mC4.

## Usage

This dataset is used for testing spelling in Large Language Models. To do so, the labels should be computed as in the following snippet:

```python
from datasets import load_dataset
ds = load_dataset("RomanCast/WikiSpell_custom")  # loads all splits
sample = ds["train"][0]
label = " ".join(sample["text"])  # the word's characters separated by spaces
```

**The labels are not included in the dataset files directly.**

## Citation

Please cite the original paper introducing WikiSpell if you're using this dataset:

```
@inproceedings{liu-etal-2023-character,
    title = "Character-Aware Models Improve Visual Text Rendering",
    author = "Liu, Rosanne and Garrette, Dan and Saharia, Chitwan and Chan, William and Roberts, Adam and Narang, Sharan and Blok, Irina and Mical, Rj and Norouzi, Mohammad and Constant, Noah",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-long.900",
    pages = "16270--16297",
}
```

Provider:

RomanCast

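Summary of original information

Dataset overview

Basic information

Name: WikiSpell
License: MIT
Language: English (en)
Viewer: enabled
Task category: text generation
Size category: 1K<n<10K

Data features

Features:
- text (string)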

Data splits

Training set:
- Examples: 10000
- Bytes: 129624
Validation sets:
- validation_top1: 1000 examples, 10754 bytes
- validation_1_10: 1000 examples, 11618 bytes
- validation_10_20: 1000 examples, 13401 bytes
- validation_20_30: 1000 examples, 15112 bytes
- validation_bottom50: 1000 examples, 15204 bytes
Test sets:
- test_top1: 1000 examples, 10948 bytes
- test_1_10: 1000 examples, 11692 bytes
- test_10_20: 1000 examples, 13450 bytes
- test_20_30: 1000 examples, 15069 bytes
- test_bottom50: 1000 examples, 15076 bytes
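
The split names follow a regular `<kind>_<bucket>` pattern, so they can be generated rather than typed out. The sketch below is illustrative only: `FREQUENCY_BUCKETS`, `eval_split_names`, and `load_eval_splits` are hypothetical helper names, and the loader assumes the Hugging Face `datasets` library is installed and the repository is reachable.

```python
# Frequency buckets as they appear in the split names of the dataset card.
FREQUENCY_BUCKETS = ["top1", "1_10", "10_20", "20_30", "bottom50"]


def eval_split_names(kind):
    """Return the five split names for kind = 'validation' or 'test'."""
    if kind not in ("validation", "test"):
        raise ValueError(f"unknown split kind: {kind!r}")
    return [f"{kind}_{bucket}" for bucket in FREQUENCY_BUCKETS]


def load_eval_splits(kind):
    """Load each evaluation split by name (needs `datasets` and network access)."""
    from datasets import load_dataset  # imported lazily; optional dependency

    return {
        name: load_dataset("RomanCast/WikiSpell_custom", split=name)
        for name in eval_split_names(kind)
    }


print(eval_split_names("validation"))
```

Reporting results per bucket, rather than pooled, keeps the frequency effect visible.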

Data size

Download size: 241234 bytes
Dataset size: 261948 bytes
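
As a quick consistency check, the per-split byte counts listed in the metadata can be summed and compared with the reported dataset size. The dictionary below simply copies the figures from the card; the variable names are illustrative.

```python
# Per-split byte counts copied from the dataset card metadata.
SPLIT_BYTES = {
    "train": 129624,
    "validation_top1": 10754,
    "test_top1": 10948,
    "validation_1_10": 11618,
    "test_1_10": 11692,
    "validation_10_20": 13401,
    "test_10_20": 13450,
    "validation_20_30": 15112,
    "test_20_30": 15069,
    "validation_bottom50": 15204,
    "test_bottom50": 15076,
}

# 10000 training examples plus ten evaluation splits of 1000 examples each.
total_examples = 10000 + 1000 * (len(SPLIT_BYTES) - 1)
total_bytes = sum(SPLIT_BYTES.values())

print(total_bytes)     # 261948, matching the reported dataset_size
print(total_examples)  # 20000
```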

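Data source and processing

The training set contains 5,000 words drawn uniformly from the 50% least common Wiktionary words and 5,000 words sampled by frequency from the 50% most common Wiktionary words.
The validation and test sets are each divided into 5 subsets according to word frequency in the corpus.
Word frequencies are computed from the first 100k sentences of OpenWebText rather than mC4.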
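
The sampling scheme described above can be sketched as follows. This is not the author's code: the function names, the tie-breaking rule, the use of sampling with replacement for the frequent half, and the toy counts are all assumptions made for illustration. It also assumes the frequent half has at least one non-zero count.

```python
import random


def rank_words(word_counts):
    """Order words from most to least frequent (ties broken alphabetically)."""
    return sorted(word_counts, key=lambda w: (-word_counts[w], w))


def percentile_band(ranked, lo_pct, hi_pct):
    """Slice a ranked list to the [lo_pct, hi_pct) percentile band."""
    n = len(ranked)
    return ranked[n * lo_pct // 100 : n * hi_pct // 100]


def sample_training_words(word_counts, n_per_half, seed=0):
    """Return (common, rare) samples from the top and bottom frequency halves."""
    rng = random.Random(seed)
    ranked = rank_words(word_counts)
    top = percentile_band(ranked, 0, 50)
    bottom = percentile_band(ranked, 50, 100)
    # Frequent half: weight each word by its corpus count.
    # Assumption: sampling with replacement, so duplicates are possible.
    weights = [word_counts[w] for w in top]
    common = rng.choices(top, weights=weights, k=n_per_half)
    # Rare half: uniform sampling without replacement.
    rare = rng.sample(bottom, n_per_half)
    return common, rare


# Toy counts for illustration only; the real counts come from OpenWebText.
TOY_COUNTS = {
    "the": 50, "of": 40, "and": 30, "cat": 20, "dog": 10,
    "axolotl": 3, "quokka": 2, "zyzzyva": 1, "syzygy": 1, "pneuma": 1,
}
common, rare = sample_training_words(TOY_COUNTS, n_per_half=3)
print(common, rare)
```

Under the same assumptions, the evaluation bands would correspond to `percentile_band(ranked, 0, 1)`, `(1, 10)`, `(10, 20)`, `(20, 30)`, and `(50, 100)`.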

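Usage notes

The dataset is used to test the spelling ability of large language models.
Labels are not included in the dataset files; they must be generated with a code snippet.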
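
A minimal sketch of label generation and scoring follows. The label rule (characters joined by single spaces) comes from the dataset card. The exact-match metric and the helper names `make_label` and `exact_match_accuracy` are assumptions for illustration, since the card does not prescribe a scoring method.

```python
def make_label(word):
    """Spell a word out as its characters separated by single spaces."""
    return " ".join(word)


def exact_match_accuracy(words, predictions):
    """Fraction of predictions that equal the spelled-out label exactly.

    Assumption: surrounding whitespace in a prediction is ignored.
    """
    if not words or len(words) != len(predictions):
        raise ValueError("words and predictions must be non-empty and equal in length")
    correct = sum(
        pred.strip() == make_label(word)
        for word, pred in zip(words, predictions)
    )
    return correct / len(words)


print(make_label("cat"))  # c a t
print(exact_match_accuracy(["cat", "dog"], ["c a t", "d o g g"]))  # 0.5
```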

Citation

When using this dataset, please cite the original WikiSpell paper.
