RomanCast/WikiSpell_custom
Source: Hugging Face. Updated 2023-07-25; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/RomanCast/WikiSpell_custom
Resource description:
---
license: mit
dataset_info:
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 129624
    num_examples: 10000
  - name: validation_top1
    num_bytes: 10754
    num_examples: 1000
  - name: test_top1
    num_bytes: 10948
    num_examples: 1000
  - name: validation_1_10
    num_bytes: 11618
    num_examples: 1000
  - name: test_1_10
    num_bytes: 11692
    num_examples: 1000
  - name: validation_10_20
    num_bytes: 13401
    num_examples: 1000
  - name: test_10_20
    num_bytes: 13450
    num_examples: 1000
  - name: validation_20_30
    num_bytes: 15112
    num_examples: 1000
  - name: test_20_30
    num_bytes: 15069
    num_examples: 1000
  - name: validation_bottom50
    num_bytes: 15204
    num_examples: 1000
  - name: test_bottom50
    num_bytes: 15076
    num_examples: 1000
  download_size: 241234
  dataset_size: 261948
language:
- en
viewer: true
task_categories:
- text-generation
size_categories:
- 1K<n<10K
---
# WikiSpell
## Description
This dataset is a **custom implementation** of the WikiSpell dataset introduced in [Character-Aware Models Improve Visual Text Rendering](https://arxiv.org/pdf/2212.10562.pdf) by Liu et al. (2022).
As in the original WikiSpell dataset, the training set contains 5,000 words sampled uniformly from the 50% least common Wiktionary words (taken from [this Wiktionary extraction](https://kaikki.org/dictionary/rawdata.html)) and 5,000 words sampled according to their frequencies from the 50% most common Wiktionary words.
The validation and test sets are each split into 5 subsets, sampled according to word frequency in the corpus:
- 1% most common words
- 1 - 10% most common words
- 10 - 20% most common words
- 20 - 30% most common words
- 50% least common words
Unlike the original WikiSpell dataset, we compute word frequencies from the first 100k sentences of OpenWebText ([Skylion007/openwebtext](https://huggingface.co/datasets/Skylion007/openwebtext)) instead of mC4.
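
The sampling scheme described above can be sketched roughly as follows. This is an illustrative reconstruction, not the script used to build the dataset: the helper names (`rank_words`, `percentile_slice`, `sample_train`), the alphabetical tie-breaking, the `+1` weight smoothing, and sampling with replacement for the frequency-weighted half are all assumptions.

```python
import random
from collections import Counter

# Frequency buckets for the validation/test subsets, as integer percent
# ranges over the frequency-ranked vocabulary (most common first).
BUCKETS = {
    "top1": (0, 1),
    "1_10": (1, 10),
    "10_20": (10, 20),
    "20_30": (20, 30),
    "bottom50": (50, 100),
}

def rank_words(vocab, corpus_tokens):
    """Order vocabulary words from most to least frequent in the corpus."""
    counts = Counter(corpus_tokens)
    # Words absent from the corpus count as 0; ties break alphabetically
    # (an assumption, the card does not say how ties are handled).
    ranked = sorted(vocab, key=lambda w: (-counts[w], w))
    return ranked, counts

def percentile_slice(ranked, lo_pct, hi_pct):
    """Return the words whose rank falls in the [lo_pct, hi_pct) percent range."""
    n = len(ranked)
    return ranked[n * lo_pct // 100 : n * hi_pct // 100]

def sample_train(ranked, counts, n_each, seed=0):
    """Sample n_each words uniformly from the rare half and n_each by frequency."""
    rng = random.Random(seed)
    top = percentile_slice(ranked, 0, 50)
    bottom = percentile_slice(ranked, 50, 100)
    # Uniform sample without replacement from the 50% least common words.
    uniform = rng.sample(bottom, n_each)
    # Frequency-weighted sample from the 50% most common words.
    # Assumptions: +1 smoothing, and sampling with replacement.
    weights = [counts[w] + 1 for w in top]
    weighted = rng.choices(top, weights=weights, k=n_each)
    return uniform + weighted
```

With a Wiktionary word list as `vocab` and tokenized OpenWebText sentences as `corpus_tokens`, `sample_train(ranked, counts, n_each=5000)` would yield a 10,000-word training set of the shape described above, and `percentile_slice(ranked, *BUCKETS["top1"])` would give the pool for the top-1% subsets.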
## Usage
This dataset is intended for testing the spelling ability of large language models. Each label is the word's characters separated by spaces, and should be computed as in the following snippet:
```python
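from datasets import load_dataset

ds = load_dataset("RomanCast/WikiSpell_custom")  # DatasetDict keyed by split name
sample = ds["train"][0]
label = " ".join(sample["text"])  # space-separated characters of the word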
```
**The labels are not included in the dataset files directly.**
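
Since the labels have to be derived from the `text` field, a small helper can both build them and score model outputs. The sketch below assumes exact string match as the metric, which this card does not specify; `spelling_label` and `exact_match_accuracy` are hypothetical names introduced here for illustration.

```python
def spelling_label(word: str) -> str:
    """Space-separate the characters of a word, e.g. "cat" -> "c a t"."""
    return " ".join(word)

def exact_match_accuracy(words, predictions):
    """Fraction of predictions equal to the expected spelling.

    Assumption: a prediction counts as correct only if it matches the
    label exactly after stripping surrounding whitespace.
    """
    if len(words) != len(predictions):
        raise ValueError("words and predictions must have the same length")
    if not words:
        return 0.0
    hits = sum(
        spelling_label(word) == prediction.strip()
        for word, prediction in zip(words, predictions)
    )
    return hits / len(words)
```

For a loaded split, `words` would be the split's `text` column and `predictions` the model's generated strings in the same order.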
## Citation
Please cite the original paper introducing WikiSpell if you're using this dataset:
```
@inproceedings{liu-etal-2023-character,
    title = "Character-Aware Models Improve Visual Text Rendering",
    author = "Liu, Rosanne and
      Garrette, Dan and
      Saharia, Chitwan and
      Chan, William and
      Roberts, Adam and
      Narang, Sharan and
      Blok, Irina and
      Mical, Rj and
      Norouzi, Mohammad and
      Constant, Noah",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-long.900",
    pages = "16270--16297",
}
```
Provider:
RomanCast
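Summary of Original Information
Dataset Overview
Basic Information
- Name: WikiSpell
- License: MIT
- Language: English (en)
- Viewer: enabled
- Task category: text generation
- Size category: 1K<n<10K
Data Features
- Features: `text` (string)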
Data Splits
- Training set:
  - Examples: 10000
  - Bytes: 129624
- Validation sets:
  - validation_top1: 1000 examples, 10754 bytes
  - validation_1_10: 1000 examples, 11618 bytes
  - validation_10_20: 1000 examples, 13401 bytes
  - validation_20_30: 1000 examples, 15112 bytes
  - validation_bottom50: 1000 examples, 15204 bytes
- Test sets:
  - test_top1: 1000 examples, 10948 bytes
  - test_1_10: 1000 examples, 11692 bytes
  - test_10_20: 1000 examples, 13450 bytes
  - test_20_30: 1000 examples, 15069 bytes
  - test_bottom50: 1000 examples, 15076 bytes
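
Because the evaluation splits follow a regular `<part>_<bucket>` naming pattern, their names can be generated rather than typed out. The following is a minimal sketch; `evaluation_split_names` and `load_evaluation_splits` are hypothetical helpers, and the loader assumes the `datasets` library is installed and the repository is reachable.

```python
PARTS = ("validation", "test")
BUCKETS = ("top1", "1_10", "10_20", "20_30", "bottom50")

def evaluation_split_names():
    """Build the ten evaluation split names, e.g. "validation_top1"."""
    return [f"{part}_{bucket}" for part in PARTS for bucket in BUCKETS]

def load_evaluation_splits(repo_id="RomanCast/WikiSpell_custom"):
    """Load every evaluation split into a dict keyed by split name."""
    # Imported lazily so the name helper works without `datasets` installed.
    from datasets import load_dataset

    return {
        name: load_dataset(repo_id, split=name)
        for name in evaluation_split_names()
    }
```

Keeping the splits in a dict makes it straightforward to report spelling accuracy per frequency bucket.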
Data Size
- Download size: 241234 bytes
- Dataset size: 261948 bytes
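
As a quick sanity check, the per-split figures in the metadata can be summed and compared with the reported dataset size. The numbers below are copied from the dataset card; the variable names are illustrative only.

```python
# (num_bytes, num_examples) per split, copied from the dataset card metadata.
SPLITS = {
    "train": (129624, 10000),
    "validation_top1": (10754, 1000),
    "test_top1": (10948, 1000),
    "validation_1_10": (11618, 1000),
    "test_1_10": (11692, 1000),
    "validation_10_20": (13401, 1000),
    "test_10_20": (13450, 1000),
    "validation_20_30": (15112, 1000),
    "test_20_30": (15069, 1000),
    "validation_bottom50": (15204, 1000),
    "test_bottom50": (15076, 1000),
}

total_bytes = sum(num_bytes for num_bytes, _ in SPLITS.values())
total_examples = sum(num_examples for _, num_examples in SPLITS.values())

print(total_bytes)     # 261948, matching the reported dataset size
print(total_examples)  # 20000 across all eleven splits
```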
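Data Source and Processing
- The training set contains 5,000 words from the 50% least common Wiktionary words and 5,000 words sampled by frequency from the 50% most common words.
- The validation and test sets are each divided into 5 subsets according to word frequency in the corpus.
- Frequencies are computed from the first 100k sentences of OpenWebText rather than mC4.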
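Usage Notes
- The dataset is used to test the spelling ability of large language models.
- Labels are not included directly in the dataset files and must be generated with a code snippet.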
Citation
- When using this dataset, please cite the original WikiSpell paper.



