Mistake-To-Meaning
ModelScope community. Updated 2025-12-05; indexed 2025-12-06.
Download link:
https://modelscope.cn/datasets/ProCreations/Mistake-To-Meaning
Description:
# Clear Spelling Dataset
## Overview
The **Mistake to Meaning** (M2M) dataset is a synthetic collection of **100,000 unique English spelling mistakes paired with their correct forms**, intended for training typo-correction and spell-checking models. It covers error types that occur frequently in real-world text, including:
- Keyboard adjacency typos
- Letter swaps and omissions
- Duplicate characters
- Phonetic substitution errors
- Commonly confused homophones (e.g., "their" vs. "there")
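To make the character-level categories above concrete, here is a minimal sketch of how such errors could be produced. It is not the generator used to build this dataset, which the card does not describe. The function names and the partial `ADJACENT` keyboard map are hypothetical and chosen only for illustration.

```python
# Hypothetical, partial QWERTY neighbor map (illustration only).
ADJACENT = {"a": "qwsz", "e": "wsdr", "h": "gyujnb", "t": "rfgy", "o": "iklp"}


def adjacent_typo(word: str, i: int, pick: int = 0) -> str:
    """Replace the character at index i with one of its keyboard neighbors."""
    neighbors = ADJACENT.get(word[i], "")
    if not neighbors:
        return word  # no known neighbor: leave the word unchanged
    return word[:i] + neighbors[pick % len(neighbors)] + word[i + 1:]


def swap(word: str, i: int) -> str:
    """Transpose the characters at positions i and i + 1."""
    if i + 1 >= len(word):
        return word
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def omit(word: str, i: int) -> str:
    """Drop the character at index i."""
    return word[:i] + word[i + 1:]


def duplicate(word: str, i: int) -> str:
    """Repeat the character at index i."""
    return word[:i + 1] + word[i] + word[i + 1:]


print(swap("the", 1))         # teh
print(omit("which", 1))       # wich
print(duplicate("until", 4))  # untill
print(adjacent_typo("the", 0))  # rhe
```

Phonetic substitutions and homophone confusions are not covered by this sketch, since they need word-level knowledge rather than character edits.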
## Dataset Format
The dataset is provided in **CSV format** with two clearly defined columns:
| Column | Description | Example |
|----------|---------------------------------------------|---------------------|
| `error` | The misspelled or incorrect word or phrase | "teh" |
| `correct`| The correct word or intended phrase | "the" |
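A minimal loading sketch using Python's standard `csv` module follows. It assumes the file has a header row naming the `error` and `correct` columns. The card does not state the actual filename, so an in-memory sample stands in for the real file; the second sample row is made up for illustration.

```python
import csv
import io

# Stand-in for the real CSV file; replace with open(<path>, newline="", encoding="utf-8").
# Assumption: the file starts with a header row "error,correct".
SAMPLE = "error,correct\nteh,the\nrecieve,receive\n"


def load_pairs(handle):
    """Read (error, correct) tuples from an open CSV file handle."""
    reader = csv.DictReader(handle)
    return [(row["error"], row["correct"]) for row in reader]


pairs = load_pairs(io.StringIO(SAMPLE))
print(pairs)  # [('teh', 'the'), ('recieve', 'receive')]
```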
## Usage
This dataset is ideal for:
- Training and fine-tuning **typo correction** models
- Benchmarking **spell-checking algorithms**
- Enhancing NLP model robustness to real-world noisy input
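For the benchmarking use case, one simple measure is exact-match accuracy over the (error, correct) pairs. The sketch below assumes a corrector is any function that maps a string to a string. `lookup_corrector` is a toy baseline invented here, not something shipped with the dataset.

```python
def accuracy(corrector, pairs):
    """Fraction of pairs for which corrector(error) equals the correct form."""
    if not pairs:
        return 0.0
    hits = sum(1 for error, correct in pairs if corrector(error) == correct)
    return hits / len(pairs)


# Toy baseline: a fixed lookup table that returns unknown inputs unchanged.
LOOKUP = {"teh": "the"}


def lookup_corrector(word):
    return LOOKUP.get(word, word)


sample_pairs = [("teh", "the"), ("wich", "which")]
print(accuracy(lookup_corrector, sample_pairs))  # 0.5
```

Exact match is strict. It gives no credit for near misses, so it may be worth reporting alongside an edit-distance measure when comparing models.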
## Quality Assurance
- **No duplicates:** Each (error, correct) pair is unique.
- **Hand-curated seed set:** Includes hundreds of common misspellings verified against real-world usage patterns.
- **Realistic noise generation:** Error transformations are designed to mimic genuine human typing behavior.
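The uniqueness claim can be checked after loading the data. This is a minimal sketch, assuming the rows are already available as a list of (error, correct) tuples; the function names are hypothetical. It also reports misspellings that map to more than one correction. Such rows are still unique pairs, but are worth knowing about before training.

```python
from collections import Counter, defaultdict


def duplicate_pairs(pairs):
    """Return each (error, correct) pair that appears more than once."""
    counts = Counter(pairs)
    return [pair for pair, n in counts.items() if n > 1]


def ambiguous_errors(pairs):
    """Map each error string to its corrections when it has more than one."""
    targets = defaultdict(set)
    for error, correct in pairs:
        targets[error].add(correct)
    return {e: sorted(c) for e, c in targets.items() if len(c) > 1}


rows = [("teh", "the"), ("teh", "the"), ("there", "their"), ("there", "they're")]
print(duplicate_pairs(rows))   # [('teh', 'the')]
print(ambiguous_errors(rows))  # {'there': ["their", "they're"]}
```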
## License (MIT)
This dataset is released under the permissive **MIT License**, which allows commercial and non-commercial use, distribution, and modification. Attribution is required.
## Citation
If you use this dataset in your research or projects, please provide attribution similar to:
```
This [your project type] uses the Mistake to Meaning dataset by ProCreations.
```
Enjoy training your typo-correction models!
Provider: maas
Created: 2025-08-20



