Mistake-To-Meaning

Name: Mistake-To-Meaning
Creator: maas
Published: 2025-12-05 16:47:21
License: No description available

ModelScope community: updated 2025-12-05, indexed 2025-12-06

Download link:

https://modelscope.cn/datasets/ProCreations/Mistake-To-Meaning

Resource description:

# Clear Spelling Dataset

## Overview

The **Mistake to Meaning** (M2M) dataset is a synthetic collection of **100,000 unique English spelling mistakes and their correct forms**, intended for training typo-correction and spell-checking models. It covers common types of mistakes seen in real-world text, such as:

- Keyboard adjacency typos
- Letter swaps and omissions
- Duplicate characters
- Phonetic substitution errors
- Commonly confused homophones (e.g., "their" vs. "there")

## Dataset Format

The dataset is provided in **CSV format** with two columns:

| Column    | Description                                | Example |
|-----------|--------------------------------------------|---------|
| `error`   | The misspelled or incorrect word or phrase | "teh"   |
| `correct` | The correct word or intended phrase        | "the"   |

## Usage

This dataset is suited to:

- Training and fine-tuning **typo correction** models
- Benchmarking **spell-checking algorithms**
- Improving NLP model robustness to noisy real-world input

## Quality Assurance

- **No duplicates:** Each (error, correct) pair is unique.
- **Hand-curated seed set:** Includes hundreds of common misspellings verified against real-world usage patterns.
- **Realistic noise generation:** Applies error transformations that mimic genuine human typing behavior.

## License (MIT)

This dataset is released under the permissive **MIT License**, which allows commercial and non-commercial use, distribution, and modification. Attribution is required.

## Citation

If you use this dataset in your research or projects, please provide attribution similar to:

```
This [your project type] uses the Mistake to Meaning dataset by ProCreations.
```

Enjoy training your typo-correction models!
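The two-column CSV layout and the no-duplicates claim described above can be checked with a few lines of Python. This is a minimal sketch using only the standard library. The sample rows are made up for illustration (only the "teh"/"the" pair comes from the description), and the helper names `load_pairs`, `find_duplicates`, and `build_lookup` are hypothetical, not part of the dataset.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in the documented two-column layout. Only the
# "teh" -> "the" row appears in the dataset description; the rest are
# illustrative assumptions.
SAMPLE_CSV = """error,correct
teh,the
recieve,receive
thier,their
adress,address
"""


def load_pairs(handle):
    """Read (error, correct) pairs from a CSV with `error` and `correct` columns."""
    reader = csv.DictReader(handle)
    return [(row["error"], row["correct"]) for row in reader]


def find_duplicates(pairs):
    """Return the pairs that occur more than once, to check the uniqueness claim."""
    seen, dupes = set(), set()
    for pair in pairs:
        if pair in seen:
            dupes.add(pair)
        seen.add(pair)
    return dupes


def build_lookup(pairs):
    """Map each misspelling to the set of corrections recorded for it."""
    lookup = defaultdict(set)
    for error, correct in pairs:
        lookup[error].add(correct)
    return lookup


pairs = load_pairs(io.StringIO(SAMPLE_CSV))
duplicates = find_duplicates(pairs)
lookup = build_lookup(pairs)

print(len(pairs), "pairs,", len(duplicates), "duplicate pairs")
print(sorted(lookup["teh"]))
```

To run the same checks on a downloaded copy, pass a file opened with `open(path, newline="", encoding="utf-8")` to `load_pairs`, where `path` is wherever the CSV was saved. The lookup maps each misspelling to a set because one misspelling could plausibly have several recorded corrections, even though each (error, correct) pair is unique.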

Provider:

maas

Created:

2025-08-20
