Mistake-To-Meaning

ModelScope Community | Updated 2025-12-05 | Indexed 2025-12-06
Download link:
https://modelscope.cn/datasets/ProCreations/Mistake-To-Meaning
Description:
# Clear Spelling Dataset

## Overview

The **Mistake to Meaning** (M2M) dataset is a synthetic collection of **100,000 unique English spelling mistakes and their correct forms**, intended for training typo-correction and spell-checking models. It covers common error types seen in real-world text:

- Keyboard adjacency typos
- Letter swaps and omissions
- Duplicate characters
- Phonetic substitution errors
- Commonly confused homophones (e.g., "their" vs. "there")

## Dataset Format

The dataset is provided in **CSV format** with two columns:

| Column    | Description                                 | Example |
|-----------|---------------------------------------------|---------|
| `error`   | The misspelled or incorrect word or phrase  | "teh"   |
| `correct` | The correct word or intended phrase         | "the"   |

## Usage

This dataset is suited to:

- Training and fine-tuning **typo correction** models
- Benchmarking **spell-checking algorithms**
- Improving NLP model robustness to noisy real-world input

## Quality Assurance

- **No duplicates:** Each (error, correct) pair is unique.
- **Hand-curated seed set:** Includes hundreds of common misspellings verified against real-world usage patterns.
- **Realistic noise generation:** Applies error transformations that mimic genuine human typing behavior.

## License (MIT)

This dataset is released under the permissive **MIT License**, which allows commercial and non-commercial use, distribution, and modification. Attribution is required.

## Citation

If you use this dataset in your research or projects, please provide attribution similar to:

```
This [your project type] uses the Mistake to Meaning dataset by ProCreations.
```

Enjoy training your typo-correction models!
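As a rough illustration of the two-column format and the benchmarking use case described in the dataset card, the sketch below reads `error`/`correct` pairs with Python's standard `csv` module and scores a toy corrector by exact match. The helper names `load_pairs` and `accuracy`, the in-memory sample, and the lookup-table baseline are assumptions made for illustration. Only the "teh" → "the" row comes from the card, and the card does not state the actual CSV file name.

```python
import csv
import io


def load_pairs(fileobj):
    """Read (error, correct) tuples from a CSV with `error` and `correct` columns."""
    reader = csv.DictReader(fileobj)
    return [(row["error"], row["correct"]) for row in reader]


def accuracy(corrector, pairs):
    """Fraction of pairs where corrector(error) exactly equals the correct form."""
    if not pairs:
        return 0.0
    hits = sum(1 for error, correct in pairs if corrector(error) == correct)
    return hits / len(pairs)


# In-memory stand-in for the real file. Only the first data row appears in the
# dataset card; the other two rows are illustrative assumptions.
SAMPLE_CSV = "error,correct\nteh,the\nrecieve,receive\nthier,their\n"

pairs = load_pairs(io.StringIO(SAMPLE_CSV))

# Hypothetical baseline: a lookup table that knows a single correction and
# returns the input unchanged otherwise.
lookup = {"teh": "the"}


def baseline(word):
    return lookup.get(word, word)


score = accuracy(baseline, pairs)
print(f"{len(pairs)} pairs, baseline exact-match accuracy = {score:.2f}")
```

To run this against the downloaded file, replace the `io.StringIO` sample with `open(path, newline="", encoding="utf-8")`, where `path` points to wherever the CSV was saved locally. Exact-match accuracy is only one possible metric; it was chosen here because each row maps one error to one intended form.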

Provider: maas
Created: 2025-08-20