MD-CSC: Multi-Domain Chinese Spelling Check Dataset

Name: MD-CSC Multi-Domain Chinese Spelling Check Dataset
Creator: Soochow University (苏州大学)
License: not specified

Source: 千言 (Qianyan) datasets. Indexed 2024-05-15.

Download link:

https://www.luge.ai/#/luge/dataDetail?id=67

Resource description:

Chinese Spelling Check (CSC) is the task of automatically correcting wrongly written characters in a given Chinese sentence; the technique is widely used in scenarios such as search matching and document generation. Existing CSC datasets are small in scale, cover a single domain, and lack word-level spelling errors. To address these problems, we release the Multi-Domain Chinese Spelling Check dataset (MD-CSC). Specifically, we collected public corpora from three domains (law, medicine, and official document writing) and, using several existing input methods, manually annotated more than 8,000 Chinese sentence pairs (source sentence, target sentence). The source sentence may contain spelling errors, and the target sentence is the correct sentence. The main features of the dataset are: 1) broad domain coverage, spanning law, medicine, and official document writing; 2) a complete set of error types, including both character-level and word-level spelling errors.
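To make the source/target pair structure concrete, here is a minimal sketch of how one might locate the corrected spans in a pair. It is not the dataset's official tooling: the on-disk file format is not described here, so the pairs are given as plain Python strings, the two example sentences are invented for illustration and are not taken from MD-CSC, and the `Edit` type, `diff_pair`, and `error_level` are hypothetical names. Treating a one-character replacement as "character-level" and a longer contiguous replacement as "word-level" is an assumption made for this sketch, not the annotation guideline of the dataset.

```python
from difflib import SequenceMatcher
from typing import List, NamedTuple


class Edit(NamedTuple):
    start: int   # offset of the edited span in the source sentence
    wrong: str   # text as it appears in the source sentence
    right: str   # corrected text from the target sentence


def diff_pair(source: str, target: str) -> List[Edit]:
    """Return the spans where the source sentence differs from the target."""
    matcher = SequenceMatcher(None, source, target, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            edits.append(Edit(i1, source[i1:i2], target[j1:j2]))
    return edits


def error_level(edit: Edit) -> str:
    """Heuristic (an assumption of this sketch): one character vs. a longer span."""
    return "character" if len(edit.wrong) == 1 and len(edit.right) == 1 else "word"


# Invented example pairs (source sentence, target sentence); not MD-CSC data.
pairs = [
    ("被告人因涉嫌盗窃被依法带捕。", "被告人因涉嫌盗窃被依法逮捕。"),
    ("请各单位认真管着执行。", "请各单位认真贯彻执行。"),
]

for source, target in pairs:
    for edit in diff_pair(source, target):
        print(edit.start, edit.wrong, "->", edit.right, error_level(edit))
```

A pair whose source sentence is already correct yields an empty list, which matches the note above that source sentences only may contain errors.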

Provided by:

Soochow University (苏州大学)

The content above was collected and summarized by 遇见数据集.