
MuCGEC: A Multi-Reference, Multi-Source Dataset for Chinese Learner Text Correction

Source: Alibaba Cloud Tianchi. Updated: 2026-05-27. Indexed: 2024-03-07.
Download link:
https://tianchi.aliyun.com/dataset/131328
Resource description:
Given a passage of Chinese text, Chinese Grammatical Error Correction (CGEC) aims to automatically correct the errors it contains, including spelling, lexical, and grammatical errors. The technology has broad applications in education, journalism, communication, and search. Existing CGEC evaluation datasets are small, cover a single domain, and provide few reference corrections per sentence. To address these shortcomings, Soochow University and Alibaba DAMO Academy jointly released MuCGEC (Multi-Reference Multi-Source Evaluation Dataset for Chinese Grammatical Error Correction). They also hosted a shared task on the Alibaba Cloud Tianchi platform as part of the 21st China National Conference on Computational Linguistics (CCL 2022), with the goal of advancing CGEC research.

MuCGEC dataset paper: https://arxiv.org/abs/2204.10994
MuCGEC GitHub repository (baseline models and evaluation scripts): https://github.com/HillZhang1999/MuCGEC
CCL 2022 shared task official website: http://www.cips-cl.org/static/CCL2022/cclEval/taskEvaluation/index.html
CCL2022-CLTC Chinese Learner Text Correction evaluation GitHub repository: https://github.com/blcuicall/CCL2022-CLTC
Provider:
Alibaba Cloud Tianchi
Created:
2022-05-31
Collected summary
Dataset introduction
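Background and challenges
Background overview
MuCGEC is a multi-reference, multi-source dataset for Chinese grammatical error correction. It offers diverse reference corrections, diverse text sources, and high annotation quality, and is intended to evaluate CGEC models more accurately. Jointly released by Soochow University and Alibaba DAMO Academy, it contains text written by learners of Chinese and drawn from three different sources, with an average of 2.3 reference corrections per sentence, produced under a strict annotation process.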
The summary above was collected and generated by 遇见数据集.
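To make the idea of multiple references per sentence concrete, the following is a minimal sketch in Python. Everything in it is an assumption for illustration: the toy sentences are not taken from MuCGEC, the in-memory dictionary layout is hypothetical rather than the dataset's actual file format, and the character-level similarity from `difflib` is only a stand-in for a real metric (the official evaluation scripts are in the GitHub repository listed above). It shows how an average reference count such as the 2.3 figure can be computed, and how a system output can be compared against whichever reference it matches best.

```python
from difflib import SequenceMatcher

# Hypothetical toy data (not from MuCGEC): each erroneous source sentence
# maps to one or more acceptable reference corrections.
references = {
    "我昨天去了学校了。": ["我昨天去了学校。", "我昨天去学校了。"],
    "他很喜欢吃苹果们。": ["他很喜欢吃苹果。"],
}


def mean_references_per_sentence(refs):
    """Average number of reference corrections per source sentence."""
    return sum(len(candidates) for candidates in refs.values()) / len(refs)


def best_reference(hypothesis, candidates):
    """Return the reference most similar to the hypothesis, with its score.

    The score is difflib's character-level similarity ratio, used here only
    as a stand-in for a proper grammatical error correction metric.
    """
    scored = [
        (SequenceMatcher(None, hypothesis, ref).ratio(), ref)
        for ref in candidates
    ]
    score, ref = max(scored)
    return ref, score


if __name__ == "__main__":
    print(mean_references_per_sentence(references))
    hypothesis = "我昨天去了学校。"
    print(best_reference(hypothesis, references["我昨天去了学校了。"]))
```

With a single reference, a valid correction that differs from that one reference would be penalized. Taking the best match over several references is one simple way to avoid that.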