Grammar Error Correction Corpus for Czech (GECCC)
arXiv · Updated 2022-04-21 · Indexed 2024-06-21
Download link:
http://hdl.handle.net/11234/1-4639
Overview:
The Grammar Error Correction Corpus for Czech (GECCC) is a large and diverse Czech grammar error correction dataset created by Charles University. It contains 83,058 sentences covering four domains: formal essays written by native Czech students, informal website discussions, essays written by Romani children and adolescents, and essays written by non-native learners of Czech. The corpus is intended to address the scarcity of data resources for non-English languages while offering broad domain coverage. It is professionally annotated for grammar error correction and is one of the largest known grammar error correction datasets for a language other than English.
Provider:
Charles University
Created:
2022-01-15



