lang-uk/Reddit-MultiGEC
Source: Hugging Face. Updated: 2025-07-28. Indexed: 2025-07-05.
Download link:
https://hf-mirror.com/datasets/lang-uk/Reddit-MultiGEC
Description:
The Reddit-MultiGEC dataset is a large multilingual corpus of posts scraped from Reddit and automatically corrected using the (TBU) approach. It covers several languages, chiefly English, German, Czech, and Italian, among others. The dataset consists of two main files: reddit_multi_gec.csv, which contains the original and corrected texts, and reddit_uk_annotations.csv, which contains human annotations for 1,500 Ukrainian samples. It is suited to text-to-text generation and text generation tasks.
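As a rough illustration of the two-file layout described above, the following Python sketch builds a direct URL for each CSV and parses CSV text into rows. Several things here are assumptions rather than facts from the listing: the `resolve/main` URL pattern (the usual Hugging Face layout, which the mirror is expected to follow), the helper names `file_url` and `read_rows`, and the column names in the in-memory sample, which are purely hypothetical because the listing does not state the real ones.

```python
import csv
import io

REPO_ID = "lang-uk/Reddit-MultiGEC"
# File names as given in the dataset description.
FILES = ("reddit_multi_gec.csv", "reddit_uk_annotations.csv")


def file_url(filename, base="https://hf-mirror.com"):
    """Build a direct-download URL for one file in the dataset repo.

    Assumption: the repo follows the usual layout
    <base>/datasets/<repo>/resolve/main/<file>.
    """
    return f"{base}/datasets/{REPO_ID}/resolve/main/{filename}"


def read_rows(text_stream):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(text_stream))


# Tiny in-memory stand-in for a downloaded file.
# The column names "text" and "correction" are hypothetical.
sample = io.StringIO('text,correction\n"I has a cat","I have a cat"\n')
rows = read_rows(sample)
urls = [file_url(name) for name in FILES]
```

Once a file has been downloaded, opening it with `open(path, newline="", encoding="utf-8")` and passing the handle to `read_rows` should yield one dict per sample. Inspect `rows[0].keys()` first to see the actual column names before relying on any of them.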
Provided by:
lang-uk



