Kagura-Ahad-123/UrduGEC-Synthetic

Name: Kagura-Ahad-123/UrduGEC-Synthetic
Creator: Kagura-Ahad-123
Published: 2026-04-11 17:48:12
License: No description provided

Source: Hugging Face. Updated 2026-04-11; indexed 2026-04-26.

Download link:

https://hf-mirror.com/datasets/Kagura-Ahad-123/UrduGEC-Synthetic
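A minimal sketch of loading the corpus with the Hugging Face `datasets` library, assuming the repository is reachable under the ID shown in the link above. The helper name `load_urdu_gec` is hypothetical. Streaming is used so the full download is not needed just to inspect a few rows. To go through the mirror, the `HF_ENDPOINT` environment variable can typically be set to `https://hf-mirror.com` before Python starts.

```python
REPO_ID = "Kagura-Ahad-123/UrduGEC-Synthetic"


def load_urdu_gec(streaming=True):
    """Return the train split of the dataset (hypothetical helper).

    With streaming=True the rows are fetched lazily instead of
    downloading every data file up front.
    """
    # Imported lazily so this module can be imported without the
    # third-party package installed (pip install datasets).
    from datasets import load_dataset

    return load_dataset(REPO_ID, split="train", streaming=streaming)
```

Calling `next(iter(load_urdu_gec()))` should then yield one record as a dictionary; the column names to expect are those declared in the dataset card's `features` list.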

Resource description:

---
dataset_info:
  features:
  - name: incorrect_sentence
    dtype: string
  - name: correct_sentence
    dtype: string
  - name: errant_style_edit_tags
    dtype: string
  splits:
  - name: train
    num_bytes: 778699754
    num_examples: 1270499
  download_size: 97562280
  dataset_size: 778699754
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: cc-by-sa-3.0
language:
- ur
tags:
- gec
- urdu
- synthetic-data
- nlp
pretty_name: UrduGEC-Synthetic
size_categories:
- 1M<n<10M
---

# UrduGEC-Synthetic

## Dataset Summary

This is the synthetic training corpus for Urdu Grammatical Error Correction (GEC) introduced in the paper **"Corpora Generation for Urdu Grammatical Error Correction"** (Accepted at ACL Findings).

The dataset contains approximately **1.27 million** sentence pairs. It was created by mining naturally occurring error patterns from Urdu Wikipedia revision histories and re-inflicting them onto clean text (Makhzan Corpus) using a kernel-based inflection pipeline.

## Dataset Structure

- `incorrect_sentence`: Sentence with synthetically injected grammatical errors.
- `correct_sentence`: The original grammatically correct sentence.
- `errant_style_edit_tags`: ERRANT-style tags describing the linguistic edits.

## Methodology

- **Source:** Wikipedia Revision Histories & Makhzan Corpus.
- **Error Inflection:** Kernel-based approach with a context window of $k=3$.
- **Sampling:** Temperature sampling ($\tau = 0.5$) to ensure representation of rare morphological errors.

## Citation

For now, the paper is available in reviewed format on openreview.net. Upon acceptance, we will update the citation with the final version.
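The card does not spell out how the temperature sampling with $\tau = 0.5$ is computed. One common reading, assumed here, is that each error pattern with observed count $c_i$ is drawn with probability proportional to $c_i^{\tau}$, which flattens the distribution when $\tau < 1$ and so gives rare patterns more weight. The sketch below illustrates that assumption only; the function name, the tag labels, and the counts are hypothetical and are not taken from the paper or the dataset.

```python
import random


def temperature_weights(counts, tau=0.5):
    """Map raw counts to probabilities proportional to count ** tau.

    This is an assumed formulation; tau = 1 reproduces the raw
    frequencies and tau < 1 flattens them toward uniform.
    """
    scaled = {pattern: count ** tau for pattern, count in counts.items()}
    total = sum(scaled.values())
    return {pattern: value / total for pattern, value in scaled.items()}


# Hypothetical error-pattern frequencies, for illustration only.
counts = {"VERB:FORM": 900, "NOUN:NUM": 90, "ADP": 10}
weights = temperature_weights(counts, tau=0.5)

# Draw a few patterns to inject, using a seeded generator for repeatability.
rng = random.Random(0)
patterns = list(weights)
sample = rng.choices(patterns, weights=[weights[p] for p in patterns], k=5)
```

With these made-up counts, the rarest pattern's share rises from 1% of the raw frequencies to about 7% after rescaling, while the most frequent pattern drops from 90% to about 70%.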

Provider:

Kagura-Ahad-123
