Kagura-Ahad-123/UrduGEC-Synthetic
Source: Hugging Face. Updated 2026-04-11; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/Kagura-Ahad-123/UrduGEC-Synthetic
Resource description:
---
dataset_info:
  features:
  - name: incorrect_sentence
    dtype: string
  - name: correct_sentence
    dtype: string
  - name: errant_style_edit_tags
    dtype: string
  splits:
  - name: train
    num_bytes: 778699754
    num_examples: 1270499
  download_size: 97562280
  dataset_size: 778699754
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: cc-by-sa-3.0
language:
- ur
tags:
- gec
- urdu
- synthetic-data
- nlp
pretty_name: UrduGEC-Synthetic
size_categories:
- 1M<n<10M
---
# UrduGEC-Synthetic
## Dataset Summary
This is the synthetic training corpus for Urdu Grammatical Error Correction (GEC) introduced in the paper **"Corpora Generation for Urdu Grammatical Error Correction"** (accepted at ACL Findings).
The dataset contains **1,270,499** sentence pairs (approximately 1.27 million). It was created by mining naturally occurring error patterns from Urdu Wikipedia revision histories and re-applying them to clean text from the Makhzan Corpus using a kernel-based inflection pipeline.
## Dataset Structure
- `incorrect_sentence`: Sentence with synthetically injected grammatical errors.
- `correct_sentence`: The original grammatically correct sentence.
- `errant_style_edit_tags`: ERRANT-style tags describing the linguistic edits.
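
The three columns above can be read with the Hugging Face `datasets` library. The following is a minimal sketch, assuming the repository id shown on this page and that `datasets` is installed; the helper names `to_pair` and `stream_pairs` are illustrative and not part of the dataset. Streaming mode is used so that the full corpus does not have to be downloaded just to inspect a few rows.

```python
from typing import Dict, Iterator, Tuple

# Repository id as shown on this page; the column names come from the card metadata.
REPO_ID = "Kagura-Ahad-123/UrduGEC-Synthetic"
COLUMNS = ("incorrect_sentence", "correct_sentence", "errant_style_edit_tags")


def to_pair(record: Dict[str, str]) -> Tuple[str, str]:
    """Return (source, target) for GEC training, checking the expected columns exist."""
    missing = [name for name in COLUMNS if name not in record]
    if missing:
        raise KeyError(f"record is missing columns: {missing}")
    return record["incorrect_sentence"], record["correct_sentence"]


def stream_pairs(limit: int = 3) -> Iterator[Tuple[str, str]]:
    """Yield the first `limit` (incorrect, correct) pairs without a full download."""
    # Imported lazily so the helper above works without the library installed.
    from datasets import load_dataset  # pip install datasets

    dataset = load_dataset(REPO_ID, split="train", streaming=True)
    for index, record in enumerate(dataset):
        if index >= limit:
            break
        yield to_pair(record)
```

Calling `list(stream_pairs(limit=3))` would fetch three sentence pairs over the network. The `errant_style_edit_tags` column is stored as a plain string, so its internal layout should be inspected before any parsing is attempted.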
## Methodology
- **Source:** Wikipedia Revision Histories & Makhzan Corpus.
- **Error Inflection:** Kernel-based approach with a context window of $k=3$.
- **Sampling:** Temperature sampling ($\tau = 0.5$) to ensure representation of rare morphological errors.
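
To illustrate why a temperature below 1 helps rare error types, here is a small sketch. It assumes the common formulation in which a pattern with count $c_i$ is sampled with probability $p_i \propto c_i^{\tau}$; the card does not spell out the exact formula, so this is an assumption rather than the authors' implementation. The pattern names and counts are made up for illustration.

```python
import random
from typing import Dict, List


def temperature_weights(counts: Dict[str, int], tau: float = 0.5) -> Dict[str, float]:
    """Turn raw pattern counts into sampling probabilities p_i proportional to c_i ** tau.

    With tau < 1 the distribution is flattened, so rare patterns are
    sampled more often than their raw frequency would suggest.
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    scaled = {name: count ** tau for name, count in counts.items() if count > 0}
    total = sum(scaled.values())
    return {name: weight / total for name, weight in scaled.items()}


def sample_patterns(counts: Dict[str, int], n: int, tau: float = 0.5, seed: int = 0) -> List[str]:
    """Draw n pattern names according to the temperature-scaled probabilities."""
    rng = random.Random(seed)
    weights = temperature_weights(counts, tau)
    names = list(weights)
    return rng.choices(names, weights=[weights[name] for name in names], k=n)


# Hypothetical error-pattern counts, not taken from the dataset.
toy_counts = {"frequent": 81, "medium": 16, "rare": 4}
weights = temperature_weights(toy_counts, tau=0.5)
```

In this toy example the rare pattern makes up 4 of 101 observations (about 4%) in the raw counts, but receives a sampling probability of 2/15 (about 13%) after scaling with $\tau = 0.5$.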
## Citation
For now, the paper is available in its reviewed form on openreview.net. Upon acceptance, we will update the citation with the final version.
Provided by:
Kagura-Ahad-123



