
Kagura-Ahad-123/UrduGEC-Synthetic

Source: Hugging Face. Updated 2026-04-11; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/Kagura-Ahad-123/UrduGEC-Synthetic
Description:
---
dataset_info:
  features:
  - name: incorrect_sentence
    dtype: string
  - name: correct_sentence
    dtype: string
  - name: errant_style_edit_tags
    dtype: string
  splits:
  - name: train
    num_bytes: 778699754
    num_examples: 1270499
  download_size: 97562280
  dataset_size: 778699754
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: cc-by-sa-3.0
language:
- ur
tags:
- gec
- urdu
- synthetic-data
- nlp
pretty_name: UrduGEC-Synthetic
size_categories:
- 1M<n<10M
---

# UrduGEC-Synthetic

## Dataset Summary

This is the synthetic training corpus for Urdu Grammatical Error Correction (GEC) introduced in the paper **"Corpora Generation for Urdu Grammatical Error Correction"** (Accepted at ACL Findings).

The dataset contains approximately **1.27 million** sentence pairs. It was created by mining naturally occurring error patterns from Urdu Wikipedia revision histories and re-inflicting them onto clean text (Makhzan Corpus) using a kernel-based inflection pipeline.

## Dataset Structure

- `incorrect_sentence`: Sentence with synthetically injected grammatical errors.
- `correct_sentence`: The original grammatically correct sentence.
- `errant_style_edit_tags`: ERRANT-style tags describing the linguistic edits.

## Methodology

- **Source:** Wikipedia Revision Histories & Makhzan Corpus.
- **Error Inflection:** Kernel-based approach with a context window of $k=3$.
- **Sampling:** Temperature sampling ($\tau = 0.5$) to ensure representation of rare morphological errors.

## Citation

For now, the paper is available in reviewed format on openreview.net. Upon acceptance, we will update the citation with the final version.
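The temperature sampling step listed under Methodology can be illustrated with a minimal sketch. The card does not give the exact formula, so this assumes the common convention in which each pattern's empirical probability is raised to the power $\tau$ and renormalized; with $\tau = 0.5$ this flattens the distribution and raises the share of rare patterns, which matches the stated goal. The pattern names, counts, and function names below are hypothetical and are not taken from the dataset or the paper.

```python
import random

def temperature_weights(counts, tau=0.5):
    """Rescale empirical frequencies as p_i ** tau, then renormalize.

    Assumption: the exponent is tau itself (tau < 1 flattens the
    distribution). The dataset card does not state the exact formula.
    """
    total = sum(counts.values())
    scaled = {name: (c / total) ** tau for name, c in counts.items()}
    norm = sum(scaled.values())
    return {name: value / norm for name, value in scaled.items()}

def sample_error_patterns(counts, n, tau=0.5, seed=0):
    """Draw n pattern names according to the temperature-scaled weights."""
    rng = random.Random(seed)
    weights = temperature_weights(counts, tau)
    names = list(weights)
    return rng.choices(names, weights=[weights[k] for k in names], k=n)

# Hypothetical mined pattern counts, for illustration only.
pattern_counts = {"verb_agreement": 900, "postposition": 90, "rare_morph": 10}

probs = temperature_weights(pattern_counts, tau=0.5)
draws = sample_error_patterns(pattern_counts, n=1000, tau=0.5)

for name, p in probs.items():
    print(f"{name}: raw={pattern_counts[name] / 1000:.3f} scaled={p:.3f}")
```

Under these assumed counts, the rarest pattern moves from a raw share of 1% to roughly 7% after scaling, while setting `tau=1.0` recovers the raw frequencies unchanged.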
Provider:
Kagura-Ahad-123