KurCorpus 2B: A Multidialectal 2-Billion-Token Corpus for Kurdish Language Modeling
Indexed by NIAID Data Ecosystem on 2026-05-02
Download link:
https://data.mendeley.com/datasets/fb5xhhn6m5
Description:
KurCorpus 2B and KurBERT 2B are two foundational resources designed to advance Natural Language Processing (NLP) for Kurdish, a historically under-resourced language with rich morphological complexity and multiple spoken varieties.
KurCorpus 2B is the largest publicly available Kurdish text corpus, with more than 2 billion tokens across the three major Kurdish dialects: Sorani, Badini, and Hawrami.
KurBERT 2B is the first large-scale multidialectal BERT-based language model for Kurdish, pretrained on KurCorpus 2B using Masked Language Modeling (MLM).
Both resources aim to foster reproducibility, cross-dialect adaptability, and future development in Kurdish NLP and AI research.
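The description names Masked Language Modeling as the pretraining objective but gives no masking details. As a rough illustration only, the sketch below follows the masking rule from the original BERT recipe (select about 15% of tokens; of those, 80% become [MASK], 10% become a random token, 10% stay unchanged). Whether KurBERT 2B used these exact rates is an assumption, and mask_tokens, MASK_TOKEN, and the toy vocabulary are hypothetical names introduced here.

```python
import random

MASK_TOKEN = "[MASK]"  # assumed mask symbol, as in standard BERT vocabularies


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (masked_tokens, labels) using BERT-style 80/10/10 masking.

    labels[i] is the original token at positions selected for prediction
    and None everywhere else.
    """
    rng = rng or random.Random()
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            labels.append(token)          # the model must predict this token
            roll = rng.random()
            if roll < 0.8:
                masked.append(MASK_TOKEN)          # 80%: replace with [MASK]
            elif roll < 0.9:
                masked.append(rng.choice(vocab))   # 10%: replace with a random token
            else:
                masked.append(token)               # 10%: keep the original token
        else:
            labels.append(None)           # not selected, no loss at this position
            masked.append(token)
    return masked, labels


if __name__ == "__main__":
    tokens = "من دڵخۆشم بە زمانی کوردی".split()
    masked, labels = mask_tokens(tokens, vocab=tokens, rng=random.Random(0))
    print(masked)
    print(labels)
```

In practice the masking operates on subword IDs produced by the tokenizer rather than on whitespace-split words; the word-level version is used here only to keep the example self-contained.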
Dataset: KurCorpus 2B
Sources & Coverage
KurCorpus 2B was compiled from diverse sources to capture linguistic richness and dialectal variation:
• News websites
• Social media platforms (Facebook, Telegram)
• Digitized literary texts and online publications
Preprocessing Pipeline
To ensure linguistic consistency, data integrity, and model readiness, KurCorpus 2B underwent a text preprocessing pipeline tailored to multidialectal Kurdish text, intended to make the dataset cleaner and more usable for transformer-based language modeling. The preprocessing steps include:
• Unicode Normalization (NFKC): Normalized text to NFKC form to unify compatibility-equivalent character representations (for example, Arabic presentation forms) across the Sorani, Badini, and Hawrami data.
• Orthographic Correction: Replaced Arabic letter variants with their Kurdish counterparts (ك → ک, ي → ی) to improve tokenization consistency.
• Noise Removal: Removed emojis and HTML tags, and replaced URLs and email addresses with placeholder tags (e.g., [URL], [EMAIL]).
• Whitespace and Punctuation Fixing: Corrected irregular spacing and normalized punctuation marks for better syntactic processing.
• Stopword Filtering: Applied dialect-aware stopword lists to reduce noise while preserving meaningful linguistic structures.
• Token-level Cleaning: Ensured clean sentence boundaries and improved readability for downstream NLP tasks like language modeling and knowledge graph construction.
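The first four steps above can be sketched with the Python standard library. This is a minimal illustration under stated assumptions, not the authors' pipeline: the regular expressions, the emoji ranges, the punctuation set, and the function name preprocess are all hypothetical, and the dialect-aware stopword filtering and token-level cleaning steps are omitted because the description does not specify the word lists or rules they use.

```python
import re
import unicodedata

# Arabic letter variants mapped to Kurdish forms (only the two pairs listed above):
# U+0643 KAF -> U+06A9 KEHEH, U+064A YEH -> U+06CC FARSI YEH.
ORTHOGRAPHIC_MAP = str.maketrans({"\u0643": "\u06A9", "\u064A": "\u06CC"})

# Illustrative patterns; a production pipeline would likely need stricter ones.
HTML_TAG_RE = re.compile(r"<[^>]+>")
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")  # approximate emoji coverage
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
URL_RE = re.compile(r"https?://\S+|www\.\S+")
WHITESPACE_RE = re.compile(r"\s+")
SPACE_BEFORE_PUNCT_RE = re.compile(r"\s+([،؛؟.,!?:;])")


def preprocess(text: str) -> str:
    """Apply normalization, orthographic mapping, noise handling, and spacing fixes."""
    text = unicodedata.normalize("NFKC", text)      # Unicode normalization
    text = text.translate(ORTHOGRAPHIC_MAP)         # orthographic correction
    text = HTML_TAG_RE.sub(" ", text)               # drop HTML tags
    text = EMOJI_RE.sub("", text)                   # drop emojis
    text = EMAIL_RE.sub("[EMAIL]", text)            # placeholder for email addresses
    text = URL_RE.sub("[URL]", text)                # placeholder for URLs
    text = WHITESPACE_RE.sub(" ", text).strip()     # collapse irregular spacing
    text = SPACE_BEFORE_PUNCT_RE.sub(r"\1", text)   # no space before punctuation
    return text


if __name__ == "__main__":
    print(preprocess("<p>كوردي   زمان ،  info@example.com</p>"))
```

NFKC runs first so that presentation-form variants are folded to their base letters before the letter mapping is applied, and HTML tags are removed before URL matching so that a closing tag directly after a link is not swallowed by the URL pattern.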
Model: KurBERT 2B
Architecture
• Base Model: BERT-Base
• Layers: 12 Transformer encoder layers
• Hidden Units: 768
• Attention Heads: 12
• Vocabulary: Multidialectal tokenizer trained on KurCorpus 2B
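To see what these hyperparameters imply for model size, the sketch below counts the weights of a BERT-style encoder. The hidden size and layer count come from the list above; the feed-forward width (3072), maximum sequence length (512), and two segment types are BERT-Base defaults assumed here, and the vocabulary size is a free parameter because the description does not state it. The function name bert_parameter_count is hypothetical.

```python
def bert_parameter_count(vocab_size, hidden=768, layers=12,
                         intermediate=3072, max_positions=512, type_vocab=2):
    """Count parameters of a BERT-style encoder with a pooler head.

    The number of attention heads does not appear: it splits the hidden
    dimension across heads without changing the number of weights.
    """
    layer_norm = 2 * hidden  # scale and bias
    # Token, position, and segment embeddings, followed by one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value, and output projections (weights + biases), plus LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers of the feed-forward block, plus LayerNorm.
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden)
                    + layer_norm)
    # Dense layer applied to the first token's representation.
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler


if __name__ == "__main__":
    # 30522 is the vocabulary size of the English bert-base-uncased model,
    # used here only as a sanity check; it is not KurBERT's vocabulary size.
    print(bert_parameter_count(30522))
```

With the English BERT-Base vocabulary of 30,522 entries this gives 109,482,240 parameters. Since the encoder layers are fixed by the architecture, any difference for KurBERT 2B would come from the size of its multidialectal vocabulary, at 768 additional weights per vocabulary entry.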
Usage Example
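import torch
from transformers import AutoTokenizer, AutoModel

# Load tokenizer & model
# "YourRepo/KurBERT-2B" is a placeholder; substitute the actual repository ID
tokenizer = AutoTokenizer.from_pretrained("YourRepo/KurBERT-2B")
model = AutoModel.from_pretrained("YourRepo/KurBERT-2B")
model.eval()  # disable dropout for inference

# Example input
text = "من دڵخۆشم بە زمانی کوردی"
inputs = tokenizer(text, return_tensors="pt")

# Forward pass without gradient tracking
with torch.no_grad():
    outputs = model(**inputs)

# Shape is (batch_size, sequence_length, hidden_size), with hidden_size = 768
print(outputs.last_hidden_state.shape)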
Applications
KurCorpus 2B and KurBERT 2B support transfer learning for a wide range of tasks:
• Text Classification
• Sentiment Analysis
• Named Entity Recognition (NER)
• Machine Translation
• Knowledge Graph Construction
• Question Answering
Created:
2025-08-21



