uznlp-uz/uz_proverb

Name: uznlp-uz/uz_proverb
Creator: uznlp-uz
Published: 2026-03-19 10:49:34
License: not specified in the listing (the dataset card below states CC-BY-4.0)

Source: Hugging Face. Updated: 2026-03-19. Indexed: 2026-03-29.

Download link:

https://hf-mirror.com/datasets/uznlp-uz/uz_proverb

Resource description:

---
language:
- uz
license: cc-by-4.0
task_categories:
- text-classification
- text-generation
- token-classification
pretty_name: Uzbek Proverbs Dataset (8.5K)
size_categories:
- 1K<n<10K
---

# Uzbek Proverbs Dataset (UZBEKPROVERBS-8.5K)

## 📌 Description

Proverbs are culturally salient figurative expressions, yet Uzbek still lacks standardized NLP resources for their automatic identification. This dataset introduces **UzbekProverbs-8.5K**, a curated, machine-readable proverb resource for Uzbek, derived from a source inventory of 8,514 entries.

The resource contains:

* **8,477 unique proverb strings**
* **2,998 glossed entries**
* **70 normalized thematic groups**

A structured curation pipeline was applied, including:

* label normalization
* duplicate resolution
* canonical identifier assignment
* metadata standardization

These steps address inconsistencies typical of raw proverb collections, such as orthographic variation, uneven annotation coverage, and polythematic duplication.

---

## 🧠 Benchmark Definition

This dataset is released as a **benchmark for automatic proverb identification in Uzbek corpora**, supporting reproducible research in low-resource figurative language processing.

### Supported Tasks

#### 1. Proverb Detection (Classification)

* **Input:** sentence
* **Output:** binary label (contains proverb / does not contain proverb)

#### 2. Proverb Localization (Span Detection)

* **Input:** sentence
* **Output:** span of proverb text within the sentence

#### 3. Canonical Proverb Linking

* **Input:** proverb variant (possibly noisy or inflected form)
* **Output:** canonical proverb entry

---

## 📊 Dataset Structure

The dataset is stored in **TSV format (tab-separated)** with the following columns:

| Column name              | Description           |
| ------------------------ | --------------------- |
| `id`                     | Unique identifier     |
| `Maqollar`               | Uzbek proverb text    |
| `Maqollarning guruhlari` | Semantic category     |
| `Maqollarning ma'nolari` | Explanation / meaning |

---

## 🧾 Example

```json
{
  "id": 1,
  "Maqollar": "Ha’ga 'Hu' kelar.",
  "Maqollarning guruhlari": "Yaxshilik va yomonlik haqida maqollar",
  "Maqollarning ma'nolari": "Yaxshilik yoki yaxshi muomala odatda javobsiz qolmaydi, unga munosib javob qaytadi."
}
```

---

## 🏷 Label Schema

The dataset includes normalized semantic categories such as:

* Yaxshilik va yomonlik
* Donolik va nodonlik
* Mehnatsevarlik va dangasalik
* Do‘stlik va dushmanlik
* Sabrlilik va sabrsizlik
* Adolat va insofsizlik
* Yaxshi so‘z va yomon so‘z

---

## 📏 Evaluation Protocol

The benchmark supports multiple evaluation settings:

### Classification (Detection)

* Accuracy
* Precision / Recall / F1-score

### Span Detection (Localization)

* Exact Match (EM)
* Token-level F1

### Canonical Linking

* Top-1 accuracy
* Retrieval-based metrics (MRR, Recall@k)

---

## 📊 Data Splits

*Note: predefined splits may be added in future versions.*

Recommended split strategy:

* Train: 80%
* Validation: 10%
* Test: 10%

---

## 🎯 Use Cases

This dataset can be used for:

* 🏷 Proverb classification and detection
* 🧠 Semantic similarity and retrieval
* 🤖 Fine-tuning Uzbek language models (BERT, LLaMA, Qwen, etc.)
* 🔍 Information retrieval systems
* 📚 Educational and linguistic research tools

---

## ⚙️ Loading the Dataset

```python
from datasets import load_dataset

# Repository ID as given in the original card. This listing publishes the
# dataset as "uznlp-uz/uz_proverb", so that ID may be the one to use instead.
dataset = load_dataset("ruhilloalaev/proverb_Uzbek")

# Print the first record of the "train" split.
print(dataset["train"][0])
```

---

## ⚠️ Notes

* Data is in **Uzbek (Latin script)**
* File uses **tab (`\t`) as delimiter**
* Proverbs may exhibit:
  * morphological variation
  * orthographic inconsistency
  * figurative ambiguity

---

## 📜 License

This dataset is released under the **CC-BY-4.0 License**.
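
---

## 🧪 Sketch: Reading the TSV and Splitting 80/10/10

The card describes a tab-delimited file with four named columns and recommends an 80/10/10 split without shipping predefined splits. The following is a minimal sketch of one way to do both with the Python standard library. The helper names `read_tsv` and `split_80_10_10`, the fixed seed, and the placeholder rows are assumptions made for illustration; they are not part of the dataset or its tooling.

```python
import csv
import io
import random

# Column names as listed in the card's "Dataset Structure" table.
COLUMNS = ["id", "Maqollar", "Maqollarning guruhlari", "Maqollarning ma'nolari"]

def read_tsv(handle):
    """Parse tab-separated rows into dicts keyed by the header line."""
    # QUOTE_NONE keeps quotation marks inside proverbs as literal text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)

def split_80_10_10(rows, seed=0):
    """Shuffle deterministically, then cut into train/validation/test."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10   # integer arithmetic avoids float rounding
    n_val = len(shuffled) // 10
    return {
        "train": shuffled[:n_train],
        "validation": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],
    }

# Placeholder rows standing in for the real file (not actual dataset content).
# The last column is left empty, mirroring entries that have no gloss.
sample = "\t".join(COLUMNS) + "\n" + "".join(
    f"{i}\tplaceholder proverb {i}\tplaceholder group\t\n" for i in range(1, 21)
)

rows = read_tsv(io.StringIO(sample))
splits = split_80_10_10(rows)
print({name: len(part) for name, part in splits.items()})
# → {'train': 16, 'validation': 2, 'test': 2}
```

For the real file, the same `read_tsv` function could be given an open file handle (opened with `encoding="utf-8"` and `newline=""`) instead of the in-memory string.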
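
---

## 🧪 Sketch: Scoring Span Detection and Linking

The evaluation protocol names Exact Match, token-level F1, MRR, and Recall@k but does not fix their exact definitions. The sketch below shows one common reading: whitespace tokenization with bag-of-tokens overlap for F1, and a single gold identifier per query for the retrieval metrics. These choices, along with all function names, are assumptions rather than the benchmark's official scorer.

```python
from collections import Counter

def exact_match(pred, gold):
    """1.0 if the predicted span equals the gold span after trimming, else 0.0."""
    return float(pred.strip() == gold.strip())

def token_f1(pred, gold):
    """Token-level F1 over whitespace tokens (multiset overlap)."""
    pred_tokens, gold_tokens = pred.split(), gold.split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def mrr(ranked_lists, gold_ids):
    """Mean reciprocal rank; a query whose gold ID is missing contributes 0."""
    total = 0.0
    for ranked, gold in zip(ranked_lists, gold_ids):
        if gold in ranked:
            total += 1.0 / (ranked.index(gold) + 1)
    return total / len(gold_ids)

def recall_at_k(ranked_lists, gold_ids, k):
    """Fraction of queries whose gold ID appears in the top k candidates."""
    hits = sum(gold in ranked[:k] for ranked, gold in zip(ranked_lists, gold_ids))
    return hits / len(gold_ids)

# Toy inputs (hypothetical tokens and IDs, not dataset content).
f1 = token_f1("w1 w2 w3", "w2 w3 w4")          # 2 shared tokens out of 3 each
ranked = [[3, 1, 2], [5, 4]]                   # candidate IDs per query
gold = [1, 4]                                  # gold canonical ID per query
print(f1, mrr(ranked, gold), recall_at_k(ranked, gold, k=1))
```

Top-1 accuracy for canonical linking is the same quantity as `recall_at_k(..., k=1)` when each query has exactly one gold entry.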
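
---

## 🧪 Sketch: A Baseline for Canonical Proverb Linking

The linking task maps a noisy proverb variant to its canonical entry, and the notes mention orthographic inconsistency. The following is a rough baseline under stated assumptions: apostrophe variants are folded to one character, other punctuation is dropped, and fuzzy string matching from the standard library's `difflib` handles the remaining differences. The apostrophe set, the `0.8` cutoff, and the names `normalize` and `link` are illustrative choices; the card does not describe the authors' own linking method.

```python
import difflib
import unicodedata

# Characters often used interchangeably for the Uzbek Latin apostrophe (assumed set).
APOSTROPHES = "‘’ʻʼ`´"

def normalize(text):
    """Lowercase, unify apostrophes, drop other punctuation, squeeze spaces."""
    text = unicodedata.normalize("NFC", text).lower()
    for ch in APOSTROPHES:
        text = text.replace(ch, "'")
    kept = [c for c in text if c.isalnum() or c.isspace() or c == "'"]
    return " ".join("".join(kept).split())

def link(variant, canonical, cutoff=0.8):
    """Return the ID of the closest canonical proverb, or None if none is close."""
    index = {normalize(text): pid for pid, text in canonical.items()}
    key = normalize(variant)
    if key in index:                      # exact match after normalization
        return index[key]
    close = difflib.get_close_matches(key, list(index), n=1, cutoff=cutoff)
    return index[close[0]] if close else None

# Entry 1 is the example from the card; entry 2 is a hypothetical placeholder.
canonical = {1: "Ha’ga 'Hu' kelar.", 2: "placeholder proverb two"}

print(link('Ha`ga "Hu" kelar!', canonical))   # → 1
print(link("xyz", canonical))                 # → None
```

Character-level matching of this kind does not model Uzbek morphology, so inflected variants may need a stemmer or an embedding-based retriever on top of it.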

Provider:

uznlp-uz
