uznlp-uz/uz_proverb
Source: Hugging Face · Updated 2026-03-19 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/uznlp-uz/uz_proverb
Resource description:
---
language:
- uz
license: cc-by-4.0
task_categories:
- text-classification
- text-generation
- token-classification
pretty_name: Uzbek Proverbs Dataset (8.5K)
size_categories:
- 1K<n<10K
---
# Uzbek Proverbs Dataset (UzbekProverbs-8.5K)
## 📌 Description
Proverbs are culturally salient figurative expressions, yet Uzbek still lacks standardized NLP resources for their automatic identification. This dataset introduces **UzbekProverbs-8.5K**, a curated, machine-readable proverb resource for Uzbek, derived from a source inventory of 8,514 entries.
The resource contains:
* **8,477 unique proverb strings**
* **2,998 glossed entries**
* **70 normalized thematic groups**
A structured curation pipeline was applied, including:
* label normalization
* duplicate resolution
* canonical identifier assignment
* metadata standardization
These steps address inconsistencies typical of raw proverb collections, such as orthographic variation, uneven annotation coverage, and polythematic duplication (the same proverb listed under several themes).
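
The curation steps listed above can be illustrated with a minimal sketch. It is not the pipeline actually used to build the dataset: the normalization rules, the helper names `normalize_proverb` and `deduplicate`, the `groups` field, and the placeholder group labels are all assumptions made for illustration.

```python
import unicodedata

# Apostrophe-like characters that often vary in Uzbek Latin text (assumed list).
APOSTROPHES = "\u2018\u2019\u02bb\u02bc`"

def normalize_proverb(text: str) -> str:
    """Build a comparison key: NFC form, unified apostrophes, single spaces, lowercase."""
    text = unicodedata.normalize("NFC", text)
    for ch in APOSTROPHES:
        text = text.replace(ch, "'")
    return " ".join(text.split()).lower()

def deduplicate(entries):
    """Merge entries that share a normalized key and assign sequential ids.

    `entries` is an iterable of (proverb, group) pairs. A proverb listed under
    several groups is kept once, and all of its groups are collected.
    """
    merged = {}
    for proverb, group in entries:
        key = normalize_proverb(proverb)
        record = merged.setdefault(key, {"Maqollar": proverb.strip(), "groups": []})
        if group not in record["groups"]:
            record["groups"].append(group)
    # Ids follow first-seen order because dicts preserve insertion order.
    return [{"id": i, **record} for i, record in enumerate(merged.values(), start=1)]

# Illustrative input: two spellings of one proverb plus a placeholder string.
raw_entries = [
    ("Ha’ga 'Hu' kelar.", "Group A"),
    ("Ha'ga  'Hu'  kelar.", "Group B"),
    ("Namuna matn.", "Group A"),  # placeholder text, not a real proverb
]
curated = deduplicate(raw_entries)
for row in curated:
    print(row)
```

Keeping the first-seen spelling as the stored form is only one possible policy; a real pipeline might instead pick the most frequent or a manually verified variant.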
---
## 🧠 Benchmark Definition
This dataset is released as a **benchmark for automatic proverb identification in Uzbek corpora**, supporting reproducible research in low-resource figurative language processing.
### Supported Tasks
#### 1. Proverb Detection (Classification)
* **Input:** sentence
* **Output:** binary label (contains proverb / does not contain proverb)
#### 2. Proverb Localization (Span Detection)
* **Input:** sentence
* **Output:** span of proverb text within the sentence
#### 3. Canonical Proverb Linking
* **Input:** proverb variant (possibly noisy or inflected form)
* **Output:** canonical proverb entry
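
To make the three task definitions concrete, here is a naive string-matching baseline. It is only a sketch under stated assumptions: the function names `find_proverb_span` and `link_to_canonical`, the example sentence, the placeholder entry, and the similarity cutoff are hypothetical, and a serious system would need to handle inflected or reordered variants that exact matching misses.

```python
import difflib

def find_proverb_span(sentence, proverbs):
    """Tasks 1 and 2: return (start, end, proverb) for the first proverb found, else None.

    Matching is case-insensitive and ignores sentence-final punctuation in the
    proverb. Offsets assume that lowercasing keeps the string length unchanged,
    which holds for the Uzbek Latin alphabet.
    """
    lowered = sentence.lower()
    for proverb in proverbs:
        needle = proverb.rstrip(".!?").lower()
        start = lowered.find(needle)
        if needle and start != -1:
            return start, start + len(needle), proverb
    return None

def link_to_canonical(variant, proverbs, cutoff=0.6):
    """Task 3: map a noisy variant to the closest canonical entry, or None."""
    by_key = {p.lower(): p for p in proverbs}
    matches = difflib.get_close_matches(variant.lower(), list(by_key), n=1, cutoff=cutoff)
    return by_key[matches[0]] if matches else None

canonical = ["Ha’ga 'Hu' kelar.", "Namuna matn."]  # second entry is a placeholder
sentence = "Buvim doim Ha’ga 'Hu' kelar deb aytardi."  # illustrative sentence

span = find_proverb_span(sentence, canonical)
contains_proverb = span is not None              # Task 1: binary label
print(contains_proverb)
if span:
    print(sentence[span[0]:span[1]])             # Task 2: localized span
print(link_to_canonical("Haga Hu kelar", canonical))  # Task 3: linking
```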
---
## 📊 Dataset Structure
The dataset is stored in **TSV format (tab-separated)** with the following columns:
| Column name | Description |
| ------------------------ | --------------------- |
| `id` | Unique identifier |
| `Maqollar` | Uzbek proverb text |
| `Maqollarning guruhlari` | Semantic category |
| `Maqollarning ma'nolari` | Explanation / meaning |
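
A file with these columns can be read using only the Python standard library. The sketch below parses an in-memory sample instead of the real file, since the card does not state the file name; the sample row is taken from the example record in this card. Disabling quote handling is an assumption that proverb text may contain literal quote characters.

```python
import csv
import io

# In-memory stand-in for the TSV file (the real file name is not given in this card).
header = ["id", "Maqollar", "Maqollarning guruhlari", "Maqollarning ma'nolari"]
row = [
    "1",
    "Ha’ga 'Hu' kelar.",
    "Yaxshilik va yomonlik haqida maqollar",
    "Yaxshilik yoki yaxshi muomala odatda javobsiz qolmaydi, unga munosib javob qaytadi.",
]
sample_tsv = "\t".join(header) + "\n" + "\t".join(row) + "\n"

# QUOTE_NONE keeps quote characters inside proverbs as ordinary text.
reader = csv.DictReader(io.StringIO(sample_tsv), delimiter="\t", quoting=csv.QUOTE_NONE)
records = list(reader)

for record in records:
    print(record["id"], record["Maqollar"], sep=" | ")
```

Note that `csv.DictReader` returns every field as a string, so `id` needs an explicit `int(...)` conversion if numeric ids are required.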
---
## 🧾 Example
```json
{
"id": 1,
"Maqollar": "Ha’ga 'Hu' kelar.",
"Maqollarning guruhlari": "Yaxshilik va yomonlik haqida maqollar",
"Maqollarning ma'nolari": "Yaxshilik yoki yaxshi muomala odatda javobsiz qolmaydi, unga munosib javob qaytadi."
}
```
---
## 🏷 Label Schema
The dataset includes normalized semantic categories such as:
* Yaxshilik va yomonlik
* Donolik va nodonlik
* Mehnatsevarlik va dangasalik
* Do‘stlik va dushmanlik
* Sabrlilik va sabrsizlik
* Adolat va insofsizlik
* Yaxshi so‘z va yomon so‘z
---
## 📏 Evaluation Protocol
The benchmark supports multiple evaluation settings:
### Classification (Detection)
* Accuracy
* Precision / Recall / F1-score
### Span Detection (Localization)
* Exact Match (EM)
* Token-level F1
### Canonical Linking
* Top-1 accuracy
* Retrieval-based metrics (MRR, Recall@k)
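
The metrics above can be computed with a few small functions. This is a sketch and not an official scoring script: the function names are hypothetical, and token-level F1 is assumed here to mean bag-of-tokens overlap between whitespace-split spans, which the card does not specify.

```python
from collections import Counter

def binary_prf(gold, pred):
    """Precision, recall and F1 for binary labels (1 = contains a proverb)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def exact_match(gold_span, pred_span):
    """1.0 if the predicted span equals the gold span after trimming, else 0.0."""
    return float(gold_span.strip() == pred_span.strip())

def token_f1(gold_span, pred_span):
    """F1 over the multiset of whitespace tokens shared by both spans."""
    gold_tokens, pred_tokens = gold_span.split(), pred_span.split()
    overlap = sum((Counter(gold_tokens) & Counter(pred_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def mean_reciprocal_rank(gold_ids, ranked_lists):
    """Average of 1/rank of the gold id; a missing gold id contributes 0."""
    total = 0.0
    for gold, ranked in zip(gold_ids, ranked_lists):
        if gold in ranked:
            total += 1.0 / (ranked.index(gold) + 1)
    return total / len(gold_ids)

def recall_at_k(gold_ids, ranked_lists, k):
    """Fraction of queries whose gold id appears among the top-k candidates."""
    hits = sum(1 for gold, ranked in zip(gold_ids, ranked_lists) if gold in ranked[:k])
    return hits / len(gold_ids)

print(binary_prf([1, 1, 0, 0], [1, 0, 1, 0]))              # (0.5, 0.5, 0.5)
print(mean_reciprocal_rank([1, 2], [[1, 3], [3, 2]]))      # 0.75
print(recall_at_k([1, 2], [[1, 3], [3, 2]], k=1))          # 0.5
```

Top-1 accuracy for canonical linking is the same quantity as `recall_at_k(..., k=1)`.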
---
## 📊 Data Splits
*Note: predefined splits may be added in future versions.*
Recommended split strategy:
* Train: 80%
* Validation: 10%
* Test: 10%
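
Until predefined splits exist, a reproducible 80/10/10 split can be derived from a fixed random seed. The sketch below is one possible approach: the function name `split_ids`, the seed value, and the assumption that ids run from 1 to 8,477 (one per unique proverb string) are illustrative and not part of the dataset.

```python
import random

def split_ids(ids, seed=42):
    """Shuffle ids with a fixed seed and cut them into 80/10/10 partitions."""
    ids = list(ids)
    random.Random(seed).shuffle(ids)
    n_train = len(ids) * 8 // 10   # integer arithmetic avoids float rounding
    n_val = len(ids) // 10
    return {
        "train": ids[:n_train],
        "validation": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],  # remainder goes to the test set
    }

# Assumed id range: one id per unique proverb string.
splits = split_ids(range(1, 8478))
print({name: len(part) for name, part in splits.items()})
# {'train': 6781, 'validation': 847, 'test': 849}
```

Splitting on canonical ids instead of raw rows keeps spelling variants of the same proverb from landing in both the training and test partitions.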
---
## 🎯 Use Cases
This dataset can be used for:
* 🏷 Proverb classification and detection
* 🧠 Semantic similarity and retrieval
* 🤖 Fine-tuning Uzbek language models (BERT, LLaMA, Qwen, etc.)
* 🔍 Information retrieval systems
* 📚 Educational and linguistic research tools
---
## ⚙️ Loading the Dataset
```python
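from datasets import load_dataset

# Load the dataset from the Hugging Face Hub using this card's repository id.
dataset = load_dataset("uznlp-uz/uz_proverb")

# Print the first record of the default "train" split.
print(dataset["train"][0])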
```
---
## ⚠️ Notes
* Data is in **Uzbek (Latin script)**
* File uses **tab (`\t`) as delimiter**
* Proverbs may exhibit:
  * morphological variation
  * orthographic inconsistency
  * figurative ambiguity
---
## 📜 License
This dataset is released under the **CC-BY-4.0 License**.
Provider:
uznlp-uz



