Dovud-Asadov/uzbek-embedding-dataset
Source: Hugging Face. Updated 2026-03-27; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Dovud-Asadov/uzbek-embedding-dataset
Resource description:
---
language:
- uz
tags:
- embedding
- sentence-similarity
- information-retrieval
- uzbek
- e5
license: apache-2.0
task_categories:
- sentence-similarity
- feature-extraction
size_categories:
- 10K<n<100K
---
# Uzbek Embedding Dataset
40,338 Uzbek query-passage triplets for training text embedding and retrieval models. The examples were generated from Uzbek news articles (kun.uz and other sources).
## Dataset Structure
| Split | Rows |
|:------|:-----|
| train | 38,321 |
| test | 2,017 |
### Fields
| Field | Description |
|:------|:------------|
| `query` | Uzbek question/search query |
| `positive` | Relevant passage that answers the query |
| `negative_1` | Hard negative passage (topically similar but not relevant) |
| `negative_2` | Hard negative passage (may be empty) |
| `negative_3` | Hard negative passage (may be empty) |
| `source_url` | Source article URL |
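
Because `negative_2` and `negative_3` may be empty, a row can yield between one and three usable (query, positive, negative) triplets. The following is a minimal sketch of that expansion. The helper `expand_triplets` and the sample row are hypothetical, and it assumes an empty negative is stored as either an empty string or `None`, which the card does not specify.

```python
from typing import Dict, List, Optional, Tuple

# Field names as listed in the table above.
NEGATIVE_FIELDS = ("negative_1", "negative_2", "negative_3")


def expand_triplets(row: Dict[str, Optional[str]]) -> List[Tuple[str, str, str]]:
    """Return one (query, positive, negative) triplet per non-empty negative."""
    triplets = []
    for field in NEGATIVE_FIELDS:
        # Assumption: an empty negative is "" or None; both are skipped.
        negative = (row.get(field) or "").strip()
        if negative:
            triplets.append((row["query"], row["positive"], negative))
    return triplets


# Hypothetical row for illustration only; not taken from the dataset.
example_row = {
    "query": "example query",
    "positive": "example relevant passage",
    "negative_1": "example hard negative",
    "negative_2": "",
    "negative_3": None,
    "source_url": "https://example.com/article",
}

triplets = expand_triplets(example_row)
print(len(triplets))
```

Flattening rows this way suits triplet-style losses; a loss that accepts several negatives per query could instead consume the non-empty negatives of a row together.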
## Usage
```python
from datasets import load_dataset
ds = load_dataset("Dovud-Asadov/uzbek-embedding-dataset")
print(ds["train"][0])
```
## Intended Use
This dataset is intended for training and evaluating Uzbek text embedding models for semantic search and information retrieval. It was used to fine-tune [Dovud-Asadov/e5-uz-v3](https://huggingface.co/Dovud-Asadov/e5-uz-v3).
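
One simple way to evaluate on rows of this shape is to check whether the positive passage scores higher than every hard negative for its query. The sketch below is one possible metric, not the evaluation protocol used for e5-uz-v3. The function names are hypothetical, and `toy_embed` is a letter-count stand-in for a real embedding model so that the example runs without downloads.

```python
import math
from typing import Callable, Dict, Iterable, List, Optional, Sequence

Vector = Sequence[float]
Row = Dict[str, Optional[str]]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def positive_ranked_first(row: Row, embed: Callable[[str], Vector]) -> bool:
    """True if the positive outscores every non-empty negative in the row."""
    negatives = [row[k] for k in ("negative_1", "negative_2", "negative_3") if row.get(k)]
    query_vec = embed(row["query"])
    positive_score = cosine(query_vec, embed(row["positive"]))
    return all(positive_score > cosine(query_vec, embed(n)) for n in negatives)


def top1_accuracy(rows: Iterable[Row], embed: Callable[[str], Vector]) -> float:
    """Fraction of rows in which the positive is ranked above all negatives."""
    rows = list(rows)
    hits = sum(positive_ranked_first(r, embed) for r in rows)
    return hits / len(rows) if rows else 0.0


def toy_embed(text: str) -> List[float]:
    """Stand-in embedding (counts of letters a-z); replace with a real model."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec


# Hypothetical rows for illustration only; not taken from the dataset.
rows = [
    {"query": "abc", "positive": "abcabc", "negative_1": "xyz", "negative_2": "", "negative_3": ""},
    {"query": "abc", "positive": "xyz", "negative_1": "abc", "negative_2": "", "negative_3": ""},
]
accuracy = top1_accuracy(rows, toy_embed)
print(accuracy)
```

To score a real model, pass a function that maps a string to its embedding vector in place of `toy_embed` and iterate over the `test` split.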
Provider:
Dovud-Asadov



