klei1/bleta-sq-dataset-v1
Source: Hugging Face | Updated: 2026-04-20 | Indexed: 2026-04-26
Download link:
https://hf-mirror.com/datasets/klei1/bleta-sq-dataset-v1
Resource description:
---
language:
- sq
- en
license: apache-2.0
task_categories:
- text-generation
- question-answering
tags:
- albanian
- alpaca
- instruction-tuning
- bleta
size_categories:
- 10K<n<100K
---
# Bleta SQ Instruct v1
A cleaned instruction-following dataset for Albanian-language fine-tuning, used to train the **Bleta** AI assistant.
## Dataset Details
- **Total rows:** 39,873
- **Language:** Albanian (sq)
- **Format:** Alpaca (instruction / input / output)
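For illustration, a row in the Alpaca format can be rendered into a single training prompt roughly as follows. This is a minimal sketch: the card does not state which prompt template was used to train Bleta, so the `### Instruction:` / `### Input:` / `### Response:` headers, the `format_alpaca` helper, and the example row are all assumptions rather than part of the dataset.

```python
def format_alpaca(row: dict) -> str:
    """Render one Alpaca-style row (instruction / input / output) as a prompt string."""
    instruction = row["instruction"].strip()
    context = (row.get("input") or "").strip()  # "input" is often empty in Alpaca data
    output = row["output"].strip()
    if context:
        return (
            "### Instruction:\n" + instruction + "\n\n"
            "### Input:\n" + context + "\n\n"
            "### Response:\n" + output
        )
    # Rows without extra context omit the Input section entirely.
    return "### Instruction:\n" + instruction + "\n\n### Response:\n" + output


# Hypothetical row, written for this example only (not taken from the dataset).
example = {
    "instruction": "Përkthe në anglisht.",
    "input": "Mirëdita",
    "output": "Good afternoon",
}
print(format_alpaca(example))
```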
## Composition
| Split | Rows | Description |
|---|---|---|
| Albanian Alpaca | 38,480 | Cleaned from saillab/alpaca-albanian-cleaned (removed ~12K Afrikaans rows) |
| Bleta Identity | 1,393 | Grammatically correct Albanian identity Q&A for the Bleta assistant |
## Cleaning
- Removed C1 control characters (U+0080 to U+009F) that corrupted the Albanian letters ë and ç
- Language-filtered: kept only Albanian (sq) rows, removed ~12,300 Afrikaans rows
- Deduplicated on (instruction, output)
- Bleta identity uses correct feminine grammar throughout
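
The first three cleaning steps above can be sketched as a small pipeline. This is not the author's actual script, which the card does not include. The `clean_rows` helper and the `detect_lang` callable are hypothetical, and the language detector is left pluggable because the card does not say which tool was used. The sketch also assumes every field is a string and that language is judged on the `output` field.

```python
import re

# C1 control characters occupy U+0080..U+009F; ë (U+00EB) and ç (U+00E7) are outside this range.
C1_CONTROLS = re.compile(r"[\u0080-\u009f]")


def strip_c1(text: str) -> str:
    """Delete stray C1 control characters from a string."""
    return C1_CONTROLS.sub("", text)


def clean_rows(rows, detect_lang):
    """Strip C1 controls, keep Albanian rows, and deduplicate on (instruction, output).

    `detect_lang` is a hypothetical callable returning an ISO 639-1 code such as "sq".
    """
    seen = set()
    cleaned = []
    for row in rows:
        row = {key: strip_c1(value) for key, value in row.items()}
        if detect_lang(row["output"]) != "sq":
            continue  # drop non-Albanian rows (e.g. Afrikaans)
        dedup_key = (row["instruction"], row["output"])
        if dedup_key in seen:
            continue  # drop exact duplicates
        seen.add(dedup_key)
        cleaned.append(row)
    return cleaned


# Toy rows and a stub detector, invented for this example only.
rows = [
    {"instruction": "Përshëndetje", "input": "", "output": "Mirë\u0085dita"},
    {"instruction": "Përshëndetje", "input": "", "output": "Mirëdita"},
    {"instruction": "Groet", "input": "", "output": "Goeie dag"},
]
stub_detector = lambda text: "af" if "goeie" in text.lower() else "sq"
result = clean_rows(rows, stub_detector)
print(len(result))
```

Stripping control characters before deduplicating matters here: the first two toy rows only become identical once the stray character is removed.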
## Usage
```python
from datasets import load_dataset
# Repository ID taken from the dataset's download link
ds = load_dataset("klei1/bleta-sq-dataset-v1")
```
## Source
- Base: [saillab/alpaca-albanian-cleaned](https://huggingface.co/datasets/saillab/alpaca-albanian-cleaned)
Provider:
klei1



