nickoo004/kaa-parallel-corpus
Source: Hugging Face. Updated 2026-04-17; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/nickoo004/kaa-parallel-corpus
Dataset description:
---
language:
- kaa
- en
license: apache-2.0
task_categories:
- translation
- text-generation
language_creators:
- crowdsourced
- found
multilinguality:
- translation
pretty_name: Karakalpak-English Parallel Corpus
tags:
- karakalpak
- qaraqalpaq
- central-asian
- parallel-corpus
- turkic-languages
- kara-kalpak
configs:
- config_name: kaa_Cyrl
  data_files: "kaa_Cyrl/train-*.parquet"
- config_name: kaa_Latn
  data_files: "kaa_Latn/train-*.parquet"
---
# Kaa Karakalpak-English Parallel Corpus (FineTranslations)
## 📌 Overview
This repository contains a curated parallel corpus that pairs **Karakalpak (kaa)** with **English (en)**. Karakalpak is a low-resource Turkic language spoken primarily in the Republic of Karakalpakstan, an autonomous republic within Uzbekistan.
This dataset is a subset extracted from the much larger **[HuggingFaceFW/finetranslations](https://huggingface.co/datasets/HuggingFaceFW/finetranslations)** dataset. The goal of this repo is to give researchers and developers working specifically on Karakalpak NLP a dedicated, easy-to-access resource.
## 🗂️ Dataset Structure
The corpus is organized into two distinct configurations based on the writing systems used in Karakalpakstan:
| Subset | Script | Count | Description |
|---|---|---|---|
| `kaa_Cyrl` | **Cyrillic** | 10,880 rows | Cyrillic script, still widely used in many formal and academic contexts. |
| `kaa_Latn` | **Latin** | 3,181 rows | The modern script increasingly used in education and digital media. |
### Column Descriptions:
- **`id`**: Unique identifier for the pair.
- **`translated_text`**: The **English** translation, generated by translation models (see Data Origin & Quality).
- **`og_full_text`**: The original **Karakalpak** text (the target side when training en→kaa models).
- **`og_language`**: The specific script tag (`kaa_Cyrl` or `kaa_Latn`).
- **`og_quality_score`**: Quality metric from the base dataset.
- **`edu_score`**: Educational value score (higher means better content quality).
- **`url`**: Source web address.
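To see which sites contribute the most rows, the `url` column can be grouped by domain. The following is a minimal sketch that works on plain dictionaries using the column names listed above; the helper name `count_source_domains` and the sample rows are hypothetical and only illustrate the idea.

```python
from collections import Counter
from urllib.parse import urlparse


def count_source_domains(rows):
    """Count rows per source domain, based on the `url` column."""
    counts = Counter()
    for row in rows:
        # urlparse().netloc is the host part of the address, e.g. "example.org"
        domain = urlparse(row["url"]).netloc.lower()
        counts[domain or "<unknown>"] += 1
    return counts


# Hypothetical rows that mimic the documented columns (not real data).
sample_rows = [
    {"id": "a", "og_language": "kaa_Cyrl", "url": "https://example.org/page1"},
    {"id": "b", "og_language": "kaa_Cyrl", "url": "https://example.org/page2"},
    {"id": "c", "og_language": "kaa_Latn", "url": "https://Example.com/x"},
]

domain_counts = count_source_domains(sample_rows)
print(domain_counts.most_common())
```

Because iterating over a loaded `datasets` split yields one dictionary per row, the same function should also accept a split loaded with `load_dataset`.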
---
## 🚀 Usage
You can load the dataset using the Hugging Face `datasets` library:
### 1. Load Cyrillic Data
```python
from datasets import load_dataset
dataset = load_dataset("nickoo004/kaa-parallel-corpus", "kaa_Cyrl", split="train")
print(dataset[0])
```
### 2. Load Latin Data
```python
from datasets import load_dataset
dataset_latn = load_dataset("nickoo004/kaa-parallel-corpus", "kaa_Latn", split="train")
```
### 3. Quick Format for Training (MT)
If you need a simple `en-kaa` format for training models like NLLB or T5:
```python
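def format_data(example):
    # Rename the dataset columns to a simple English/Karakalpak pair.
    return {
        "english": example["translated_text"],
        "karakalpak": example["og_full_text"],
    }

# `dataset` is the Cyrillic split loaded in step 1.
clean_ds = dataset.map(format_data, remove_columns=dataset.column_names)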
```
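The examples above only load a `train` split, so you may want to hold out some pairs for validation. One possible approach, sketched here under the assumption that `id` values are stable, is to hash each `id` into a bucket so the same row always lands in the same split. The function name `assign_split`, the 5% default, and the example ids are hypothetical.

```python
import hashlib


def assign_split(pair_id, validation_percent=5):
    """Map an `id` to "train" or "validation" deterministically."""
    digest = hashlib.sha256(str(pair_id).encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
    return "validation" if bucket < validation_percent else "train"


# Hypothetical ids; real ids come from the `id` column.
example_ids = [f"doc-{i}" for i in range(1000)]
splits = [assign_split(i) for i in example_ids]
print(splits.count("validation"), "of", len(splits), "ids held out")
```

Hashing keeps the assignment reproducible across runs and machines without storing a random seed. If a seeded random split is enough, `dataset.train_test_split(test_size=0.05, seed=42)` from the `datasets` library is a simpler alternative.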
---
## 🛠️ Data Origin & Quality
The data was collected and processed by the **Hugging Face FineData Team** as part of the FineTranslations effort. The pipeline included:
1. **Crawling:** Extracting Karakalpak text from diverse web sources (Wikipedia, government sites, news).
2. **Filtering:** Removing low-quality or non-Karakalpak content using advanced language identification.
3. **Translation:** Generating English translations using state-of-the-art translation models.
4. **Scoring:** Annotating rows with educational and quality scores to allow for better training data selection.
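The scores from step 4 can drive training data selection. Since the score ranges are not documented here, the sketch below avoids fixed thresholds and keeps the highest-scoring fraction of rows instead. The helper `select_top_fraction`, the 50% fraction, and the sample scores are all hypothetical.

```python
def select_top_fraction(rows, score_key="edu_score", fraction=0.5):
    """Keep the highest-scoring fraction of rows (at least one row)."""
    if not rows:
        return []
    ranked = sorted(rows, key=lambda row: row[score_key], reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]


# Hypothetical rows and scores, not taken from the corpus.
sample_rows = [
    {"id": "a", "edu_score": 0.2},
    {"id": "b", "edu_score": 0.9},
    {"id": "c", "edu_score": 0.5},
    {"id": "d", "edu_score": 0.7},
]

best = select_top_fraction(sample_rows, fraction=0.5)
print([row["id"] for row in best])  # → ['b', 'd']
```

Passing `score_key="og_quality_score"` would rank by the other score column in the same way.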
## 📜 License & Citation
This dataset is released under the **Apache-2.0 License**.
If you use this dataset, please credit the original source:
```bibtex
@software{finetranslations2024,
  author = {Hugging Face FineData Team},
  title = {FineTranslations: A Large-Scale High-Quality Parallel Corpus},
  year = {2024},
  publisher = {Hugging Face},
  journal = {Hugging Face Repository},
  howpublished = {\url{https://huggingface.co/datasets/HuggingFaceFW/finetranslations}}
}
```
---
**Maintained by:** nickoo004
**Contact:** nursultankoshekbaev477@gmail.com
Provider:
nickoo004



