Novokshanov/ru-ky-synthetic-loresmt2026
Source: Hugging Face. Updated: 2026-01-28. Indexed: 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Novokshanov/ru-ky-synthetic-loresmt2026
Resource description:
---
dataset_info:
  features:
  - name: ky
    dtype: string
  - name: ru
    dtype: string
  - name: source
    dtype: string
  splits:
  - name: train
    num_bytes: 6079345347
    num_examples: 1623456
  - name: validation
    num_bytes: 6213898
    num_examples: 1951
  download_size: 3073031653
  dataset_size: 6085559245
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
license: cc-by-nc-4.0
task_categories:
- translation
language:
- ky
- ru
tags:
- synthetic
- low-resource
source_datasets:
- HuggingFaceFW/fineweb-2
- SiberiaSoft/SiberianPersonaChat
size_categories:
- 1M<n<10M
---
# Russian-Kyrgyz Synthetic Parallel Corpus (LoResMT 2026)
A synthetic parallel corpus for Russian-Kyrgyz machine translation, created for the LoResMT 2026 Turkic Languages Translation Challenge.
## Dataset Description
This dataset contains synthetic Russian-Kyrgyz parallel sentences generated by translating:
1. **FineWeb-2 Kyrgyz data** → Russian (back-translation using Gemma3-27B and Qwen3-235B)
2. **SiberianPersonaChat dialogues** → Kyrgyz (translation using GPT-4o)
All data was filtered for quality using a combination of heuristics, SONAR scores, and MADLAD-400 loss-based filtering. See our paper for full details: tbd (LoResMT @ EACL 2026).
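The sketch below illustrates the general shape of such a filter. It is not the pipeline used to build this dataset. Everything in it is an assumption made for illustration: the function names, the length-ratio heuristic, the idea of passing in a precomputed `similarity` score, and both thresholds. The actual heuristics and cut-offs are described in the paper.

```python
def length_ratio_ok(ru: str, ky: str, max_ratio: float = 2.0) -> bool:
    """Hypothetical heuristic: reject pairs whose lengths differ too much."""
    ru, ky = ru.strip(), ky.strip()
    if not ru or not ky:
        # An empty side can never be a valid translation pair.
        return False
    longer = max(len(ru), len(ky))
    shorter = min(len(ru), len(ky))
    return longer / shorter <= max_ratio


def keep_pair(
    ru: str,
    ky: str,
    similarity: float,
    min_similarity: float = 0.7,  # assumed threshold, not from the paper
    max_ratio: float = 2.0,       # assumed threshold, not from the paper
) -> bool:
    """Keep a pair only if it passes the length check and the score threshold.

    `similarity` stands in for a precomputed cross-lingual score; how that
    score is produced is outside the scope of this sketch.
    """
    return length_ratio_ok(ru, ky, max_ratio) and similarity >= min_similarity
```

A pair is kept only when both checks pass, so either signal alone is enough to discard it.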
## Data Fields
```python
{
    "ru": str,      # Russian text
    "ky": str,      # Kyrgyz text
    "source": str,  # Original data source
}
```
## Usage
```python
from datasets import load_dataset

# Loads the default config with its train and validation splits
dataset = load_dataset("Novokshanov/ru-ky-synthetic-loresmt2026")
```
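Because the train split is roughly 6 GB, it can be convenient to stream a split instead of downloading everything first. The following is a minimal sketch. The helper names `stream_split`, `count_by_source`, and `to_pair` are hypothetical and not part of this dataset or of the `datasets` library; only `load_dataset` and its `split` and `streaming` arguments come from the library.

```python
from collections import Counter
from typing import Iterable, Mapping, Tuple

REPO_ID = "Novokshanov/ru-ky-synthetic-loresmt2026"


def stream_split(split: str = "validation"):
    """Return an iterable over one split without downloading it in full."""
    from datasets import load_dataset  # imported lazily; requires network access

    return load_dataset(REPO_ID, split=split, streaming=True)


def count_by_source(rows: Iterable[Mapping[str, str]]) -> Counter:
    """Count examples per value of the `source` field."""
    return Counter(row["source"] for row in rows)


def to_pair(row: Mapping[str, str], src: str = "ru", tgt: str = "ky") -> Tuple[str, str]:
    """Turn one row into a (source sentence, target sentence) tuple."""
    return row[src], row[tgt]
```

For example, `count_by_source(stream_split("validation"))` would tally the validation examples by origin, and `to_pair(row, src="ky", tgt="ru")` reverses the translation direction.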
## Citation
```bibtex
tbd
```
## License
CC-BY-NC-4.0 (Creative Commons Attribution-NonCommercial 4.0)
Provider:
Novokshanov



