dimakarp1996/YaTURK-7lang
Source: Hugging Face. Updated 2026-03-24; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/dimakarp1996/YaTURK-7lang
Resource description:
---
language:
- ru
- en
- tt
- ba
- kk
- ky
- cv
size_categories:
- 1M<n<10M
task_categories:
- translation
- text-generation
pretty_name: yaturk-7lang
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: some_translations_are_identical
    path: data/some_translations_are_identical-*
dataset_info:
  features:
  - name: 'Unnamed: 0'
    dtype: int64
  - name: russian
    dtype: large_string
  - name: bashkir
    dtype: large_string
  - name: kazakh
    dtype: large_string
  - name: tatar
    dtype: large_string
  - name: kyrgyz
    dtype: large_string
  - name: chuvash
    dtype: large_string
  - name: english
    dtype: large_string
  - name: pair
    dtype: large_string
  - name: source
    dtype: large_string
  - name: set_from_source
    dtype: large_string
  - name: train_on
    dtype: int64
  - name: only_index1
    dtype: int64
  splits:
  - name: train
    num_bytes: 7390628078
    num_examples: 6614051
  - name: some_translations_are_identical
    num_bytes: 95279488
    num_examples: 170861
  download_size: 3713176743
  dataset_size: 7485907566
---
# YaTURK-7lang
This dataset was used in the research paper [No One-Size-Fits-All: Building Systems For Translation to Bashkir, Kazakh, Kyrgyz, Tatar and Chuvash Using Synthetic And Original Data](https://huggingface.co/papers/2602.04442).
The dataset was also used for the online competition series "Machine Translation for Low-Resource Turkic Languages". It was generated via Yandex.Translate.
## 📢 News – March 5, 2026
**The dataset has been updated!**
- An additional **186729 sentence pairs** were added to the dataset. They were obtained from [chuv.cap.ru/news](https://chuv.cap.ru/news) and translated from Chuvash into all the other languages in the dataset. Many thanks to Nikolay Ivanov!
- Of these pairs, 184286 were added to the `train` split and 2443 to the `some_translations_are_identical` split.
- Duplicates were also removed: 118206 pairs from the `train` split and 8979 pairs from the `some_translations_are_identical` split. Only fully identical rows were counted as duplicates.
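
The figures in this update are internally consistent, and the exact-duplicate rule can be sketched in a few lines. The snippet below is a minimal illustration, not the author's script: `drop_exact_duplicates` is a hypothetical helper, the toy rows are invented, and the source does not say whether the `Unnamed: 0` index column took part in the comparison, so that choice is exposed as a parameter.

```python
# Sanity check on the counts reported in the March 5, 2026 update.
added_train, added_identical = 184286, 2443
assert added_train + added_identical == 186729


def drop_exact_duplicates(rows, ignore=()):
    """Keep the first occurrence of each row (rows are dicts).

    `ignore` lists columns left out of the comparison. Whether the index
    column was compared in the real cleanup is an assumption either way.
    """
    seen = set()
    unique = []
    for row in rows:
        # Sort by column name so the key does not depend on dict order.
        key = tuple(sorted((k, v) for k, v in row.items() if k not in ignore))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Toy rows, invented for illustration only.
rows = [
    {"Unnamed: 0": 0, "russian": "Привет", "english": "Hello"},
    {"Unnamed: 0": 1, "russian": "Привет", "english": "Hello"},
    {"Unnamed: 0": 2, "russian": "Привет", "english": "Hi"},
]
deduped = drop_exact_duplicates(rows, ignore=("Unnamed: 0",))
print(len(deduped))  # 2
```

With the index column included in the comparison, none of the toy rows would count as duplicates, which is why the parameter matters.
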
## 📢 News – March 3, 2026
**The dataset has been updated!**
- All translations are now available for **every language pair** where they were missing at the time of original creation.
- These missing translations were obtained from the Chuvash side using **Yandex.Translate**, following the same methodology as described in the paper.
- Additionally, approximately **177k sentence pairs** where the data overlapped in at least two language pairs have been moved to a separate split named **`some_translations_are_identical`**.
- The main split (default) is now called **`train`**, and it contains the remaining data.
---
## Dataset Structure
Two splits are available:
| Split name | Description |
|------------------------------------|----------------------------------------------------------------------------------------------------------|
| `train` | The default split, containing 6614051 sentence pairs (after removing overlapping pairs). |
| `some_translations_are_identical` | 170861 sentence pairs where at least two language versions are identical (e.g., due to transliteration or borrowing). |
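
How a row qualifies for `some_translations_are_identical` is described only informally, so the following is a hedged sketch of one plausible check: a row is flagged when any two of its seven language columns hold exactly the same non-empty string. The function name and the whitespace stripping are assumptions, not the author's procedure, and the toy rows are invented.

```python
from itertools import combinations

LANGUAGE_COLUMNS = ["russian", "english", "tatar", "bashkir",
                    "kazakh", "kyrgyz", "chuvash"]


def has_identical_translations(row, columns=LANGUAGE_COLUMNS):
    """Return True if at least two language columns contain the same text."""
    texts = [(row.get(col) or "").strip() for col in columns]
    # Empty cells are never treated as a match.
    return any(a and a == b for a, b in combinations(texts, 2))


# Toy rows, invented for illustration only.
borrowed = dict.fromkeys(LANGUAGE_COLUMNS, "")
borrowed.update(russian="Интернет", tatar="Интернет", english="Internet")
distinct = {"russian": "Спасибо", "tatar": "Рәхмәт", "english": "Thank you"}

print(has_identical_translations(borrowed))  # True
print(has_identical_translations(distinct))  # False
```
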
---
## Column Description
- **russian, english, tatar, bashkir, kazakh, kyrgyz, chuvash**: phrases in the corresponding languages.
- **source**: Hugging Face dataset name (or URL) from which the phrase was taken.
- **set_from_source**: the specific subset/split within the source dataset.
- **train_on**: `1` if this sample was used for fine‑tuning in the original work, otherwise `0`.
- **pair**: language pair of the original dataset (or the only language if the source was monolingual).
- **only_index1**:
  - `1` → this sample was **not** used in the final English–Chuvash data index (i.e., it was only used for other language pairs).
  - `0` → it **was** used in the final English–Chuvash index.
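
As an illustration of how these columns might be combined, the sketch below keeps rows flagged with `train_on == 1` and turns them into source/target pairs for one translation direction. `to_translation_pairs` is a hypothetical helper and the toy rows are invented; it is not the preprocessing used in the paper.

```python
def to_translation_pairs(rows, src="russian", tgt="tatar", only_train_on=True):
    """Yield {'src': ..., 'tgt': ...} dicts for rows that have both texts."""
    for row in rows:
        if only_train_on and row.get("train_on") != 1:
            continue
        source_text, target_text = row.get(src), row.get(tgt)
        if source_text and target_text:
            yield {"src": source_text, "tgt": target_text}


# Toy rows, invented for illustration only.
rows = [
    {"russian": "Спасибо", "tatar": "Рәхмәт", "train_on": 1},
    {"russian": "Привет", "tatar": "Сәлам", "train_on": 0},
]
pairs = list(to_translation_pairs(rows))
print(pairs)
```

Because the helper only iterates over dict-like rows, it should also work on a split loaded with the `datasets` library, where the same `train_on` predicate could alternatively be passed to `Dataset.filter`.
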
## How to load the dataset
```python
from datasets import load_dataset
dataset = load_dataset(
    "dimakarp1996/YaTURK-7lang",
    revision="v3.0",  # or "v2.0" / "v1.0" if you want a previous version
)
```
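
The card metadata lists a download size of 3713176743 bytes (about 3.7 GB), so it may be convenient to load a single split in streaming mode rather than downloading everything. The wrapper below is a sketch under that assumption: `load_yaturk_split` and `KNOWN_SPLITS` are hypothetical names, while `split`, `revision`, and `streaming` are standard `load_dataset` parameters.

```python
KNOWN_SPLITS = ("train", "some_translations_are_identical")


def load_yaturk_split(split="train", revision="v3.0", streaming=True):
    """Load one split of YaTURK-7lang, streaming by default."""
    if split not in KNOWN_SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {KNOWN_SPLITS}")
    # Imported here so the helper can be defined without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        "dimakarp1996/YaTURK-7lang",
        split=split,
        revision=revision,
        streaming=streaming,
    )


# Example (needs network access): preview three rows of the smaller split.
# for row in load_yaturk_split("some_translations_are_identical").take(3):
#     print(row["russian"], "->", row["chuvash"])
```
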
## Notes on Data Availability
- Since the March 3, 2026 update (revision `v2.0`), **all language pairs in the dataset are complete – no missing translations remain**.
- Before the March 3, 2026 update, for samples with `only_index1=1`, translations to **all seven languages** were available. For samples with `only_index1=0`, only English, Tatar, and Chuvash translations were guaranteed to exist (Russian also existed if it originated from Alex Antonov's Russian–Chuvash corpus). These data are still available as revision `v1.0`.
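
For anyone working with revision `v1.0`, where some translations are absent, a quick completeness check may be useful. The sketch below counts empty or missing cells per language column. The helper name and the toy rows are invented for illustration, and how missing values are encoded in `v1.0` (`None` versus an empty string) is an assumption, so both are treated as missing.

```python
from collections import Counter

LANGUAGE_COLUMNS = ["russian", "english", "tatar", "bashkir",
                    "kazakh", "kyrgyz", "chuvash"]


def count_missing(rows, columns=LANGUAGE_COLUMNS):
    """Count rows whose value is None or blank, per language column."""
    missing = Counter()
    for row in rows:
        for col in columns:
            value = row.get(col)
            if value is None or not str(value).strip():
                missing[col] += 1
    return missing


# Toy rows, invented for illustration only.
rows = [
    {"russian": "Спасибо", "english": "Thank you", "tatar": "Рәхмәт",
     "bashkir": None, "kazakh": None, "kyrgyz": None, "chuvash": "Тавтапуҫ"},
    {"russian": "", "english": "Hello", "tatar": "Сәлам",
     "bashkir": None, "kazakh": None, "kyrgyz": None, "chuvash": "Салам"},
]
missing = count_missing(rows)
print(dict(missing))
```

On revision `v2.0` or later, the statement above implies every count should be zero.
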
---
If you use this dataset, please cite the original paper:
```bibtex
@misc{NoOneSizeFitsAll2026,
  title={No One-Size-Fits-All: Building Systems For Translation to Bashkir, Kazakh, Kyrgyz, Tatar and Chuvash Using Synthetic And Original Data},
  author={Dmitry Karpov},
  year={2026},
  eprint={},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
Provider:
dimakarp1996



