dimakarp1996/YaTURK-7lang

Name: dimakarp1996/YaTURK-7lang
Creator: dimakarp1996
Published: 2026-03-24 12:21:23
License: No description available

Hugging Face | Updated 2026-03-24 | Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/dimakarp1996/YaTURK-7lang

Description:

---
language:
- ru
- en
- tt
- ba
- kk
- ky
- cv
size_categories:
- 1M<n<10M
task_categories:
- translation
- text-generation
pretty_name: yaturk-7lang
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: some_translations_are_identical
    path: data/some_translations_are_identical-*
dataset_info:
  features:
  - name: 'Unnamed: 0'
    dtype: int64
  - name: russian
    dtype: large_string
  - name: bashkir
    dtype: large_string
  - name: kazakh
    dtype: large_string
  - name: tatar
    dtype: large_string
  - name: kyrgyz
    dtype: large_string
  - name: chuvash
    dtype: large_string
  - name: english
    dtype: large_string
  - name: pair
    dtype: large_string
  - name: source
    dtype: large_string
  - name: set_from_source
    dtype: large_string
  - name: train_on
    dtype: int64
  - name: only_index1
    dtype: int64
  splits:
  - name: train
    num_bytes: 7390628078
    num_examples: 6614051
  - name: some_translations_are_identical
    num_bytes: 95279488
    num_examples: 170861
  download_size: 3713176743
  dataset_size: 7485907566
---

# YaTURK-7lang

This dataset was used in the research paper [No One-Size-Fits-All: Building Systems For Translation to Bashkir, Kazakh, Kyrgyz, Tatar and Chuvash Using Synthetic And Original Data](https://huggingface.co/papers/2602.04442).

It was also used for the online competition series "Machine Translation for Low-Resource Turkic Languages".

This dataset was generated via Yandex.Translate.

## 📢 News – March 5, 2026

**The dataset has been updated!**

- An additional **186,729 sentence pairs** were added to the dataset. These pairs were obtained from [chuv.cap.ru/news](https://chuv.cap.ru/news) and translated from Chuvash into all the other languages in the dataset. Thanks a lot to Nikolay Ivanov!
- Of these pairs, 184,286 were added to the `train` split and 2,443 to the `some_translations_are_identical` split.
- Duplicates were also removed from the dataset: 118,206 pairs from the `train` split and 8,979 pairs from the `some_translations_are_identical` split.
  Only fully identical rows were counted as duplicates.

## 📢 News – March 3, 2026

**The dataset has been updated!**

- All translations are now available for **every language pair** where they were missing at the time of original creation.
- These missing translations were obtained from the Chuvash side using **Yandex.Translate**, following the same methodology as described in the paper.
- Additionally, approximately **177k sentence pairs** where the data overlapped in at least two language pairs have been moved to a separate split named **`some_translations_are_identical`**.
- The main split (default) is now called **`train`**, and it contains the remaining data.

---

## Dataset Structure

Two splits are available:

| Split name | Description |
|------------|-------------|
| `train` | The default split, containing 6,614,051 sentence pairs (after removing overlapping pairs). |
| `some_translations_are_identical` | 170,861 sentence pairs where at least two language versions are identical (e.g., due to transliteration or borrowing). |

---

## Column Description

- **russian, english, tatar, bashkir, kazakh, kyrgyz, chuvash**: phrases in the corresponding languages.
- **source**: Hugging Face dataset name (or URL) from which the phrase was taken.
- **set_from_source**: the specific subset/split within the source dataset.
- **train_on**: `1` if this sample was used for fine‑tuning in the original work, otherwise `0`.
- **pair**: language pair of the original dataset (or the only language if the source was monolingual).
- **only_index1**:
  - `1` → this sample was **not** used in the final English–Chuvash data index (i.e., it was only used for other language pairs).
  - `0` → it **was** used in the final English–Chuvash index.

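As an illustration of how these columns might be combined, the sketch below selects text pairs for one translation direction from rows shaped like the dataset's records, and flags rows in which two or more language columns hold the same text. The helper names (`iter_pairs`, `has_identical_translations`), the sample rows, and the exact-string-match criterion are assumptions made for this example; the criterion actually used to build the `some_translations_are_identical` split may differ.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Language columns as listed in the column description.
LANGUAGE_COLUMNS = (
    "russian", "english", "tatar", "bashkir", "kazakh", "kyrgyz", "chuvash",
)


def iter_pairs(
    rows: Iterable[Dict],
    src: str = "russian",
    tgt: str = "chuvash",
    fine_tuning_only: bool = True,
) -> Iterator[Tuple[str, str]]:
    """Yield (src, tgt) text pairs, optionally keeping only rows with train_on == 1."""
    for row in rows:
        if fine_tuning_only and row.get("train_on") != 1:
            continue
        source_text, target_text = row.get(src), row.get(tgt)
        # Skip rows where either side is missing or empty.
        if source_text and target_text:
            yield source_text, target_text


def has_identical_translations(row: Dict) -> bool:
    """Return True if at least two non-empty language columns hold the same string."""
    texts = [row[col] for col in LANGUAGE_COLUMNS if row.get(col)]
    return len(set(texts)) < len(texts)


# Made-up placeholder rows, not taken from the dataset.
sample_rows = [
    {"russian": "ru-1", "chuvash": "cv-1", "english": "en-1", "train_on": 1},
    {"russian": "ru-2", "chuvash": "cv-2", "english": "en-2", "train_on": 0},
    {"russian": "same", "chuvash": "same", "english": "en-3", "train_on": 1},
]

pairs = list(iter_pairs(sample_rows))
flags = [has_identical_translations(row) for row in sample_rows]
```

Rows yielded when iterating a split loaded with the `datasets` library are dictionaries keyed by column name, so the same helpers could in principle be applied to them directly.
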
## How to load the dataset

```python
from datasets import load_dataset

dataset = load_dataset(
    "dimakarp1996/YaTURK-7lang",
    revision="v3.0"  # or 'v2.0' or 'v1.0' if you want a previous version
)
```

## Notes on Data Availability

- Since the March 3, 2026 update (revision `v2.0`), **all language pairs in the dataset are complete – no missing translations remain**.
- Before the March 3, 2026 update, for samples with `only_index1=1`, translations to **all seven languages** were available. For samples with `only_index1=0`, only English, Tatar, and Chuvash translations were guaranteed to exist (Russian also existed if it originated from Alex Antonov's Russian–Chuvash corpus). These data are still available as revision `v1.0`.

---

If you use this dataset, please cite the original paper:

```bibtex
@misc{NoOneSizeFitsAll2026,
  title={No One-Size-Fits-All: Building Systems For Translation to Bashkir, Kazakh, Kyrgyz, Tatar and Chuvash Using Synthetic And Original Data},
  author={Dmitry Karpov},
  year={2026},
  eprint={},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

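The availability notes above indicate that rows in revision `v1.0` may lack some translations. A minimal sketch of a completeness check follows. It assumes that a missing translation appears as `None` or as an empty string, which the dataset card does not specify; the helper name `missing_languages` and the sample row are hypothetical.

```python
from typing import Dict, List

# Language columns as listed in the column description.
LANGUAGE_COLUMNS = (
    "russian", "english", "tatar", "bashkir", "kazakh", "kyrgyz", "chuvash",
)


def missing_languages(row: Dict) -> List[str]:
    """List the language columns that are absent, None, or blank in a row."""
    missing = []
    for col in LANGUAGE_COLUMNS:
        value = row.get(col)
        # Assumption: None or whitespace-only text means "no translation".
        if value is None or not str(value).strip():
            missing.append(col)
    return missing


# Made-up placeholder row, not taken from the dataset.
sample_row = {
    "english": "en-1",
    "tatar": "tt-1",
    "chuvash": "cv-1",
    "russian": None,
    "bashkir": "",
}

gaps = missing_languages(sample_row)
```

Applied to rows from revision `v2.0` or later, a check like this would be expected to return an empty list for every row if the completeness claim holds.
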

Provided by:

dimakarp1996
