
tahrirchi/dilmash

Hugging Face | Updated 2024-09-10 | Indexed 2025-04-12
Download link:
https://hf-mirror.com/datasets/tahrirchi/dilmash
Resource description:
---
dataset_info:
  features:
  - name: src_lang
    dtype: string
  - name: src_sent
    dtype: string
  - name: tgt_lang
    dtype: string
  - name: tgt_sent
    dtype: string
  splits:
  - name: kaa_eng
    num_bytes: 19047157
    num_examples: 100000
  - name: kaa_rus
    num_bytes: 27731049
    num_examples: 100000
  - name: kaa_uzb
    num_bytes: 30608474
    num_examples: 100000
  download_size: 46148914
  dataset_size: 77386680
configs:
- config_name: default
  data_files:
  - split: kaa_eng
    path: data/kaa_eng-*
  - split: kaa_rus
    path: data/kaa_rus-*
  - split: kaa_uzb
    path: data/kaa_uzb-*
language:
- en
- ru
- uz
- kaa
pretty_name: dilmash
size_categories:
- 100K<n<1M
license: mit
task_categories:
- translation
tags:
- dilmash
- karakalpak
---

# Dilmash: Karakalpak Parallel Corpus

This repository contains a parallel corpus for the Karakalpak language, developed as part of the research paper "Open Language Data Initiative: Advancing Low-Resource Machine Translation for Karakalpak".

## Dataset Description

The Karakalpak Parallel Corpus is a collection of 300,000 sentence pairs, designed to support machine translation tasks involving the Karakalpak language. It includes:

- Uzbek-Karakalpak (100,000 pairs)
- Russian-Karakalpak (100,000 pairs)
- English-Karakalpak (100,000 pairs)

## Usage

This dataset is intended for training and evaluating machine translation models involving the Karakalpak language.

To load the dataset, run this script:

```python
from datasets import load_dataset

# Loads all three splits: kaa_eng, kaa_rus, and kaa_uzb
dilmash_corpus = load_dataset("tahrirchi/dilmash")
```

## Dataset Structure

### Data Instances

- **Size of downloaded dataset files:** 46.1 MB
- **Size of the generated dataset:** 77.4 MB
- **Total amount of disk used:** 123.5 MB

An example of 'kaa_eng' looks as follows.
```
{
  'src_lang': 'kaa_Latn',
  'src_sent': 'Pedagogikalıq ideal balaǵa ıktıyatlılıq penen katnasta bolıw principine bárqulla, úlken hám kishi jumıslarda súyeniwdi talan etedi.',
  'tgt_lang': 'eng_Latn',
  'tgt_sent': 'The ideal of education demands that the principle of treating children with care be observed at all times, in both big and small matters.'
}
```

### Data Fields

The data fields are the same among all splits.

- `src_lang`: a `string` feature that contains the source language code.
- `src_sent`: a `string` feature that contains the sentence in the source language.
- `tgt_lang`: a `string` feature that contains the target language code.
- `tgt_sent`: a `string` feature that contains the sentence in the target language.

### Data Splits

| split_name | num_examples |
|------------|-------------:|
| kaa_eng    |       100000 |
| kaa_rus    |       100000 |
| kaa_uzb    |       100000 |

## Data Sources

The corpus comprises diverse parallel texts sourced from multiple domains:

- 23% of sentences from news sources
- 34% of sentences from books (novels, non-fiction)
- 24% of sentences from bilingual dictionaries
- 19% of sentences from school textbooks

Additionally, 4,000 English-Karakalpak pairs were sourced from the Gatitos Project ([Jones et al., 2023](https://aclanthology.org/2023.emnlp-main.26)).

## Data Preparation

The data mining process involved local mining techniques, ensuring that parallel sentences were extracted from translations of the same book, document, or article. Sentence alignment was performed using LaBSE (Language-agnostic BERT Sentence Embedding) embeddings.

For more information, please refer to [our paper](https://arxiv.org/abs/2409.04269).

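The idea of aligning sentences by embedding similarity within a single document pair can be illustrated with a minimal sketch. This is not the authors' pipeline: the greedy one-to-one matching, the `threshold` value, and the function names `cosine` and `align_within_document` are all assumptions made for illustration, and the toy 3-dimensional vectors merely stand in for real LaBSE sentence embeddings.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def align_within_document(
    src_vecs: Sequence[Vector],
    tgt_vecs: Sequence[Vector],
    threshold: float = 0.8,  # assumed cut-off, not a value taken from the paper
) -> List[Tuple[int, int, float]]:
    """Greedy one-to-one alignment of sentences from one document pair.

    Returns (source_index, target_index, score) triples sorted by source index.
    """
    # Score every source/target combination inside this single document pair.
    scored = [
        (cosine(u, v), i, j)
        for i, u in enumerate(src_vecs)
        for j, v in enumerate(tgt_vecs)
    ]
    # Visit candidate pairs from most to least similar.
    scored.sort(key=lambda item: item[0], reverse=True)

    used_src, used_tgt = set(), set()
    pairs = []
    for score, i, j in scored:
        if score < threshold:
            break  # every remaining candidate scores lower
        if i in used_src or j in used_tgt:
            continue  # each sentence may be matched at most once
        used_src.add(i)
        used_tgt.add(j)
        pairs.append((i, j, score))
    return sorted(pairs)


# Toy vectors standing in for real sentence embeddings (hypothetical data).
toy_src = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
toy_tgt = [[0.0, 0.9, 0.1], [1.0, 0.1, 0.0], [0.5, 0.5, 0.5]]

for src_idx, tgt_idx, score in align_within_document(toy_src, toy_tgt):
    print(f"source {src_idx} <-> target {tgt_idx} (cosine {score:.3f})")
```

In practice the vectors would come from a LaBSE encoder applied to the sentences of one book, document, or article and its translation. Restricting candidates to a single document pair is what keeps the mining "local".
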
## Citation

If you use this dataset in your research, please cite our paper:

```bibtex
@misc{mamasaidov2024openlanguagedatainitiative,
      title={Open Language Data Initiative: Advancing Low-Resource Machine Translation for Karakalpak},
      author={Mukhammadsaid Mamasaidov and Abror Shopulatov},
      year={2024},
      eprint={2409.04269},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2409.04269},
}
```

## Gratitude

We are thankful to these awesome organizations and people for helping to make it happen:

- [David Dalé](https://daviddale.ru): for advice throughout the process
- Perizad Najimova: for expertise and assistance with the Karakalpak language
- [Nurlan Pirjanov](https://www.linkedin.com/in/nurlan-pirjanov/): for expertise and assistance with the Karakalpak language
- [Atabek Murtazaev](https://www.linkedin.com/in/atabek/): for advice throughout the process
- Ajiniyaz Nurniyazov: for advice throughout the process

We would also like to express our sincere appreciation to [Google for Startups](https://cloud.google.com/startup) for generously sponsoring the compute resources necessary for our experiments. Their support has been instrumental in advancing our research in low-resource language machine translation.

## Contacts

We believe that this work will enable and inspire enthusiasts around the world to uncover the hidden beauty of low-resource languages, in particular Karakalpak.

For questions about the dataset or its further development, please contact m.mamasaidov@tahrirchi.uz or a.shopolatov@tahrirchi.uz.

Provider:
tahrirchi