ramachandrajoshi/english-kannada-cleaned

Name: ramachandrajoshi/english-kannada-cleaned
Creator: ramachandrajoshi
Published: 2026-04-09 09:09:01
License: Apache-2.0

Hugging Face · Updated 2026-04-09 · Indexed 2026-04-12

Download link:

https://hf-mirror.com/datasets/ramachandrajoshi/english-kannada-cleaned


Description:

---
license: apache-2.0
tags:
- machine-translation
- translation
- english
- kannada
language:
- en
- kn
---

# English–Kannada Cleaned

A cleaned parallel corpus of English–Kannada sentence pairs suitable for training and evaluating machine translation models.

- **Languages:** English -> Kannada
- **License:** Apache License 2.0

## Dataset statistics

- Train: 5,00,000 sentence pairs
- Validation: 1,000 sentence pairs
- Test: 1,000 sentence pairs
- Total: 5,02,000 sentence pairs

These counts exclude per-file CSV headers.

## Source and provenance

The dataset is provided as UTF-8 CSV files with two columns: `english_sentences` and `kannada_sentences`. The data has been cleaned of common noisy artifacts and consists of sentence-aligned pairs.

Directory layout:

- `train/`: 50 CSV files (`train_part_1.csv` ... `train_part_50.csv`), each with the header `english_sentences,kannada_sentences`.
- `validation/val.csv`: validation split, with header.
- `test/test.csv`: test split, with header.

Example row from `test/test.csv`:

| english_sentences | kannada_sentences |
|---|---|
| No one understood what was going on. | ಏನು ನಡೆಯುತ್ತಿದೆ ಎಂಬುದು ಯಾರಿಗೂ ಅರ್ಥವಾಗಲಿಲ್ಲ. |

## Recommended usage

From the dataset root directory, you can load the dataset locally using the `datasets` library (it reads the CSV files directly):

```python
from datasets import load_dataset

# Paths are relative to the dataset root directory.
data_files = {
    "train": "train/*.csv",
    "validation": "validation/val.csv",
    "test": "test/test.csv",
}

dataset = load_dataset("csv", data_files=data_files)

# Access columns: each row is a dict with both sentence fields.
print(dataset["train"][0])
```

## License

This dataset is released under the Apache License 2.0. See the `LICENSE` file for details.

## Citation

If you use this dataset, please cite it. A `CITATION.cff` is included with suggested metadata.

---

## Acknowledgements

- Thanks to the original CSV provider [damerajee/en-kannada](https://huggingface.co/datasets/damerajee/en-kannada) for sharing the parallel English–Kannada data used as a source for this cleaned dataset.
- Thanks to [NSP](https://www.nspglobaltech.com) and [AI4Bharat](https://ai4bharat.iitm.ac.in) for supporting the creation of this dataset and for providing access to the current best open-weight English→Kannada translation LLMs.

---
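
The split sizes listed under "Dataset statistics" are stated to exclude per-file CSV headers. A minimal sketch for checking a local copy against those figures follows. The helper name `count_pairs` is hypothetical and not part of the dataset; it assumes the directory layout and the two-column header described above, and it uses only the Python standard library.

```python
import csv
import glob

# Column names as documented in the dataset card.
EXPECTED_HEADER = ["english_sentences", "kannada_sentences"]


def count_pairs(pattern):
    """Count data rows (headers excluded) across all CSV files matching pattern."""
    total = 0
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle)
            # Fail early if a file lacks the two documented columns.
            if reader.fieldnames != EXPECTED_HEADER:
                raise ValueError(f"unexpected header in {path}: {reader.fieldnames}")
            # DictReader consumes the header itself, so only data rows remain.
            total += sum(1 for _ in reader)
    return total


if __name__ == "__main__":
    # Glob patterns mirror the `data_files` mapping under "Recommended usage".
    splits = {
        "train": "train/*.csv",
        "validation": "validation/val.csv",
        "test": "test/test.csv",
    }
    for split, pattern in splits.items():
        print(split, count_pairs(pattern))
```

Run from the dataset root, this prints one count per split. The `csv` module is used instead of counting raw lines so that quoted fields containing commas or line breaks are still counted as a single sentence pair.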

Provider: ramachandrajoshi
