
ramachandrajoshi/english-kannada-cleaned

Hugging Face · Updated 2026-04-09 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/ramachandrajoshi/english-kannada-cleaned
Resource description:
---
license: apache-2.0
tags:
- machine-translation
- translation
- english
- kannada
language:
- en
- kn
---

# English–Kannada Cleaned

A cleaned parallel corpus of English–Kannada sentence pairs suitable for training and evaluating machine translation models.

- **Languages:** English -> Kannada
- **License:** Apache License 2.0

## Dataset statistics

- Train: 5,00,000 sentence pairs
- Validation: 1,000 sentence pairs
- Test: 1,000 sentence pairs
- Total: 5,02,000 sentence pairs

These counts exclude per-file CSV headers.

## Source and provenance

The dataset is provided as UTF-8 CSV files with two columns: `english_sentences` and `kannada_sentences`. The data appears to have been cleaned of common noise artifacts and consists of sentence-aligned pairs.

Directory layout:

- `train/`: 50 CSV files (`train_part_1.csv` ... `train_part_50.csv`), each with the header `english_sentences,kannada_sentences`.
- `validation/val.csv`: validation split with header.
- `test/test.csv`: test split with header.

Example row from `test/test.csv`:

| english_sentences | kannada_sentences |
|---|---|
| No one understood what was going on. | ಏನು ನಡೆಯುತ್ತಿದೆ ಎಂಬುದು ಯಾರಿಗೂ ಅರ್ಥವಾಗಲಿಲ್ಲ. |

## Recommended usage

You can load the dataset locally using the `datasets` library (it will read the CSV files directly):

```python
from datasets import load_dataset

# Map each split to its CSV file(s), relative to the dataset root.
data_files = {
    "train": "train/*.csv",
    "validation": "validation/val.csv",
    "test": "test/test.csv",
}
dataset = load_dataset("csv", data_files=data_files)

# access columns
print(dataset["train"][0])
```

## License

This dataset is released under the Apache License 2.0. See the `LICENSE` file for details.

## Citation

If you use this dataset, please cite it. A `CITATION.cff` is included with suggested metadata.

---

## Acknowledgements

- Thanks to the original CSV provider [damerajee/en-kannada](https://huggingface.co/datasets/damerajee/en-kannada) for sharing the parallel English–Kannada data used as a source for this cleaned dataset.
- Thanks to [NSP](https://www.nspglobaltech.com) and [AI4Bharat](https://ai4bharat.iitm.ac.in) for supporting the creation of this dataset and for providing access to the current best open-weight English→Kannada translation LLMs.

---
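The dataset statistics are stated as excluding per-file CSV headers, so a local copy can be sanity-checked by counting data rows per split. The following is a minimal sketch using only the Python standard library. The helper name `count_pairs` is hypothetical and not part of the dataset, and the paths assume the script runs from the dataset root with the directory layout described in the card.

```python
import csv
import glob

# Header documented in the dataset card for every CSV file.
EXPECTED_HEADER = ["english_sentences", "kannada_sentences"]


def count_pairs(pattern):
    """Count data rows across all CSV files matching `pattern`.

    Hypothetical helper for illustration: each file's header row is
    checked against the documented header and excluded from the count.
    """
    total = 0
    for path in sorted(glob.glob(pattern)):
        # utf-8-sig also accepts files that start with a byte-order mark.
        with open(path, newline="", encoding="utf-8-sig") as handle:
            reader = csv.reader(handle)
            header = next(reader, None)
            if header != EXPECTED_HEADER:
                raise ValueError(f"unexpected header in {path}: {header}")
            # Skip completely empty lines, which csv.reader yields as [].
            total += sum(1 for row in reader if row)
    return total


if __name__ == "__main__":
    # Assumed layout: run from the dataset root.
    splits = {
        "train": "train/*.csv",
        "validation": "validation/val.csv",
        "test": "test/test.csv",
    }
    for split, pattern in splits.items():
        print(split, count_pairs(pattern))
```

If a printed count differs from the published statistics, the local copy may be incomplete or a file may have been altered. A pattern that matches no files yields a count of 0 rather than an error.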
Provider:
ramachandrajoshi