ramachandrajoshi/english-kannada-cleaned
Hugging Face | Updated 2026-04-09 | Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/ramachandrajoshi/english-kannada-cleaned
Resource description:
---
license: apache-2.0
tags:
- machine-translation
- translation
- english
- kannada
language:
- en
- kn
---
# English–Kannada Cleaned
A cleaned parallel corpus of English–Kannada sentence pairs suitable for training and evaluating machine translation models.
- **Languages:** English -> Kannada
- **License:** Apache License 2.0
## Dataset statistics
- Train: 5,00,000 sentence pairs
- Validation: 1,000 sentence pairs
- Test: 1,000 sentence pairs
- Total: 5,02,000 sentence pairs
These counts exclude per-file CSV headers.
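The split sizes can be checked against a local copy of the files. The following is a minimal sketch using only the standard library; the helper names `count_pairs` and `count_split` are illustrative and not part of the dataset, and the sketch assumes the files have already been downloaded into a local directory.

```python
import csv
from pathlib import Path


def count_pairs(csv_path):
    """Count data rows in one CSV file, excluding its header row."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header line
        # csv.reader yields an empty list for blank lines; ignore those.
        return sum(1 for row in reader if row)


def count_split(root, pattern):
    """Sum data rows over every file under `root` that matches `pattern`."""
    return sum(count_pairs(path) for path in sorted(Path(root).glob(pattern)))
```

For example, `count_split(".", "train/*.csv")` sums the rows of all training parts, while `count_split(".", "test/test.csv")` covers the single test file. Because `csv.reader` parses quoted fields, a sentence containing a comma or a line break is still counted as one pair.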
## Source and provenance
The dataset is provided as UTF-8 CSV files with two columns: `english_sentences` and `kannada_sentences`. The data appears to have been cleaned of common noise artifacts, and each row is a sentence-aligned pair.
Directory layout:
- `train/` — 50 CSV files (`train_part_1.csv` ... `train_part_50.csv`), each with header `english_sentences,kannada_sentences`.
- `validation/val.csv` — validation split with header.
- `test/test.csv` — test split with header.
Example row from `test/test.csv`:
| english_sentences | kannada_sentences |
|---|---|
| No one understood what was going on. | ಏನು ನಡೆಯುತ್ತಿದೆ ಎಂಬುದು ಯಾರಿಗೂ ಅರ್ಥವಾಗಲಿಲ್ಲ. |
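To read pairs without any third-party dependency, the standard `csv` module is enough. The sketch below assumes the two-column header described above; `read_pairs` and `EXPECTED_COLUMNS` are hypothetical names introduced for illustration.

```python
import csv

# Column names as documented for every CSV file in this dataset.
EXPECTED_COLUMNS = ["english_sentences", "kannada_sentences"]


def read_pairs(csv_path):
    """Yield (english, kannada) tuples from one CSV file of the dataset."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        # Fail early if the file does not carry the documented header.
        if reader.fieldnames != EXPECTED_COLUMNS:
            raise ValueError(
                f"unexpected header in {csv_path}: {reader.fieldnames}"
            )
        for row in reader:
            yield row["english_sentences"], row["kannada_sentences"]
```

Since `read_pairs` is a generator, the header check runs when iteration starts, for instance on `next(read_pairs("test/test.csv"))`.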
## Recommended usage
You can load the dataset locally using the `datasets` library (it will read the CSV files directly):
```python
from datasets import load_dataset
# Map each split name to its CSV file(s), relative to the dataset root.
data_files = {
    "train": "train/*.csv",
    "validation": "validation/val.csv",
    "test": "test/test.csv",
}
dataset = load_dataset("csv", data_files=data_files)

# Inspect the first training pair.
print(dataset["train"][0])
```
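Some translation training scripts expect each example as a nested `translation` field keyed by language code rather than two flat columns. The sketch below shows one possible conversion; the function names are hypothetical, and the `en`/`kn` keys are an assumption taken from the language codes in the card metadata.

```python
def to_translation_example(row):
    """Convert one flat CSV row into a nested {"translation": {...}} record."""
    return {
        "translation": {
            # Empty CSV cells may load as None, so fall back to "".
            "en": (row["english_sentences"] or "").strip(),
            "kn": (row["kannada_sentences"] or "").strip(),
        }
    }


def has_both_sides(example):
    """Return True only when both sides of the pair are non-empty."""
    pair = example["translation"]
    return bool(pair["en"]) and bool(pair["kn"])
```

With the `dataset` object loaded above, these could be applied as `dataset.map(to_translation_example, remove_columns=["english_sentences", "kannada_sentences"])` followed by `.filter(has_both_sides)`.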
## License
This dataset is released under the Apache License 2.0. See the `LICENSE` file for details.
## Citation
If you use this dataset, please cite it. A `CITATION.cff` is included with suggested metadata.
---
## Acknowledgements
- Thanks to the original CSV provider [damerajee/en-kannada](https://huggingface.co/datasets/damerajee/en-kannada) for sharing the parallel English–Kannada data used as a source for this cleaned dataset.
- Thanks to [NSP](https://www.nspglobaltech.com) and [AI4Bharat](https://ai4bharat.iitm.ac.in) for supporting the creation of this dataset and for providing access to the current best open-weight English→Kannada translation LLMs.
---
Provided by:
ramachandrajoshi



