rahular/itihasa
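Dataset Overview: Itihāsa
Basic Information
- Name: Itihāsa
- Languages:
  - Sanskrit (sa)
  - English (en)
- License: Apache-2.0
- Multilinguality: translation
- Source data: original
- Task category: text-to-text generation
- Evaluation metrics:
  - BLEU
  - SacreBLEU
  - ROUGE
  - TER
  - ChrF
- Tags: conditional text generation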
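Dataset Details
- Description: Itihāsa is a corpus of 93,000 Sanskrit shlokas (verses) and their English translations, extracted from M. N. Dutt's classic works on the Ramayana and the Mahabharata.
- Dataset structure: randomly assigned training, development (validation), and test sets.
- Dataset size: unknown
Usage Example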
    from datasets import load_dataset

    dataset = load_dataset("rahular/itihasa")
- Training set: 75,162 rows
- Validation set: 6,149 rows
- Test set: 11,722 rows
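The split sizes above can be sanity-checked with a small helper. This is a minimal sketch, not part of the dataset's own tooling: `EXPECTED_ROWS`, `split_fractions`, and `find_mismatches` are hypothetical names introduced here, and the split keys `train`, `validation`, and `test` are an assumption based on the list above.

```python
# Row counts per split, copied from the list above.
# The key names are an assumption; adjust them if the loaded splits differ.
EXPECTED_ROWS = {"train": 75_162, "validation": 6_149, "test": 11_722}


def split_fractions(rows):
    """Return each split's share of the total row count."""
    total = sum(rows.values())
    return {name: count / total for name, count in rows.items()}


def find_mismatches(actual_rows, expected_rows=EXPECTED_ROWS):
    """Return {split: (expected, actual)} for every split whose count differs.

    A split missing from actual_rows is reported with actual = None.
    """
    return {
        name: (expected, actual_rows.get(name))
        for name, expected in expected_rows.items()
        if actual_rows.get(name) != expected
    }


if __name__ == "__main__":
    # The three splits together should account for roughly 93,000 pairs.
    print(f"total rows: {sum(EXPECTED_ROWS.values())}")
    for name, share in split_fractions(EXPECTED_ROWS).items():
        print(f"{name}: {share:.1%}")
```

With the `dataset` object from the usage example, the actual counts could be gathered as `{name: split.num_rows for name, split in dataset.items()}` and passed to `find_mismatches`; an empty result would mean the loaded splits match the numbers listed here.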
Citation
@inproceedings{aralikatte-etal-2021-itihasa,
    title = "Itihasa: A large-scale corpus for {S}anskrit to {E}nglish translation",
    author = "Aralikatte, Rahul and de Lhoneux, Miryam and Kunchukuttan, Anoop and S{\o}gaard, Anders",
    booktitle = "Proceedings of the 8th Workshop on Asian Translation (WAT2021)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.wat-1.22",
    pages = "191--197",
    abstract = "This work introduces Itihasa, a large-scale translation dataset containing 93,000 pairs of Sanskrit shlokas and their English translations. The shlokas are extracted from two Indian epics viz., The Ramayana and The Mahabharata. We first describe the motivation behind the curation of such a dataset and follow up with empirical analysis to bring out its nuances. We then benchmark the performance of standard translation models on this corpus and show that even state-of-the-art transformer architectures perform poorly, emphasizing the complexity of the dataset.",
}



