rahular/itihasa

Name: rahular/itihasa
Creator: rahular
Published: 2022-10-24 18:06:01
License: Apache-2.0

Source: Hugging Face. Updated 2022-10-24; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/rahular/itihasa

Resource description:

---
annotations_creators:
- expert-generated
language_creators:
- expert-generated
language:
- sa
- en
license:
- apache-2.0
multilinguality:
- translation
size_categories:
- unknown
source_datasets:
- original
task_categories:
- text2text-generation
task_ids: []
pretty_name: Itihasa
metrics:
- bleu
- sacrebleu
- rouge
- ter
- chrF
tags:
- conditional-text-generation
---

# Itihāsa

Itihāsa is a Sanskrit-English translation corpus containing 93,000 Sanskrit shlokas and their English translations extracted from M. N. Dutt's seminal works on The Rāmāyana and The Mahābhārata.

The paper which introduced this dataset can be found [here](https://aclanthology.org/2021.wat-1.22/).

This repository contains the randomized train, development, and test sets. The original extracted data can be found [here](https://github.com/rahular/itihasa/tree/gh-pages/res) in JSON format. If you just want to browse the data, you can go [here](http://rahular.com/itihasa/).

## Usage

```
>>> from datasets import load_dataset
>>> dataset = load_dataset("rahular/itihasa")
>>> dataset
DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 75162
    })
    validation: Dataset({
        features: ['translation'],
        num_rows: 6149
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 11722
    })
})

>>> dataset['train'][0]
{'translation': {'en': 'The ascetic Vālmīki asked Nārada, the best of sages and foremost of those conversant with words, ever engaged in austerities and Vedic studies.',
  'sn': 'ॐ तपः स्वाध्यायनिरतं तपस्वी वाग्विदां वरम्। नारदं परिपप्रच्छ वाल्मीकिर्मुनिपुङ्गवम्॥'}}
```

## Citation

If you found this dataset to be useful, please consider citing the paper as follows:

```
@inproceedings{aralikatte-etal-2021-itihasa,
    title = "Itihasa: A large-scale corpus for {S}anskrit to {E}nglish translation",
    author = "Aralikatte, Rahul and de Lhoneux, Miryam and Kunchukuttan, Anoop and S{\o}gaard, Anders",
    booktitle = "Proceedings of the 8th Workshop on Asian Translation (WAT2021)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.wat-1.22",
    pages = "191--197",
    abstract = "This work introduces Itihasa, a large-scale translation dataset containing 93,000 pairs of Sanskrit shlokas and their English translations. The shlokas are extracted from two Indian epics viz., The Ramayana and The Mahabharata. We first describe the motivation behind the curation of such a dataset and follow up with empirical analysis to bring out its nuances. We then benchmark the performance of standard translation models on this corpus and show that even state-of-the-art transformer architectures perform poorly, emphasizing the complexity of the dataset.",
}
```
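Each record in the usage output above nests both sides of a pair under a single `translation` field. The following is a minimal sketch of unpacking such a record into a (Sanskrit, English) tuple. The helper name `to_pair` and its parameters are hypothetical, not part of the dataset or the `datasets` library. The key names are taken from the sample record, where the Sanskrit side appears under `sn` even though the card metadata lists the language code `sa`.

```python
from typing import Dict, Tuple

# First training record, copied from the usage output in the dataset card.
SAMPLE_RECORD: Dict[str, Dict[str, str]] = {
    "translation": {
        "en": (
            "The ascetic Vālmīki asked Nārada, the best of sages and foremost "
            "of those conversant with words, ever engaged in austerities and "
            "Vedic studies."
        ),
        "sn": (
            "ॐ तपः स्वाध्यायनिरतं तपस्वी वाग्विदां वरम्। "
            "नारदं परिपप्रच्छ वाल्मीकिर्मुनिपुङ्गवम्॥"
        ),
    }
}


def to_pair(
    record: Dict[str, Dict[str, str]],
    src_key: str = "sn",  # assumption: key observed in the sample record
    tgt_key: str = "en",
) -> Tuple[str, str]:
    """Return (source, target) text from one record (hypothetical helper)."""
    translation = record["translation"]
    return translation[src_key], translation[tgt_key]


sanskrit, english = to_pair(SAMPLE_RECORD)
print(sanskrit)
print(english)
```

With the dataset loaded as in the usage example, the same helper could be applied per record, for instance `pairs = [to_pair(r) for r in dataset["validation"]]`.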

Provided by:

rahular

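Summary of original information

Dataset overview: Itihāsa

Basic information

Name: Itihāsa
Languages:
- Sanskrit (sa)
- English (en)
License: Apache-2.0
Multilinguality: translation
Source datasets: original
Task category: text-to-text generation
Evaluation metrics:
- BLEU
- SacreBLEU
- ROUGE
- TER
- chrF
Tags: conditional text generation

Dataset details

Description: Itihāsa is a corpus of 93,000 Sanskrit shlokas and their English translations, extracted from M. N. Dutt's seminal works on The Rāmāyana and The Mahābhārata.
Dataset structure: randomized train, development, and test sets.
Size category: unknown

Usage example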

```python
from datasets import load_dataset

dataset = load_dataset("rahular/itihasa")
```

Training set: 75,162 rows
Validation set: 6,149 rows
Test set: 11,722 rows
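The split sizes listed above can be checked against the headline figure of 93,000 pairs with a little arithmetic. The sketch below uses only the row counts reported here and does not download anything; the variable names are illustrative.

```python
# Row counts as reported for each split above.
SPLIT_ROWS = {"train": 75162, "validation": 6149, "test": 11722}

# Total number of shloka/translation pairs across all splits.
total_rows = sum(SPLIT_ROWS.values())

# Share of the corpus held by each split, as a fraction of all rows.
split_shares = {name: rows / total_rows for name, rows in SPLIT_ROWS.items()}

print(f"total: {total_rows}")  # total: 93033
for name, share in split_shares.items():
    print(f"{name}: {share:.1%}")
```

The total of 93,033 rows is consistent with the rounded "93,000" quoted in the description.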

Citation

```
@inproceedings{aralikatte-etal-2021-itihasa,
    title = "Itihasa: A large-scale corpus for {S}anskrit to {E}nglish translation",
    author = "Aralikatte, Rahul and de Lhoneux, Miryam and Kunchukuttan, Anoop and S{\o}gaard, Anders",
    booktitle = "Proceedings of the 8th Workshop on Asian Translation (WAT2021)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.wat-1.22",
    pages = "191--197",
    abstract = "This work introduces Itihasa, a large-scale translation dataset containing 93,000 pairs of Sanskrit shlokas and their English translations. The shlokas are extracted from two Indian epics viz., The Ramayana and The Mahabharata. We first describe the motivation behind the curation of such a dataset and follow up with empirical analysis to bring out its nuances. We then benchmark the performance of standard translation models on this corpus and show that even state-of-the-art transformer architectures perform poorly, emphasizing the complexity of the dataset.",
}
```
