
rahular/itihasa

Source: Hugging Face | Updated: 2022-10-24 | Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/rahular/itihasa
Resource description:
---
annotations_creators:
- expert-generated
language_creators:
- expert-generated
language:
- sa
- en
license:
- apache-2.0
multilinguality:
- translation
size_categories:
- unknown
source_datasets:
- original
task_categories:
- text2text-generation
task_ids: []
pretty_name: Itihasa
metrics:
- bleu
- sacrebleu
- rouge
- ter
- chrF
tags:
- conditional-text-generation
---

# Itihāsa

Itihāsa is a Sanskrit-English translation corpus containing 93,000 Sanskrit shlokas and their English translations extracted from M. N. Dutt's seminal works on The Rāmāyana and The Mahābhārata.

The paper which introduced this dataset can be found [here](https://aclanthology.org/2021.wat-1.22/).

This repository contains the randomized train, development, and test sets. The original extracted data can be found [here](https://github.com/rahular/itihasa/tree/gh-pages/res) in JSON format. If you just want to browse the data, you can go [here](http://rahular.com/itihasa/).

## Usage

```
>>> from datasets import load_dataset
>>> dataset = load_dataset("rahular/itihasa")
>>> dataset
DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 75162
    })
    validation: Dataset({
        features: ['translation'],
        num_rows: 6149
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 11722
    })
})

>>> dataset['train'][0]
{'translation': {'en': 'The ascetic Vālmīki asked Nārada, the best of sages and foremost of those conversant with words, ever engaged in austerities and Vedic studies.',
  'sn': 'ॐ तपः स्वाध्यायनिरतं तपस्वी वाग्विदां वरम्। नारदं परिपप्रच्छ वाल्मीकिर्मुनिपुङ्गवम्॥'}}
```

## Citation

If you found this dataset to be useful, please consider citing the paper as follows:

```
@inproceedings{aralikatte-etal-2021-itihasa,
    title = "Itihasa: A large-scale corpus for {S}anskrit to {E}nglish translation",
    author = "Aralikatte, Rahul and
      de Lhoneux, Miryam and
      Kunchukuttan, Anoop and
      S{\o}gaard, Anders",
    booktitle = "Proceedings of the 8th Workshop on Asian Translation (WAT2021)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.wat-1.22",
    pages = "191--197",
    abstract = "This work introduces Itihasa, a large-scale translation dataset containing 93,000 pairs of Sanskrit shlokas and their English translations. The shlokas are extracted from two Indian epics viz., The Ramayana and The Mahabharata. We first describe the motivation behind the curation of such a dataset and follow up with empirical analysis to bring out its nuances. We then benchmark the performance of standard translation models on this corpus and show that even state-of-the-art transformer architectures perform poorly, emphasizing the complexity of the dataset.",
}
```
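The record layout shown in the usage output (a `translation` dictionary keyed by `en` and `sn`) can be unpacked into source/target pairs, for example before passing the text to a translation model. The sketch below is only an illustration and not part of the dataset's own tooling: the helper name `to_pairs` and its parameters are hypothetical, and the sample record is copied from the usage output so that the snippet runs without downloading anything.

```python
from typing import Iterable, List, Mapping, Tuple


def to_pairs(
    records: Iterable[Mapping[str, Mapping[str, str]]],
    src: str = "sn",  # key holding the Sanskrit text in the usage output
    tgt: str = "en",  # key holding the English translation
) -> List[Tuple[str, str]]:
    """Collect (source, target) string pairs from translation records."""
    pairs = []
    for record in records:
        translation = record["translation"]
        pairs.append((translation[src], translation[tgt]))
    return pairs


# One record copied from the usage output, used here as offline sample data.
sample = [
    {
        "translation": {
            "en": "The ascetic Vālmīki asked Nārada, the best of sages and "
                  "foremost of those conversant with words, ever engaged in "
                  "austerities and Vedic studies.",
            "sn": "ॐ तपः स्वाध्यायनिरतं तपस्वी वाग्विदां वरम्। "
                  "नारदं परिपप्रच्छ वाल्मीकिर्मुनिपुङ्गवम्॥",
        }
    }
]

pairs = to_pairs(sample)
sanskrit, english = pairs[0]
print(english)
```

With the real data, the same helper could presumably be applied to a loaded split, for instance `to_pairs(dataset["train"])`, since iterating over a split yields records of this shape.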
Provided by:
rahular
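Summary of original information

Dataset overview: Itihāsa

Basic information

  • Name: Itihāsa
  • Languages:
    • Sanskrit (sa)
    • English (en)
  • License: Apache-2.0
  • Multilinguality: translation
  • Source datasets: original
  • Task category: text-to-text generation (text2text-generation)
  • Evaluation metrics:
    • BLEU
    • SacreBLEU
    • ROUGE
    • TER
    • ChrF
  • Tags: conditional-text-generation

Dataset details

  • Description: Itihāsa is a corpus of 93,000 Sanskrit shlokas and their English translations, extracted from M. N. Dutt's works on The Rāmāyana and The Mahābhārata.
  • Dataset structure: randomized training, development, and test sets.
  • Size category: unknown (as declared in the dataset metadata)

Dataset usage example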

    from datasets import load_dataset
    dataset = load_dataset("rahular/itihasa")

  • Training set: 75,162 rows
  • Validation set: 6,149 rows
  • Test set: 11,722 rows
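As a quick sanity check, the three split sizes listed above can be summed and compared with the "93,000" figure quoted in the description. The snippet below is a small arithmetic sketch; the variable names are illustrative and the numbers are taken directly from the split counts.

```python
# Row counts per split, as reported for rahular/itihasa.
split_sizes = {"train": 75162, "validation": 6149, "test": 11722}

# Total number of shloka/translation pairs across all splits.
total = sum(split_sizes.values())
print(total)  # 93033

# Share of the corpus held by each split.
shares = {name: rows / total for name, rows in split_sizes.items()}
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The total of 93,033 rows is consistent with the rounded figure of 93,000 pairs given in the description.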

Citation

@inproceedings{aralikatte-etal-2021-itihasa,
    title = "Itihasa: A large-scale corpus for {S}anskrit to {E}nglish translation",
    author = "Aralikatte, Rahul and
      de Lhoneux, Miryam and
      Kunchukuttan, Anoop and
      S{\o}gaard, Anders",
    booktitle = "Proceedings of the 8th Workshop on Asian Translation (WAT2021)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.wat-1.22",
    pages = "191--197",
    abstract = "This work introduces Itihasa, a large-scale translation dataset containing 93,000 pairs of Sanskrit shlokas and their English translations. The shlokas are extracted from two Indian epics viz., The Ramayana and The Mahabharata. We first describe the motivation behind the curation of such a dataset and follow up with empirical analysis to bring out its nuances. We then benchmark the performance of standard translation models on this corpus and show that even state-of-the-art transformer architectures perform poorly, emphasizing the complexity of the dataset.",
}
