BadalY453/iitb-english-hindi

Name: BadalY453/iitb-english-hindi
Creator: BadalY453
Published: 2026-02-16 19:38:07
License: No description available

Hugging Face, updated 2026-02-16, indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/BadalY453/iitb-english-hindi

Resource description:

---
language:
- en
- hi
---

<img src="https://huggingface.co/datasets/cfilt/HiNER-collapsed/raw/main/cfilt-dark-vec.png" alt="Computation for Indian Language Technology Logo" width="150" height="150"/>

# IITB-English-Hindi Parallel Corpus

[![License: CC BY-NC 4.0](https://img.shields.io/badge/License-CC%20BY--NC%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc/4.0/)
[![Twitter Follow](https://img.shields.io/twitter/follow/cfiltnlp?color=1DA1F2&logo=twitter&style=flat-square)](https://twitter.com/cfiltnlp)
[![Twitter Follow](https://img.shields.io/twitter/follow/PeopleCentredAI?color=1DA1F2&logo=twitter&style=flat-square)](https://twitter.com/PeopleCentredAI)

## About

The IIT Bombay English-Hindi corpus contains a parallel English-Hindi corpus as well as a monolingual Hindi corpus, collected from a variety of existing sources and from corpora developed at the Center for Indian Language Technology, IIT Bombay over the years. This page describes the corpus. The corpus has been used at the Workshop on Asian Language Translation Shared Task since 2016 for the Hindi-to-English and English-to-Hindi language pairs, and as a pivot language pair for the Hindi-to-Japanese and Japanese-to-Hindi language pairs.

The complete details of this corpus are available at [this URL](https://www.cfilt.iitb.ac.in/iitb_parallel/). The parallel corpus and a monolingual Hindi corpus are also available for browser download from the same URL.

### Recent Updates

* Version 3.1 - December 2021 - Added 49,400 sentence pairs to the parallel corpus.
* Version 3.0 - August 2020 - Added ~47,000 sentence pairs to the parallel corpus.

## Usage

We provide a notebook that shows how to import the IITB English-Hindi Parallel Corpus from the HuggingFace datasets repository. The notebook also shows how to segment the corpus using BPE tokenization, which can be used to train an English-Hindi MT system.

[https://github.com/cfiltnlp/IITB-English-Hindi-PC](https://github.com/cfiltnlp/IITB-English-Hindi-PC)

## Other

You can find a catalogue of other English-Hindi and other Indian language parallel corpora here: [Indic NLP Catalog](https://github.com/indicnlpweb/indicnlp_catalog)

## Maintainer(s)

[Diptesh Kanojia](https://dipteshkanojia.github.io)
Shivam Mhasker

## Citation

If you use this corpus or its derivative resources for your research, kindly cite it as follows:

Anoop Kunchukuttan, Pratik Mehta, Pushpak Bhattacharyya. The IIT Bombay English-Hindi Parallel Corpus. Language Resources and Evaluation Conference. 2018.

### BibTeX Citation

```latex
@inproceedings{kunchukuttan-etal-2018-iit,
    title = "The {IIT} {B}ombay {E}nglish-{H}indi Parallel Corpus",
    author = "Kunchukuttan, Anoop and Mehta, Pratik and Bhattacharyya, Pushpak",
    booktitle = "Proceedings of the Eleventh International Conference on Language Resources and Evaluation ({LREC} 2018)",
    month = may,
    year = "2018",
    address = "Miyazaki, Japan",
    publisher = "European Language Resources Association (ELRA)",
    url = "https://aclanthology.org/L18-1548",
}
```
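
## Example: loading and segmenting (illustrative sketch)

The Usage section mentions importing the corpus from the Hugging Face datasets repository and segmenting it with BPE. The following is a minimal sketch of those two steps, not the notebook's actual code. It rests on several assumptions: that each record has the shape `{"translation": {"en": ..., "hi": ...}}`, that the repository ID and split name passed to `load_pairs` exist, and that a simplified BPE learner (character symbols, no end-of-word marker) is enough for illustration. The helper names `extract_pairs`, `load_pairs`, and `learn_bpe` are hypothetical.

```python
from __future__ import annotations

from collections import Counter
from typing import Iterable, Iterator


def extract_pairs(records: Iterable[dict]) -> Iterator[tuple[str, str]]:
    """Yield (english, hindi) pairs.

    Assumes each record looks like {"translation": {"en": ..., "hi": ...}};
    this schema is an assumption, so check it against the real dataset.
    """
    for record in records:
        pair = record["translation"]
        yield pair["en"], pair["hi"]


def load_pairs(repo_id: str, split: str) -> list[tuple[str, str]]:
    """Load one split from the Hugging Face Hub.

    Requires `pip install datasets` and network access. The repo ID and
    split name are supplied by the caller; neither is verified here.
    """
    from datasets import load_dataset  # lazy import: helpers above work offline

    return list(extract_pairs(load_dataset(repo_id, split=split)))


def learn_bpe(sentences: Iterable[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn up to `num_merges` BPE merge rules from whitespace-split words.

    Simplified: symbols start as single characters and no end-of-word
    marker is used, unlike most production BPE implementations.
    """
    # Each distinct word is a tuple of symbols mapped to its frequency.
    vocab = Counter(tuple(word) for s in sentences for word in s.split())
    merges: list[tuple[str, str]] = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts: Counter = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_vocab: Counter = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges
```

With network access, a call such as `load_pairs("cfilt/iitb-english-hindi", "validation")` would return a list of sentence pairs, provided that repository ID, split, and schema match the assumptions above. The English or Hindi side can then be passed to `learn_bpe`. For real MT training, an established subword library is the more practical choice; the learner here only shows the merge loop.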


Provider:

BadalY453
