sharmapurushottam9/iitb-english-hindi
Source: Hugging Face. Updated 2026-03-29; indexed 2026-04-12.
Download link: https://hf-mirror.com/datasets/sharmapurushottam9/iitb-english-hindi
Resource description:
---
language:
- en
- hi
---
<p align="center"><img src="https://huggingface.co/datasets/cfilt/HiNER-collapsed/raw/main/cfilt-dark-vec.png" alt="Computation for Indian Language Technology Logo" width="150" height="150"/></p>
# IITB-English-Hindi Parallel Corpus
License: [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/)
Twitter: [@cfiltnlp](https://twitter.com/cfiltnlp), [@PeopleCentredAI](https://twitter.com/PeopleCentredAI)
## About
The IIT Bombay English-Hindi corpus contains a parallel English-Hindi corpus and a monolingual Hindi corpus, collected from a variety of existing sources and from corpora developed at the Center for Indian Language Technology, IIT Bombay over the years. This page describes the corpus. It has been used at the Workshop on Asian Language Translation Shared Task since 2016 for the Hindi-to-English and English-to-Hindi language pairs, and as a pivot language pair for the Hindi-to-Japanese and Japanese-to-Hindi language pairs.
The complete details of this corpus are available at [this URL](https://www.cfilt.iitb.ac.in/iitb_parallel/). Both the parallel corpus and the monolingual Hindi corpus can be downloaded via browser from the same URL.
### Recent Updates
* Version 3.1 - December 2021 - Added 49,400 sentence pairs to the parallel corpus.
* Version 3.0 - August 2020 - Added ~47,000 sentence pairs to the parallel corpus.
## Usage
We provide a notebook that shows how to import the IITB English-Hindi Parallel Corpus from the Hugging Face datasets repository. The notebook also shows how to segment the corpus using BPE tokenization, which can be used to train an English-Hindi MT system.
[https://github.com/cfiltnlp/IITB-English-Hindi-PC](https://github.com/cfiltnlp/IITB-English-Hindi-PC)
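For readers who want a quick look before opening the notebook, the two steps described above (importing the corpus and segmenting it with BPE) can be sketched roughly as follows. This is a minimal illustration, not the notebook's code. It assumes the `datasets` library is installed, that the repository id shown is the one you want, that a `validation` split exists, and that each record has the shape `{"translation": {"en": ..., "hi": ...}}`. The helper names `load_corpus`, `to_pairs`, and `learn_bpe` are hypothetical. The BPE routine is a toy merge learner over Unicode code points; a real MT pipeline would normally use an established subword tool instead.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def load_corpus(repo_id: str = "sharmapurushottam9/iitb-english-hindi",
                split: str = "validation"):
    """Download one split from the Hugging Face Hub (needs network access).

    The repo id and split name are assumptions; substitute your own.
    """
    from datasets import load_dataset  # pip install datasets
    return load_dataset(repo_id, split=split)


def to_pairs(records: Iterable[Dict]) -> List[Tuple[str, str]]:
    """Convert records shaped like {"translation": {"en": ..., "hi": ...}}
    (an assumed schema) into (English, Hindi) tuples."""
    return [(r["translation"]["en"], r["translation"]["hi"]) for r in records]


def learn_bpe(sentences: Iterable[str], num_merges: int) -> List[Tuple[str, str]]:
    """Learn up to num_merges BPE merge rules from whitespace-split text."""
    # Every word starts as its sequence of characters plus an end-of-word marker.
    vocab = Counter(
        tuple(word) + ("</w>",) for s in sentences for word in s.split()
    )
    merges: List[Tuple[str, str]] = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts: Counter = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_vocab: Counter = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges
```

With network access, something like `pairs = to_pairs(load_corpus())` followed by `learn_bpe((en for en, _ in pairs), 1000)` would learn merge rules for the English side; the merge count of 1000 is an arbitrary placeholder.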
## Other
You can find a catalogue of other English-Hindi and other Indian language parallel corpora here: [Indic NLP Catalog](https://github.com/indicnlpweb/indicnlp_catalog)
## Maintainer(s)
[Diptesh Kanojia](https://dipteshkanojia.github.io)<br/>
Shivam Mhasker<br/>
## Citation
If you use this corpus or its derivative resources for your research, kindly cite it as follows:
Anoop Kunchukuttan, Pratik Mehta, Pushpak Bhattacharyya. The IIT Bombay English-Hindi Parallel Corpus. Language Resources and Evaluation Conference. 2018.
### BibTeX Citation
```latex
@inproceedings{kunchukuttan-etal-2018-iit,
title = "The {IIT} {B}ombay {E}nglish-{H}indi Parallel Corpus",
author = "Kunchukuttan, Anoop and
Mehta, Pratik and
Bhattacharyya, Pushpak",
booktitle = "Proceedings of the Eleventh International Conference on Language Resources and Evaluation ({LREC} 2018)",
month = may,
year = "2018",
address = "Miyazaki, Japan",
publisher = "European Language Resources Association (ELRA)",
url = "https://aclanthology.org/L18-1548",
}
```
Provider: sharmapurushottam9