INDIC-MARCO
Source: arXiv. Updated 2023-12-15. Indexed 2024-07-24.
Download link:
https://github.com/saifulhaq95/IndicIRSuite
Overview:
INDIC-MARCO is a multilingual dataset created by the Department of Computer Science and Engineering, Indian Institute of Technology Bombay, to support neural information retrieval research in 11 Indian languages, from Assamese to Telugu. It contains approximately 8.8 million documents, 1 million queries, and 39 million training triples. The dataset was built by machine-translating the original MSMARCO dataset into each target language, with the aim of providing high-quality and diverse data. It is intended mainly for advancing information retrieval in non-English languages, especially low-resource Indian languages.
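To make "training triples" concrete, here is a minimal sketch of one way such a record might be represented. It assumes the tab-separated layout of the original MSMARCO triples (query, relevant passage, non-relevant passage); the actual INDIC-MARCO file layout is not stated on this page, and the names `Triple` and `parse_triple` are hypothetical.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One training example: a query with one relevant and one non-relevant passage."""
    query: str
    positive: str
    negative: str


def parse_triple(line: str) -> Triple:
    # Assumption: fields are tab-separated, as in the original MSMARCO triples files.
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 3:
        raise ValueError(f"expected 3 tab-separated fields, got {len(parts)}")
    return Triple(*parts)


# Placeholder text only; not taken from the dataset.
example = parse_triple("query text\trelevant passage\tnon-relevant passage\n")
print(example.query)
```

A ranking model is typically trained to score the positive passage above the negative one for the same query.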
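Provided by:
Department of Computer Science and Engineering, Indian Institute of Technology Bombay
Created:
2023-12-15
Summary of original information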
IndicIRSuite: Multilingual Dataset and Neural Information Models for Indian Languages
Dataset link
Model link
Main contributors
- Saiful Haq
- Ashutosh Sharma
- Pushpak Bhattacharyya
Citation
@article{haq2023indicirsuite,
  title={IndicIRSuite: Multilingual Dataset and Neural Information Models for Indian Languages},
  author={Haq, Saiful and Sharma, Ashutosh and Bhattacharyya, Pushpak},
  journal={arXiv preprint arXiv:2312.09508},
  year={2023}
}
Language codes and corresponding languages
- asm_Beng: Assamese Language
- ben_Beng: Bengali Language
- guj_Gujr: Gujarati Language
- hin_Deva: Hindi Language
- kan_Knda: Kannada Language
- mal_Mlym: Malayalam Language
- mar_Deva: Marathi Language
- ory_Orya: Oriya Language
- pan_Guru: Punjabi Language
- tam_Taml: Tamil Language
- tel_Telu: Telugu Language
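
The mapping above can be kept as a small lookup table in code. The sketch below assumes each code is a three-letter language code joined by an underscore to a four-letter script code, which is how the list reads; the helper names `split_code` and `codes_by_script` are hypothetical and not part of IndicIRSuite.

```python
# Language codes listed above, mapped to language names.
LANGUAGES = {
    "asm_Beng": "Assamese",
    "ben_Beng": "Bengali",
    "guj_Gujr": "Gujarati",
    "hin_Deva": "Hindi",
    "kan_Knda": "Kannada",
    "mal_Mlym": "Malayalam",
    "mar_Deva": "Marathi",
    "ory_Orya": "Oriya",
    "pan_Guru": "Punjabi",
    "tam_Taml": "Tamil",
    "tel_Telu": "Telugu",
}


def split_code(code):
    """Split a code such as 'hin_Deva' into (language, script)."""
    language, script = code.split("_", 1)
    return language, script


def codes_by_script():
    """Group the codes by their script part, keeping the listed order."""
    groups = {}
    for code in LANGUAGES:
        _, script = split_code(code)
        groups.setdefault(script, []).append(code)
    return groups


print(codes_by_script()["Deva"])
```

Grouping by script shows that two pairs of languages share a writing system: Assamese and Bengali (Beng), and Hindi and Marathi (Deva).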



