
IndicXlit Romanized Dataset

Indexed from arXiv on 2025-09-30
Download link:
https://github.com/AI4Bharat/romansetu
Description:
This dataset was generated by transliterating web-crawled Hindi corpora into Roman script with the IndicXlit model. It is intended for continued pre-training of Large Language Models (LLMs) on Romanized text, with the goal of improving their Hindi language processing by aligning Hindi more closely with English. The dataset contains approximately 100 million words.
Provider:
AI4Bharat