IndicXlit Romanized Dataset

Name: IndicXlit Romanized Dataset
Creator: AI4Bharat
License: Not specified

Indexed from arXiv on 2025-09-30

Download link:

https://github.com/AI4Bharat/romansetu


Description:

This dataset was generated by transliterating web-crawled Hindi corpora into Roman script with the IndicXlit model. It is intended for continued pre-training of large language models (LLMs) on romanized text; the romanized format aims to improve LLM performance on Hindi by aligning it more closely with English. The dataset contains approximately 100 million words.

Provider:

AI4Bharat
