Indo4B-Plus

Name: Indo4B-Plus
Creator: The Hong Kong University of Science and Technology
Published: 2021-10-10 00:58:54
License: Not specified

Source: arXiv. Updated: 2021-10-10. Indexed: 2024-06-21.

Download link:

https://github.com/indobenchmark/indonlg

Description:

Indo4B-Plus is a large-scale pre-training corpus covering Indonesian, Javanese, and Sundanese, created by institutions including the Hong Kong University of Science and Technology. It is intended primarily for pre-training natural language generation models and contains more than 3.6 billion words, drawn from sources such as Wikipedia and Common Crawl. The construction process comprises data collection, preprocessing, and balancing. The goal is to support natural language generation for Indonesian, Javanese, and Sundanese, which are widely spoken yet remain low-resource.
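The description mentions a balancing step but does not say how it is performed. As a purely illustrative sketch, and not the authors' procedure, one simple way to balance a multilingual corpus is to cap the number of words kept per language. The function names, the (language, text) record layout, and the cap_words parameter below are all assumptions made for this example.

```python
import random
from collections import defaultdict


def word_count(text):
    # Whitespace tokenization; a rough proxy for "words" (assumption).
    return len(text.split())


def balance_by_language(docs, cap_words, seed=0):
    """Keep at most `cap_words` words per language.

    `docs` is a list of (lang, text) pairs. This is a hypothetical
    capping scheme, not the method used to build Indo4B-Plus.
    """
    rng = random.Random(seed)

    # Group documents by language code.
    by_lang = defaultdict(list)
    for lang, text in docs:
        by_lang[lang].append(text)

    balanced = {}
    for lang, texts in by_lang.items():
        rng.shuffle(texts)  # avoid always keeping the same leading documents
        kept, total = [], 0
        for text in texts:
            n = word_count(text)
            if total + n > cap_words:
                continue  # skip documents that would exceed the cap
            kept.append(text)
            total += n
        balanced[lang] = kept
    return balanced
```

With a cap chosen near the size of the smallest language, larger languages are downsampled while smaller ones are kept whole. Upsampling the smaller languages would be an equally plausible alternative; the source does not indicate which approach, if either, was used.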

Provider:

The Hong Kong University of Science and Technology

Created:

2021-04-17
