applied-ai-018/pretraining_v1-omega_v2_multi_lingual
Source: Hugging Face. Updated: 2024-08-06. Indexed: 2024-12-14.
Download link:
https://hf-mirror.com/datasets/applied-ai-018/pretraining_v1-omega_v2_multi_lingual
Description:
This dataset contains text data in 16 languages: Arabic, Bengali, Spanish, Gujarati, Hindi, Indonesian, Kannada, Kashmiri, Malayalam, Marathi, Nepali, Russian, Sinhala, Tamil, Telugu, and Thai. Each language configuration has a single split named train, which holds the text data. Dataset size and download size vary by language. The largest configuration, the multilingual omega_v2_multi_lingual, contains over 22 million text entries.
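The description above implies a layout of one configuration per language plus the combined omega_v2_multi_lingual configuration, each with a train split. The following is a minimal sketch of how such a layout could be explored with the Hugging Face `datasets` package, assuming that package is installed and the repository is publicly reachable. The helper names `list_configs`, `stream_train`, and `peek` are hypothetical and introduced here only for illustration. The listing does not state the per-language configuration names or the record field names, so the sketch looks them up rather than hard-coding them.

```python
from itertools import islice

# Repository and configuration names as they appear in the listing.
REPO_ID = "applied-ai-018/pretraining_v1-omega_v2_multi_lingual"
LARGEST_CONFIG = "omega_v2_multi_lingual"


def list_configs(repo_id=REPO_ID):
    """Return the configuration names published for the dataset (needs network)."""
    # Imported lazily so the module loads even without the third-party package.
    from datasets import get_dataset_config_names
    return get_dataset_config_names(repo_id)


def stream_train(config=LARGEST_CONFIG, repo_id=REPO_ID):
    """Open the train split as a stream instead of downloading it up front."""
    from datasets import load_dataset
    return load_dataset(repo_id, config, split="train", streaming=True)


def peek(records, n=3):
    """Return the first n records of any iterable as a list."""
    return list(islice(records, n))
```

Calling `list_configs()` should show which per-language configurations exist, and `peek(stream_train())` returns the first few records of the largest configuration so that field names can be inspected before relying on them. Streaming is used here because a configuration with over 22 million entries may be expensive to download in full.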
Provider:
applied-ai-018



