systemk/culturay-10M
Source: Hugging Face | Updated: 2024-11-29 | Indexed: 2024-12-14
Download link:
https://hf-mirror.com/datasets/systemk/culturay-10M
Description:
This dataset contains text data in 15 languages, each provided as its own configuration: Arabic (ar), Bengali (bn), German (de), English (en), Spanish (es), French (fr), Hindi (hi), Indonesian (id), Japanese (ja), Marathi (mr), Portuguese (pt), Russian (ru), Swahili (sw), Urdu (ur), and Chinese (zh). Each configuration includes the following features: id, document language, scores, language list, text content, URL, and collection. The dataset is intended primarily for model training, with data volumes ranging from several million to tens of millions per language version.
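As a rough illustration of how the per-language configurations might be used, here is a minimal sketch in Python. It assumes the third-party `datasets` library is installed, that each language code listed above is the literal configuration name, and that a `train` split exists; none of these is confirmed by this listing. The helper names `check_config` and `stream_samples` are hypothetical.

```python
# Language codes taken from the description above.
# Assumption: each code is also the exact configuration name on the Hub.
LANGUAGE_CONFIGS = (
    "ar", "bn", "de", "en", "es", "fr", "hi", "id",
    "ja", "mr", "pt", "ru", "sw", "ur", "zh",
)


def check_config(lang: str) -> str:
    """Normalize a language code and reject ones the description does not list."""
    code = lang.strip().lower()
    if code not in LANGUAGE_CONFIGS:
        raise ValueError(f"unknown language config: {lang!r}")
    return code


def stream_samples(lang: str, n: int = 3) -> list:
    """Fetch the first n records of one language config without a full download.

    Assumption: the split is named "train". Requires network access.
    """
    from datasets import load_dataset  # third-party: pip install datasets

    ds = load_dataset(
        "systemk/culturay-10M",
        check_config(lang),
        split="train",
        streaming=True,  # iterate lazily instead of downloading everything
    )
    return list(ds.take(n))
```

Calling `stream_samples("ja")` would then return a few records as dictionaries. Their exact key names are not given in this listing, so inspect the keys of a returned record before relying on any field.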
Provider:
systemk



