mbelitsky/wikipedia_subset
Source: Hugging Face. Updated: 2024-05-14. Indexed: 2024-06-12.
Download link:
https://hf-mirror.com/datasets/mbelitsky/wikipedia_subset
Resource description:
---
dataset_info:
  features:
  - name: text
    dtype: string
  splits:
  - name: lt
    num_bytes: 120236308.42625372
    num_examples: 80000
  - name: ru
    num_bytes: 407217564.5930235
    num_examples: 80000
  - name: cs
    num_bytes: 227017023.46623126
    num_examples: 80000
  - name: en
    num_bytes: 244616657.28749305
    num_examples: 80000
  - name: fr
    num_bytes: 244069992.19385442
    num_examples: 80000
  - name: de
    num_bytes: 263252038.40146655
    num_examples: 80000
  - name: zh
    num_bytes: 152004693.59045833
    num_examples: 80000
  - name: sw
    num_bytes: 67056096.0
    num_examples: 78587
  - name: ar
    num_bytes: 191851997.10302076
    num_examples: 80000
  - name: hi
    num_bytes: 311105691.1087539
    num_examples: 80000
  - name: fa
    num_bytes: 142407735.48300844
    num_examples: 80000
  - name: ur
    num_bytes: 151052595.50146386
    num_examples: 80000
  - name: es
    num_bytes: 254703297.61481243
    num_examples: 80000
  download_size: 1513185018
  dataset_size: 2776591690.76984
configs:
- config_name: default
  data_files:
  - split: lt
    path: data/lt-*
  - split: ru
    path: data/ru-*
  - split: cs
    path: data/cs-*
  - split: en
    path: data/en-*
  - split: fr
    path: data/fr-*
  - split: de
    path: data/de-*
  - split: zh
    path: data/zh-*
  - split: sw
    path: data/sw-*
  - split: ar
    path: data/ar-*
  - split: hi
    path: data/hi-*
  - split: fa
    path: data/fa-*
  - split: ur
    path: data/ur-*
  - split: es
    path: data/es-*
---
Provider:
mbelitsky
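Summary of original information
Dataset overview
Dataset features
- Name: text
- Data type: string
Dataset splits
- Names: lt, ru, cs, en, fr, de, zh, sw, ar, hi, fa, ur, es
- Number of examples: each split contains 80,000 examples, except the sw split, which contains 78,587.
- Size in bytes: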
- lt: 120236308.42625372
- ru: 407217564.5930235
- cs: 227017023.46623126
- en: 244616657.28749305
- fr: 244069992.19385442
- de: 263252038.40146655
- zh: 152004693.59045833
- sw: 67056096.0
- ar: 191851997.10302076
- hi: 311105691.1087539
- fa: 142407735.48300844
- ur: 151052595.50146386
- es: 254703297.61481243
Dataset size
- Download size: 1513185018 bytes
- Total dataset size: 2776591690.76984 bytes
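
As a quick consistency check on the figures above, the per-split byte counts can be summed and compared with the reported total. The following is a minimal sketch that only uses numbers copied from this listing; the variable names (`SPLITS`, `avg_bytes`, and so on) are illustrative and not part of the dataset.

```python
import math

# (num_bytes, num_examples) per split, copied from the listing above.
SPLITS = {
    "lt": (120236308.42625372, 80000),
    "ru": (407217564.5930235, 80000),
    "cs": (227017023.46623126, 80000),
    "en": (244616657.28749305, 80000),
    "fr": (244069992.19385442, 80000),
    "de": (263252038.40146655, 80000),
    "zh": (152004693.59045833, 80000),
    "sw": (67056096.0, 78587),
    "ar": (191851997.10302076, 80000),
    "hi": (311105691.1087539, 80000),
    "fa": (142407735.48300844, 80000),
    "ur": (151052595.50146386, 80000),
    "es": (254703297.61481243, 80000),
}
DATASET_SIZE = 2776591690.76984  # reported total dataset size, in bytes
DOWNLOAD_SIZE = 1513185018       # reported download size, in bytes

# fsum avoids accumulating floating-point rounding error across the splits.
total_bytes = math.fsum(size for size, _ in SPLITS.values())
total_examples = sum(count for _, count in SPLITS.values())

# Average size of one example in each split, in bytes.
avg_bytes = {name: size / count for name, (size, count) in SPLITS.items()}
largest = max(avg_bytes, key=avg_bytes.get)
smallest = min(avg_bytes, key=avg_bytes.get)

print(f"sum of splits: {total_bytes:.2f} bytes, reported: {DATASET_SIZE:.2f} bytes")
print(f"total examples: {total_examples}")
print(f"download / dataset size ratio: {DOWNLOAD_SIZE / DATASET_SIZE:.3f}")
print(f"largest average example: {largest}, smallest: {smallest}")
```

The average bytes per example is a rough proxy for article length in each language; it also depends on how many bytes the script needs per character in UTF-8.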
Configuration
- Config name: default
- Data file paths:
- lt: data/lt-*
- ru: data/ru-*
- cs: data/cs-*
- en: data/en-*
- fr: data/fr-*
- de: data/de-*
- zh: data/zh-*
- sw: data/sw-*
- ar: data/ar-*
- hi: data/hi-*
- fa: data/fa-*
- ur: data/ur-*
- es: data/es-*
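
The split-to-path mapping above can be sketched in code as follows. This is an illustration rather than the provider's own tooling: `data_glob` and `load_split` are hypothetical helper names, the shard file name is made up to show how the glob matches, and `load_split` assumes the Hugging Face `datasets` package is installed and the Hub is reachable.

```python
import fnmatch

# Split names listed for the "default" config.
SPLITS = ["lt", "ru", "cs", "en", "fr", "de", "zh", "sw", "ar", "hi", "fa", "ur", "es"]


def data_glob(split: str) -> str:
    """Return the data-file glob listed for a split, e.g. 'data/sw-*'."""
    if split not in SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    return f"data/{split}-*"


def load_split(split: str):
    """Load one split from the Hub (requires network and the `datasets` package)."""
    # Imported lazily so the helpers above also work without `datasets` installed.
    from datasets import load_dataset

    data_glob(split)  # validates the split name before any download starts
    return load_dataset("mbelitsky/wikipedia_subset", split=split)


# Hypothetical shard name, used only to show which glob it falls under.
example_shard = "data/sw-00000-of-00001.parquet"
matches = [s for s in SPLITS if fnmatch.fnmatchcase(example_shard, data_glob(s))]
print(matches)
```

Calling `load_split("sw")` would then return a dataset with a single `text` column, matching the feature listed above.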



