nlp4j/wikipedia
Source: Hugging Face. Updated: 2023-11-15. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/nlp4j/wikipedia
Resource description:
---
annotations_creators:
- no-annotation
language:
- ja
license:
- cc-by-sa-3.0
source_datasets:
- original
pretty_name: Wikipedia
config_names:
- 20230101.ja
configs:
- config_name: 20230101.ja
  data_files:
  - split: train
    path: 20230101.ja/train-*
- config_name: 20230101.ja.type0
  data_files:
  - split: train
    path: 20230101.ja.type0/train-*
- config_name: 20230101.ja.type1
  data_files:
  - split: train
    path: 20230101.ja.type1/train-*
- config_name: 20230801.ja.type1
  data_files:
  - split: train
    path: 20230801.ja.type1/train-*
- config_name: 20230901.ja.type1
  data_files:
  - split: train
    path: 20230901.ja.type1/train-*
- config_name: 20231001.ja.type1
  data_files:
  - split: train
    path: 20231001.ja.type1/train-*
- config_name: 20231101.ja.type1
  data_files:
  - split: train
    path: 20231101.ja.type1/train-*
- config_name: default
  data_files:
  - split: train
    path: data/train-*
dataset_info:
- config_name: 20230101.ja
  features:
  - name: id
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 5445627558
    num_examples: 2192693
  download_size: 3016211435
  dataset_size: 5445627558
- config_name: 20230101.ja.type0
  features:
  - name: id
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: wikitext
    dtype: string
  splits:
  - name: train
    num_bytes: 12897936907
    num_examples: 2192693
  download_size: 6648740055
  dataset_size: 12897936907
- config_name: 20230101.ja.type1
  features:
  - name: id
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 5445627558
    num_examples: 2192693
  download_size: 3016211435
  dataset_size: 5445627558
- config_name: 20230801.ja.type1
  features:
  - name: id
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 5578527799
    num_examples: 2237531
  download_size: 3089288079
  dataset_size: 5578527799
- config_name: 20230901.ja.type1
  features:
  - name: id
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 5595772816
    num_examples: 2243408
  download_size: 3099146546
  dataset_size: 5595772816
- config_name: 20231001.ja.type1
  features:
  - name: id
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 5616001418
    num_examples: 2246589
  download_size: 3109672199
  dataset_size: 5616001418
- config_name: 20231101.ja.type1
  features:
  - name: id
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 5636247958
    num_examples: 2252320
  download_size: 3120907128
  dataset_size: 5636247958
- config_name: default
  features:
  - name: id
    dtype: string
  - name: url
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 5132551154
    num_examples: 2192693
  download_size: 2888006523
  dataset_size: 5132551154
---
# Dataset Card for "wikipedia"
[More Information needed](https://github.com/huggingface/datasets/blob/main/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
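
The card itself gives no usage instructions. As a minimal sketch, assuming the Hugging Face `datasets` library is installed and the repository still exposes the configs listed in the metadata above, one snapshot could be loaded as follows. The helper names `text_column` and `load_snapshot` are hypothetical and not part of the dataset; streaming is used here only to avoid downloading roughly 3 GB up front.

```python
REPO_ID = "nlp4j/wikipedia"

# Config names copied from the `configs` section of the card metadata.
CONFIG_NAMES = (
    "20230101.ja",
    "20230101.ja.type0",
    "20230101.ja.type1",
    "20230801.ja.type1",
    "20230901.ja.type1",
    "20231001.ja.type1",
    "20231101.ja.type1",
    "default",
)


def text_column(config_name):
    """Name of the article-body column for a config, per the card's feature lists.

    Only the `.type0` config declares `wikitext`; every other config declares `text`.
    """
    return "wikitext" if config_name.endswith(".type0") else "text"


def load_snapshot(config_name="20231101.ja.type1", streaming=True):
    """Load the train split of one config (hypothetical helper, needs network access)."""
    if config_name not in CONFIG_NAMES:
        raise ValueError(f"unknown config: {config_name!r}")
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config_name, split="train", streaming=streaming)


# Example use (downloads data, so it is left commented out):
# ds = load_snapshot("20231101.ja.type1")
# first = next(iter(ds))
# print(first["title"], len(first[text_column("20231101.ja.type1")]))
```

Each record is expected to carry the four string fields declared in the metadata (`id`, `url`, `title`, and `text` or `wikitext`); the card does not document what distinguishes `type0` from `type1` beyond that column name.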
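Provider:
nlp4j
Summary of original information
Dataset overview
Basic information
- Dataset name: Wikipedia
- Language: Japanese (ja)
- License: CC-BY-SA-3.0
- Source datasets: original
Configurations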
Configuration 20230101.ja
- Features:
  - id: string
  - url: string
  - title: string
  - text: string
- Splits:
  - train:
    - Bytes: 5445627558
    - Examples: 2192693
- Download size: 3016211435 bytes
- Dataset size: 5445627558 bytes
Configuration 20230101.ja.type0
- Features:
  - id: string
  - url: string
  - title: string
  - wikitext: string
- Splits:
  - train:
    - Bytes: 12897936907
    - Examples: 2192693
- Download size: 6648740055 bytes
- Dataset size: 12897936907 bytes
Configuration 20230101.ja.type1
- Features:
  - id: string
  - url: string
  - title: string
  - text: string
- Splits:
  - train:
    - Bytes: 5445627558
    - Examples: 2192693
- Download size: 3016211435 bytes
- Dataset size: 5445627558 bytes
Configuration 20230801.ja.type1
- Features:
  - id: string
  - url: string
  - title: string
  - text: string
- Splits:
  - train:
    - Bytes: 5578527799
    - Examples: 2237531
- Download size: 3089288079 bytes
- Dataset size: 5578527799 bytes
Configuration 20230901.ja.type1
- Features:
  - id: string
  - url: string
  - title: string
  - text: string
- Splits:
  - train:
    - Bytes: 5595772816
    - Examples: 2243408
- Download size: 3099146546 bytes
- Dataset size: 5595772816 bytes
Configuration 20231001.ja.type1
- Features:
  - id: string
  - url: string
  - title: string
  - text: string
- Splits:
  - train:
    - Bytes: 5616001418
    - Examples: 2246589
- Download size: 3109672199 bytes
- Dataset size: 5616001418 bytes
Configuration 20231101.ja.type1
- Features:
  - id: string
  - url: string
  - title: string
  - text: string
- Splits:
  - train:
    - Bytes: 5636247958
    - Examples: 2252320
- Download size: 3120907128 bytes
- Dataset size: 5636247958 bytes
Configuration default
- Features:
  - id: string
  - url: string
  - title: string
  - text: string
- Splits:
  - train:
    - Bytes: 5132551154
    - Examples: 2192693
- Download size: 2888006523 bytes
- Dataset size: 5132551154 bytes
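
The raw byte counts above are easier to compare once reduced to a few ratios. The following sketch uses only figures copied from the `type1` configurations listed above and derives, per snapshot, the average bytes per article, the download-to-dataset size ratio, and the number of articles added since the previous snapshot. The `summarize` helper and its field names are illustrative assumptions, not part of the dataset.

```python
# (num_examples, dataset_size_bytes, download_size_bytes) for each type1 snapshot,
# copied from the configuration listing; insertion order is chronological.
SNAPSHOTS = {
    "20230101.ja.type1": (2192693, 5445627558, 3016211435),
    "20230801.ja.type1": (2237531, 5578527799, 3089288079),
    "20230901.ja.type1": (2243408, 5595772816, 3099146546),
    "20231001.ja.type1": (2246589, 5616001418, 3109672199),
    "20231101.ja.type1": (2252320, 5636247958, 3120907128),
}


def summarize(stats):
    """Derive simple per-snapshot ratios from the published sizes."""
    rows = []
    previous_examples = None
    for name, (examples, dataset_size, download_size) in stats.items():
        rows.append({
            "config": name,
            # Mean size of one article's record in the prepared dataset.
            "avg_bytes_per_article": dataset_size / examples,
            # Fraction of the prepared size that has to be downloaded.
            "download_ratio": download_size / dataset_size,
            # Articles added relative to the previous snapshot (None for the first).
            "new_articles": None if previous_examples is None
            else examples - previous_examples,
        })
        previous_examples = examples
    return rows


for row in summarize(SNAPSHOTS):
    print(
        f'{row["config"]}: '
        f'{row["avg_bytes_per_article"]:.0f} B/article, '
        f'download ratio {row["download_ratio"]:.3f}, '
        f'new articles {row["new_articles"]}'
    )
```

On these figures the downloaded files are a little over half the size of the prepared dataset for every snapshot, and the article count grows by 59627 between the 20230101 and 20231101 snapshots.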



