pablo-moreira/wikipedia-pt
Source: Hugging Face; updated 2023-10-06; indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/pablo-moreira/wikipedia-pt
Resource description:
---
dataset_info:
- config_name: '20231001'
  features:
  - name: id
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 2150584347
    num_examples: 1857355
  download_size: 0
  dataset_size: 2150584347
- config_name: latest
  features:
  - name: id
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 2150584347
    num_examples: 1857355
  download_size: 0
  dataset_size: 2150584347
configs:
- config_name: '20231001'
  data_files:
  - split: train
    path: 20231001/train-*
- config_name: latest
  data_files:
  - split: train
    path: latest/train-*
---
# Dataset Card for Wikipedia - Portuguese
## Dataset Description
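Text of Portuguese Wikipedia articles, with `id`, `title`, and `text` fields for each article. Available configurations:
- latest
- 20231001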
## Usage
```python
from datasets import load_dataset
# Load the 'latest' configuration.
dataset = load_dataset('pablo-moreira/wikipedia-pt', 'latest')
# Alternatively, pin the '20231001' configuration:
# dataset = load_dataset('pablo-moreira/wikipedia-pt', '20231001')
```
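Because the train split is roughly 2 GB, it can be convenient to look at a few records before downloading everything. The following is a minimal sketch, assuming the `streaming=True` mode of `datasets.load_dataset`; the helper names `load_stream` and `preview` are hypothetical and not part of this dataset or the library.

```python
from itertools import islice


def load_stream(config="latest"):
    """Open the train split lazily instead of downloading all files first."""
    # Imported here so that preview() can be used without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        "pablo-moreira/wikipedia-pt", config, split="train", streaming=True
    )


def preview(records, n=3, width=80):
    """Return (id, title, text snippet) for the first n records of an iterable of dicts."""
    rows = []
    for record in islice(records, n):
        # Collapse whitespace so each snippet fits on a single line.
        snippet = " ".join(record["text"].split())[:width]
        rows.append((record["id"], record["title"], snippet))
    return rows
```

Calling `preview(load_stream("latest"))` should fetch only as much data as the first few records require; `preview` also accepts any in-memory list of dicts with the same three fields.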
## Extractor
The notebook below contains the code used to extract documents from the Wikipedia dump. It is based on code from the fast.ai NLP introduction course.
[Notebook](extractor.ipynb)
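To illustrate the kind of step such an extractor performs, here is a rough sketch that splits already-extracted dump text into records with the dataset's three fields. It is not the notebook's code: it assumes the dump has been pre-processed into WikiExtractor-style `<doc id="..." url="..." title="...">` blocks, and the names `DOC_RE` and `split_docs` are hypothetical.

```python
import re

# Assumed WikiExtractor-style block: <doc id="..." url="..." title="..."> ... </doc>
DOC_RE = re.compile(
    r'<doc id="(?P<id>[^"]*)"[^>]*?title="(?P<title>[^"]*)"[^>]*>\n?(?P<text>.*?)</doc>',
    re.DOTALL,
)


def split_docs(extracted):
    """Turn extracted dump text into a list of {id, title, text} records."""
    return [
        {"id": m["id"], "title": m["title"], "text": m["text"].strip()}
        for m in DOC_RE.finditer(extracted)
    ]
```

A regular expression is enough here only because the assumed input is already flat text with simple delimiters; parsing the raw XML dump itself would need a proper parser.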
## Links
- **[Wikipedia dumps](https://dumps.wikimedia.org/)**
- **[A Code-First Intro to Natural Language Processing](https://github.com/fastai/course-nlp)**
- **[Extractor Code](https://github.com/fastai/course-nlp/blob/master/nlputils.py)**
Provider:
pablo-moreira
Original information summary
Dataset overview
Dataset configurations
- Configuration name: 20231001
  - Features:
    - id: string
    - title: string
    - text: string
  - Splits:
    - train:
      - Bytes: 2150584347
      - Examples: 1857355
  - Download size: 0
  - Dataset size: 2150584347
- Configuration name: latest
  - Features:
    - id: string
    - title: string
    - text: string
  - Splits:
    - train:
      - Bytes: 2150584347
      - Examples: 1857355
  - Download size: 0
  - Dataset size: 2150584347
Data files
- Configuration name: 20231001
  - Data files:
    - Split: train
    - Path: 20231001/train-*
- Configuration name: latest
  - Data files:
    - Split: train
    - Path: latest/train-*



