Alverciito/wikipedia_articles_es_tokenized
Source: Hugging Face. Updated 2026-01-11; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Alverciito/wikipedia_articles_es_tokenized
Resource description:
---
license: mit
task_categories:
- text-classification
- sentence-similarity
- token-classification
language:
- es
tags:
- wikipedia
- spanish
- es
- segmentation
- sentence-segmentation
- document-segmentation
- sentence-transformer
- long-context
- bpe
- tokenized
- memory-mapped
- nlp
pretty_name: Wikipedia Article Segmentation ES - Tokenized
size_categories:
- 100M<n<1B
---
# Wikipedia Article Segmentation ES — Tokenized Dataset
## Overview
**Wikipedia Article Segmentation ES — Tokenized** is a large-scale, fully tokenized version of the *Wikipedia Article Segmentation ES* dataset.
It is designed for **efficient training of sentence and document segmentation models**, enabling high-throughput access through memory-mapped arrays.
This dataset provides **pre-tokenized inputs, segmentation labels, and masks**, removing the need for on-the-fly tokenization and sentence splitting during training.
---
## Contents
- [Overview](#overview)
- [Dataset Origin](#dataset-origin)
- [Task Categories](#task-categories)
- [Language](#language)
- [License](#license)
- [Dataset Splits](#dataset-splits)
- [Tokenized Dataset Structure](#tokenized-dataset-structure)
- [Metadata (`info.json`)](#metadata-infojson)
  - [Example Structure](#example-structure)
- [Derived Attributes](#derived-attributes)
  - [`max_sentences`](#max_sentences)
  - [`max_tokens`](#max_tokens)
- [Data Fields (Per Sample)](#data-fields-per-sample)
  - [Notes](#notes)
- [Loading the Tokenized Dataset](#loading-the-tokenized-dataset)
- [Using a DataLoader](#using-a-dataloader)
- [Output Formats](#output-formats)
- [Design Goals](#design-goals)
- [Reproducibility](#reproducibility)
- [Known Limitations](#known-limitations)
- [Intended Use](#intended-use)
- [Citation](#citation)
- [Author](#author)
---
## Dataset Origin
This tokenized dataset is derived from the base dataset:
**Wikipedia Article Segmentation ES**
The base dataset consists of segmented Spanish Wikipedia articles, where each sample may contain **multiple concatenated articles**, preserving paragraph and sentence structure.
The tokenized version applies:
- Sentence segmentation using SpaCy
- Subword tokenization using a custom-trained **BPE tokenizer**
- Fixed-size padding and masking
- Memory-mapped storage for scalability
---
## Task Categories
- Text segmentation
- Sentence boundary detection
- Long-document modeling
- Text classification
- Sentence similarity
- Document-level representation learning
---
## Language
- Spanish (`es`)
---
## License
- **MIT**
Wikipedia content is redistributed under its original license terms.
---
## Dataset Splits
The dataset is divided into three subsets:
| Split | Name | Description |
|------|------|-------------|
| Train | `wikipedia-es-A000` | 26,510 grouped samples |
| Validation | `wikipedia-es-A001` | 3,336 grouped samples |
| Test | `wikipedia-es-A002` | 6,557 grouped samples |
Each split is tokenized independently using the same tokenizer configuration.
---
## Tokenized Dataset Structure
Each tokenized dataset directory contains:
```text
tokenized_dataset/
├── info.json
├── x.memmap # Tokenized input IDs
├── y.memmap # Sentence boundary labels
├── x_mask.memmap # Attention masks
├── y_mask.memmap # Sentence validity mask
└── y_cand.memmap # Sentence candidate mask
```
All arrays are stored as NumPy memmaps for fast, low-memory access.
## Metadata (`info.json`)
The `info.json` file describes the **layout, data types, and tensor shapes** of the tokenized dataset stored on disk.
It is required to map the memory-mapped arrays correctly, and its unique `fingerprint` is used to verify dataset integrity.
### Example Structure
```json
{
  "samples": 26510,
  "fingerprint": "...",
  "x": {
    "name": "x",
    "dtype": "int32",
    "samples": 26510,
    "element_shape": [max_sentences, max_tokens]
  },
  "y": {
    "name": "y",
    "dtype": "int8",
    "samples": 26510,
    "element_shape": [max_sentences]
  },
  "x_mask": {
    "name": "x_mask",
    "dtype": "int8",
    "samples": 26510,
    "element_shape": [max_sentences, max_tokens]
  },
  "y_mask": {
    "name": "y_mask",
    "dtype": "int8",
    "samples": 26510,
    "element_shape": [max_sentences]
  },
  "y_cand": {
    "name": "y_cand",
    "dtype": "int8",
    "samples": 26510,
    "element_shape": [max_sentences]
  }
}
```
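The metadata above is enough to open the arrays without any custom loader. The following is a minimal sketch, not the author's loading code. It assumes that each file is named `<name>.memmap`, that the files are raw buffers with the sample axis first in C order, and that `element_shape` holds concrete integers in the real `info.json`. The helper name `open_memmaps` is hypothetical.

```python
import json
from pathlib import Path

import numpy as np


def open_memmaps(root):
    """Open every array described in info.json as a read-only memmap."""
    root = Path(root)
    info = json.loads((root / "info.json").read_text())
    arrays = {}
    for key in ("x", "y", "x_mask", "y_mask", "y_cand"):
        meta = info[key]
        # Assumed layout: sample axis first, then element_shape, C order.
        shape = (meta["samples"], *meta["element_shape"])
        arrays[key] = np.memmap(
            root / f"{meta['name']}.memmap",  # assumed file naming
            dtype=meta["dtype"],
            mode="r",
            shape=shape,
        )
    return info, arrays
```

Opening in mode `"r"` keeps the files read-only, so an accidental in-place write cannot corrupt the dataset.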
## Derived Attributes
The following attributes are inferred from the metadata and are consistent across the dataset:
### `max_sentences`
Maximum number of sentences per sample.
Samples with fewer sentences are padded up to this limit.
### `max_tokens`
Maximum number of tokens per sentence.
Sentences longer than this value are truncated.
These fixed dimensions allow efficient batching and fast memory-mapped access.
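The padding and truncation rules can be illustrated with a small sketch. This is not the pipeline used to build the dataset. The function name `pad_sample` is hypothetical, and `pad_id=0` is an assumption, since the actual padding ID depends on the tokenizer.

```python
import numpy as np


def pad_sample(sentences, max_sentences, max_tokens, pad_id=0):
    """Pad/truncate a list of token-ID lists to fixed-size arrays.

    pad_id=0 is an assumption; the real padding ID depends on the tokenizer.
    """
    x = np.full((max_sentences, max_tokens), pad_id, dtype=np.int32)
    x_mask = np.zeros((max_sentences, max_tokens), dtype=np.int8)
    y_mask = np.zeros(max_sentences, dtype=np.int8)
    # Sentences beyond max_sentences are dropped in this sketch.
    for i, tokens in enumerate(sentences[:max_sentences]):
        tokens = tokens[:max_tokens]  # truncate long sentences
        x[i, : len(tokens)] = tokens
        x_mask[i, : len(tokens)] = 1  # mark real tokens
        y_mask[i] = 1  # mark the sentence slot as valid
    return x, x_mask, y_mask
```

For example, with `max_sentences=3` and `max_tokens=4`, a sample of two sentences with five tokens and one token yields one truncated row, one padded row, and one fully padded row.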
---
## Data Fields (Per Sample)
Each sample in the tokenized dataset consists of the following tensors:
| Field | Shape | Type | Description |
|------|------|------|-------------|
| `x` | `max_sentences × max_tokens` | `int32` | Tokenized input IDs |
| `x_mask` | `max_sentences × max_tokens` | `int8` | Attention mask for valid tokens |
| `y` | `max_sentences` | `int8` | Sentence boundary labels (1 = boundary) |
| `y_mask` | `max_sentences` | `int8` | Mask indicating valid sentences |
| `y_cand` | `max_sentences` | `int8` | Candidate positions for sentence boundaries |
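
Because every field has a fixed shape and dtype, the expected size of each file follows from simple arithmetic. The sketch below assumes the files are raw buffers with no header, which is an assumption about the storage format. The function name `expected_nbytes` is hypothetical.

```python
import numpy as np


def expected_nbytes(samples, max_sentences, max_tokens):
    """Return the expected size in bytes of each memmap file."""
    # dtype and per-sample shape of each field, as listed in the table above.
    fields = {
        "x": ("int32", (max_sentences, max_tokens)),
        "x_mask": ("int8", (max_sentences, max_tokens)),
        "y": ("int8", (max_sentences,)),
        "y_mask": ("int8", (max_sentences,)),
        "y_cand": ("int8", (max_sentences,)),
    }
    return {
        name: samples * int(np.prod(shape)) * np.dtype(dtype).itemsize
        for name, (dtype, shape) in fields.items()
    }
```

Comparing these values with the actual file sizes on disk is a quick way to catch a mismatch between `info.json` and the arrays before training starts.
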
---
### Notes
- All arrays are stored as **NumPy memory-mapped files** for efficient disk access.
- Padding positions are always masked out using `x_mask` and `y_mask`.
- `y_cand` restricts boundary prediction to structurally valid positions (e.g. paragraph breaks or article starts).
- The dataset `fingerprint` ensures compatibility between the dataset and the tokenizer configuration.
This structure allows models to reason explicitly about **sentence structure, boundaries, and padding**, while maintaining high training throughput.
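One way the masks might be used during training is to restrict the loss to sentences that are both valid and candidate positions. The sketch below uses NumPy and a mean binary cross-entropy. Combining `y_mask` and `y_cand` with a logical AND is an assumption, not a documented part of the author's training setup, and `masked_bce` is a hypothetical name.

```python
import numpy as np


def masked_bce(probs, y, y_mask, y_cand, eps=1e-7):
    """Mean binary cross-entropy over valid candidate positions only."""
    probs = np.clip(probs, eps, 1.0 - eps)  # avoid log(0)
    y = y.astype(np.float64)
    loss = -(y * np.log(probs) + (1.0 - y) * np.log(1.0 - probs))
    # Assumption: a position counts only if it is valid AND a candidate.
    keep = y_mask.astype(bool) & y_cand.astype(bool)
    n = keep.sum()
    return float((loss * keep).sum() / n) if n else 0.0
```

Positions excluded by either mask contribute nothing to the result, so padded sentences do not dilute the loss.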
---
## Loading the Tokenized Dataset
The tokenized dataset is designed to be loaded directly from disk using memory-mapped arrays.
```python
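# Assumes the author's `src` package is importable from the working directory.
from src.tokenized_dataset import TokenizedSegmentationDataset

dataset = TokenizedSegmentationDataset(
    # Directory holding info.json and the *.memmap files
    tokenized_dataset="/path/to/tokenized_dataset",
    percentage=1.0,
    return_type=dict,  # dictionary output; see Output Formats
)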
```
---
## Using a DataLoader
```python
loader = dataset.get_loader(
    batch_size=8,
    shuffle=True,
    num_workers=0,
)
```
The loader yields fully prepared tensors, ready to be passed to a model without additional preprocessing.
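When the author's loader is not available, batches can still be drawn directly from the memory-mapped arrays. The following NumPy-only sketch is a stand-in and does not reproduce `get_loader`, whose implementation is not shown here. The function name `iter_batches` and the `seed` parameter are hypothetical.

```python
import numpy as np


def iter_batches(arrays, batch_size, shuffle=True, seed=0):
    """Yield dict batches from a mapping of equally sized arrays."""
    n = len(arrays["x"])
    rng = np.random.default_rng(seed)
    order = rng.permutation(n) if shuffle else np.arange(n)
    for start in range(0, n, batch_size):
        # Sorting the indices keeps reads on a memmap roughly sequential.
        idx = np.sort(order[start : start + batch_size])
        # Fancy indexing copies only the selected samples into memory.
        yield {name: np.asarray(arr[idx]) for name, arr in arrays.items()}
```

Only the samples in the current batch are copied into RAM, which is the property that memory mapping is meant to provide.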
---
## Output Formats
The dataset supports two output formats, configurable via the `return_type` parameter.
```python
# Dictionary format (default)
{
    "input": x,
    "input_mask": x_mask,
    "labels": y,
    "output_mask": y_mask,
    "candidate_mask": y_cand,
}

# Tuple format
(x, y, x_mask, y_mask, y_cand)
```
The tuple format is intended for lightweight or legacy training loops.
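Code written for one format can be adapted to the other with a small conversion helper. The sketch below relies only on the key names and tuple order listed above. The names `TUPLE_ORDER`, `dict_to_tuple`, and `tuple_to_dict` are hypothetical and not part of the author's code.

```python
# Dictionary keys arranged in the tuple order (x, y, x_mask, y_mask, y_cand).
TUPLE_ORDER = ("input", "labels", "input_mask", "output_mask", "candidate_mask")


def dict_to_tuple(batch):
    """Convert a dictionary batch to the tuple format."""
    return tuple(batch[key] for key in TUPLE_ORDER)


def tuple_to_dict(batch):
    """Convert a tuple batch to the dictionary format."""
    return dict(zip(TUPLE_ORDER, batch))
```

Note that the tuple places the labels second, before the input mask, so positional unpacking should follow that order.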
---
## Design Goals
This tokenized dataset is designed to:
- Remove runtime tokenization and sentence segmentation overhead
- Enable fast iteration over very large datasets
- Support long-context document segmentation models
- Minimize RAM usage through memory mapping
- Provide explicit structural supervision for sentence boundaries
---
## Reproducibility
The dataset is fully reproducible given:
- The same Wikipedia ZIM snapshot
- The same tokenizer configuration
- The same sentence segmentation parameters
- The same random seed
No heuristic filtering is applied beyond sentence segmentation and whitespace normalization, making the dataset suitable for controlled experiments and benchmarking.
---
## Known Limitations
- The number of sentences per sample is capped at `max_sentences`
- Token sequences are truncated to `max_tokens`
- Titles are not included in the tokenized representation
- Internal Wikipedia references are not preserved
- Sentence boundaries are restricted to predefined candidate positions
---
## Intended Use
This dataset is intended for research and development in:
- Sentence and document segmentation
- Boundary detection models
- Long-context language modeling
- Structured document understanding
- Spanish-language NLP benchmarks
---
## Citation
If you use this dataset in academic or research work, please cite:
**Alberto Palomo Alonso**
Universidad de Alcalá — Escuela Politécnica Superior
Spanish Wikipedia (offline ZIM snapshot)
---
## Author
**Alberto Palomo Alonso**
Universidad de Alcalá
Escuela Politécnica Superior
Provider:
Alverciito



