Alverciito/wikipedia_articles_es_tokenized

Name: Alverciito/wikipedia_articles_es_tokenized
Creator: Alverciito
Published: 2026-01-11 20:19:59
License: No description provided

Source: Hugging Face. Updated: 2026-01-11. Indexed: 2026-03-29.

Download link:

https://hf-mirror.com/datasets/Alverciito/wikipedia_articles_es_tokenized


Description:

---
license: mit
task_categories:
- text-classification
- sentence-similarity
- token-classification
language:
- es
tags:
- wikipedia
- spanish
- es
- segmentation
- sentence-segmentation
- document-segmentation
- sentence-transformer
- long-context
- bpe
- tokenized
- memory-mapped
- nlp
pretty_name: Wikipedia Article Segmentation ES - Tokenized
size_categories:
- 100M<n<1B
---

# Wikipedia Article Segmentation ES - Tokenized Dataset

## Overview

**Wikipedia Article Segmentation ES - Tokenized** is a large-scale, fully tokenized version of the *Wikipedia Article Segmentation ES* dataset. It is designed for **efficient training of sentence and document segmentation models**, enabling high-throughput access through memory-mapped arrays.

This dataset provides **pre-tokenized inputs, segmentation labels, and masks**, removing the need for on-the-fly tokenization and sentence splitting during training.

---

## Contents

- [Overview](#overview)
- [Dataset Origin](#dataset-origin)
- [Task Categories](#task-categories)
- [Language](#language)
- [License](#license)
- [Dataset Splits](#dataset-splits)
- [Tokenized Dataset Structure](#tokenized-dataset-structure)
- [Metadata (`info.json`)](#metadata-infojson)
  - [Example Structure](#example-structure)
- [Derived Attributes](#derived-attributes)
  - [`max_sentences`](#max_sentences)
  - [`max_tokens`](#max_tokens)
- [Data Fields (Per Sample)](#data-fields-per-sample)
  - [Notes](#notes)
- [Loading the Tokenized Dataset](#loading-the-tokenized-dataset)
- [Using a DataLoader](#using-a-dataloader)
- [Output Formats](#output-formats)
- [Design Goals](#design-goals)
- [Reproducibility](#reproducibility)
- [Known Limitations](#known-limitations)
- [Intended Use](#intended-use)
- [Citation](#citation)
- [Author](#author)

---

## Dataset Origin

This tokenized dataset is derived from the base dataset:

**Wikipedia Article Segmentation ES**

The base dataset consists of segmented Spanish Wikipedia articles, where each sample may contain **multiple
concatenated articles**, preserving paragraph and sentence structure.

The tokenized version applies:

- Sentence segmentation using spaCy
- Subword tokenization using a custom-trained **BPE tokenizer**
- Fixed-size padding and masking
- Memory-mapped storage for scalability

---

## Task Categories

- Text segmentation
- Sentence boundary detection
- Long-document modeling
- Text classification
- Sentence similarity
- Document-level representation learning

---

## Language

- Spanish (`es`)

---

## License

- **MIT**

Wikipedia content is redistributed under its original license terms.

---

## Dataset Splits

The dataset is divided into three subsets:

| Split | Name | Description |
|------|------|-------------|
| Train | `wikipedia-es-A000` | 26,510 grouped samples |
| Validation | `wikipedia-es-A001` | 3,336 grouped samples |
| Test | `wikipedia-es-A002` | 6,557 grouped samples |

Each split is tokenized independently using the same tokenizer configuration.

---

## Tokenized Dataset Structure

Each tokenized dataset directory contains:

```text
tokenized_dataset/
├── info.json
├── x.memmap        # Tokenized input IDs
├── y.memmap        # Sentence boundary labels
├── x_mask.memmap   # Attention masks
├── y_mask.memmap   # Sentence validity mask
└── y_cand.memmap   # Sentence candidate mask
```

All arrays are stored as NumPy memmaps for fast, low-memory access.

## Metadata (`info.json`)

The `info.json` file describes the **layout, data types, and tensor shapes** of the tokenized dataset stored on disk. It is required to map the memory-mapped arrays correctly, and it carries a unique fingerprint used to check dataset integrity.
### Example Structure

```json
{
  "samples": 26510,
  "fingerprint": "...",
  "x": {
    "name": "x",
    "dtype": "int32",
    "samples": 26510,
    "element_shape": [max_sentences, max_tokens]
  },
  "y": {
    "name": "y",
    "dtype": "int8",
    "samples": 26510,
    "element_shape": [max_sentences]
  },
  "x_mask": {
    "name": "x_mask",
    "dtype": "int8",
    "samples": 26510,
    "element_shape": [max_sentences, max_tokens]
  },
  "y_mask": {
    "name": "y_mask",
    "dtype": "int8",
    "samples": 26510,
    "element_shape": [max_sentences]
  },
  "y_cand": {
    "name": "y_cand",
    "dtype": "int8",
    "samples": 26510,
    "element_shape": [max_sentences]
  }
}
```

## Derived Attributes

The following attributes are inferred from the metadata and are consistent across the dataset:

### `max_sentences`

Maximum number of sentences per sample. Samples with fewer sentences are padded up to this limit.

### `max_tokens`

Maximum number of tokens per sentence. Sentences longer than this value are truncated.

These fixed dimensions allow efficient batching and fast memory-mapped access.

---

## Data Fields (Per Sample)

Each sample in the tokenized dataset consists of the following tensors:

| Field | Shape | Type | Description |
|------|------|------|-------------|
| `x` | `max_sentences × max_tokens` | `int32` | Tokenized input IDs |
| `x_mask` | `max_sentences × max_tokens` | `int8` | Attention mask for valid tokens |
| `y` | `max_sentences` | `int8` | Sentence boundary labels (1 = boundary) |
| `y_mask` | `max_sentences` | `int8` | Mask indicating valid sentences |
| `y_cand` | `max_sentences` | `int8` | Candidate positions for sentence boundaries |

---

#### Notes

- All arrays are stored as **NumPy memory-mapped files** for efficient disk access.
- Padding positions are always masked out using `x_mask` and `y_mask`.
- `y_cand` restricts boundary prediction to structurally valid positions (e.g. paragraph breaks or article starts).
- The dataset `fingerprint` ensures compatibility between the dataset and the tokenizer configuration.
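The layout described above can also be inspected without the author's loader class. The following is a minimal sketch, not the dataset's official API: it assumes each `*.memmap` file is a raw, header-less, C-ordered array whose full shape is `(samples, *element_shape)`, which the card does not state explicitly. The helper names (`open_tokenized_arrays`, `valid_sentence_counts`, `write_toy_dataset`) are hypothetical, and the toy writer exists only so the example runs on synthetic data.

```python
import json
import tempfile
from pathlib import Path

import numpy as np

FIELDS = ("x", "y", "x_mask", "y_mask", "y_cand")


def open_tokenized_arrays(root):
    """Open every array described in info.json as a read-only np.memmap.

    Assumption (not confirmed by the dataset card): files are raw C-ordered
    buffers named `<field>.memmap` with shape (samples, *element_shape).
    """
    root = Path(root)
    info = json.loads((root / "info.json").read_text())
    arrays = {}
    for field in FIELDS:
        spec = info[field]
        shape = (spec["samples"], *spec["element_shape"])
        arrays[field] = np.memmap(
            str(root / f"{field}.memmap"),
            dtype=spec["dtype"],
            mode="r",
            shape=shape,
        )
    return info, arrays


def valid_sentence_counts(arrays):
    """Count non-padding sentences per sample by summing y_mask."""
    return np.asarray(arrays["y_mask"]).sum(axis=1)


def write_toy_dataset(root, samples=3, max_sentences=4, max_tokens=5):
    """Write a tiny synthetic dataset in the assumed layout (demo only)."""
    root = Path(root)
    shapes = {
        "x": [max_sentences, max_tokens],
        "y": [max_sentences],
        "x_mask": [max_sentences, max_tokens],
        "y_mask": [max_sentences],
        "y_cand": [max_sentences],
    }
    info = {"samples": samples, "fingerprint": "toy"}
    for field, element_shape in shapes.items():
        dtype = "int32" if field == "x" else "int8"
        data = np.zeros((samples, *element_shape), dtype=dtype)
        if field == "y_mask":
            # Sample i has i + 1 valid sentences; the rest is padding.
            for i in range(samples):
                data[i, : min(i + 1, max_sentences)] = 1
        data.tofile(str(root / f"{field}.memmap"))
        info[field] = {
            "name": field,
            "dtype": dtype,
            "samples": samples,
            "element_shape": element_shape,
        }
    (root / "info.json").write_text(json.dumps(info))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        write_toy_dataset(tmp)
        info, arrays = open_tokenized_arrays(tmp)
        print(arrays["x"].shape, arrays["x"].dtype)
        print(valid_sentence_counts(arrays))
        del arrays  # release the memmaps before the directory is removed
```

On the real data, `open_tokenized_arrays("/path/to/tokenized_dataset")` would be the only call needed, provided the layout assumption holds. Opening in mode `"r"` keeps the files read-only, so pages are loaded from disk only when a slice is actually accessed.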
This structure allows models to reason explicitly about **sentence structure, boundaries, and padding**, while maintaining high training throughput.

---

## Loading the Tokenized Dataset

The tokenized dataset is designed to be loaded directly from disk using memory-mapped arrays.

```python
from src.tokenized_dataset import TokenizedSegmentationDataset

dataset = TokenizedSegmentationDataset(
    tokenized_dataset="/path/to/tokenized_dataset",
    percentage=1.0,
    return_type=dict
)
```

---

## Using a DataLoader

```python
loader = dataset.get_loader(
    batch_size=8,
    shuffle=True,
    num_workers=0
)
```

The loader yields fully prepared tensors, ready to be passed to a model without additional preprocessing.

---

## Output Formats

The dataset supports two output formats, configurable via the `return_type` parameter.

```python
# Dictionary format (default)
{
    "input": x,
    "input_mask": x_mask,
    "labels": y,
    "output_mask": y_mask,
    "candidate_mask": y_cand
}

# Tuple format
(x, y, x_mask, y_mask, y_cand)
```

The tuple format is intended for lightweight or legacy training loops.

---

## Design Goals

This tokenized dataset is designed to:

- Remove runtime tokenization and sentence segmentation overhead
- Enable fast iteration over very large datasets
- Support long-context document segmentation models
- Minimize RAM usage through memory mapping
- Provide explicit structural supervision for sentence boundaries

---

## Reproducibility

The dataset is fully reproducible given:

- The same Wikipedia ZIM snapshot
- The same tokenizer configuration
- The same sentence segmentation parameters
- The same random seed

No heuristic filtering is applied beyond sentence segmentation and whitespace normalization, making the dataset suitable for controlled experiments and benchmarking.
---

## Known Limitations

- The number of sentences per sample is capped at `max_sentences`
- Token sequences are truncated to `max_tokens`
- Titles are not included in the tokenized representation
- Internal Wikipedia references are not preserved
- Sentence boundaries are restricted to predefined candidate positions

---

## Intended Use

This dataset is intended for research and development in:

- Sentence and document segmentation
- Boundary detection models
- Long-context language modeling
- Structured document understanding
- Spanish-language NLP benchmarks

---

## Citation

If you use this dataset in academic or research work, please cite:

**Alberto Palomo Alonso**
Universidad de Alcalá, Escuela Politécnica Superior

Spanish Wikipedia (offline ZIM snapshot)

---

## Author

**Alberto Palomo Alonso**
Universidad de Alcalá, Escuela Politécnica Superior

Provider:

Alverciito
