NNEngine/Gutenberg-Clean

Name: NNEngine/Gutenberg-Clean
Creator: NNEngine
Published: 2026-01-24 13:43:16
License: No description available

Hugging Face, updated 2026-01-24, indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/NNEngine/Gutenberg-Clean

Description:

---
license: mit
task_categories:
- text-classification
- question-answering
- text-generation
language:
- en
size_categories:
- 10M<n<100M
---

# 📚 TinyWay-Gutenberg-Clean (Compressed Shards)

A large-scale, high-quality English text dataset derived from Project Gutenberg. The corpus has been cleaned, normalized, deduplicated, segmented into fixed-length samples, and stored as compressed JSONL shards for efficient large-scale language model training.

This dataset is intended for pretraining and experimentation with small and medium language models such as **TinyWay**, tokenizer training, and large-scale NLP research.

---

## 📦 Dataset Overview

* **Name:** TinyWay-Gutenberg-Clean
* **Current Release:** ~19 compressed shards (`.jsonl.gz`)
* **Estimated Samples:** Tens of millions of text segments
* **Language:** English
* **Format:** Gzip-compressed JSON Lines (`.jsonl.gz`)
* **Source:** Project Gutenberg (public domain books)
* **License:** Public Domain
* **Maintainer:** Shivam (NNEngine / ITM AIR Lab)

Each record contains a clean text segment between **30 and 60 words**.

Future releases will scale this dataset further (e.g., 100M+ samples).

---

## Data Format

Each line is a JSON object:

```json
{
  "id": "twg_000000012345",
  "text": "Cleaned natural English text segment between thirty and sixty words.",
  "word_count": 42,
  "source": "gutenberg"
}
```

### Fields

| Field        | Description                    |
| ------------ | ------------------------------ |
| `id`         | Unique sample identifier       |
| `text`       | Clean English text segment     |
| `word_count` | Number of words in the segment |
| `source`     | Data source identifier         |

---

## Data Processing Pipeline

The dataset was generated using a fully streaming pipeline to ensure scalability and low memory usage.

### Processing Steps

1. **Streaming Input**
   * Text streamed from a Project Gutenberg mirror on Hugging Face.
2. **Text Cleaning**
   * Removed Gutenberg headers and footers.
   * Removed chapter titles, page numbers, and boilerplate text.
   * Normalized whitespace and line breaks.
   * Removed non-ASCII and control characters.
   * Filtered malformed or extremely short segments.
3. **Segmentation**
   * Text segmented into chunks of **30–60 words**.
4. **Validation**
   * Enforced word count limits.
   * Filtered invalid or noisy segments.
5. **Deduplication**
   * Exact hash-based deduplication applied during generation.
6. **Compression & Sharding**
   * Data stored as `.jsonl.gz` shards for efficient disk usage and streaming.

---

## How to Load the Dataset

### Using Hugging Face Datasets (Streaming)

```python
from datasets import load_dataset

dataset = load_dataset(
    "NNEngine/TinyWay-Gutenberg-Clean",
    split="train",
    streaming=True
)

for i, sample in enumerate(dataset):
    print(sample)
    if i == 3:
        break
```

---

### Reading a Shard Manually

```python
import gzip
import json

with gzip.open("train-00000.jsonl.gz", "rt", encoding="utf-8") as f:
    for _ in range(3):
        print(json.loads(next(f)))
```

---

## Dataset Characteristics (Approximate)

* **Average words per sample:** ~45
* **Style:** Literary and narrative English
* **Domain:** Fiction, non-fiction, historical texts
* **Vocabulary:** Large natural English vocabulary
* **Compression:** ~60–70% size reduction vs raw JSONL

Exact statistics may vary per shard and will be expanded in future releases.

---

## Limitations

* Primarily literary and historical language.
* No conversational chat data.
* No code or structured technical documentation.
* Some archaic vocabulary and sentence structures may appear.
* Deduplication is hash-based (near-duplicates may remain).

For conversational or web-style language modeling, this dataset should be mixed with complementary corpora.

---

## License

All source texts originate from Project Gutenberg and are in the **public domain**. This processed dataset is released for unrestricted research and commercial use.
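
The cleaning, segmentation, and deduplication steps listed under Processing Steps can be illustrated with a small streaming sketch. This is not the maintainer's pipeline: the card does not say how chunk boundaries are chosen, what happens to a short trailing chunk, or which hash is used. The sketch assumes fixed 60-word windows, drops any tail shorter than 30 words, and uses SHA-256. The function names `normalize`, `segment`, `dedup_exact`, and `build_records` are hypothetical.

```python
import hashlib
import re
from typing import Iterable, Iterator

MIN_WORDS = 30  # lower bound stated in the dataset card
MAX_WORDS = 60  # upper bound stated in the dataset card


def normalize(text: str) -> str:
    """Drop non-ASCII and control characters, then collapse whitespace."""
    ascii_only = text.encode("ascii", errors="ignore").decode("ascii")
    # Replace control characters other than tab, newline, and carriage return.
    printable = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]", " ", ascii_only)
    return re.sub(r"\s+", " ", printable).strip()


def segment(text: str, max_words: int = MAX_WORDS, min_words: int = MIN_WORDS) -> Iterator[str]:
    """Yield consecutive windows of up to max_words words (assumed chunking rule)."""
    words = normalize(text).split()
    for start in range(0, len(words), max_words):
        chunk = words[start:start + max_words]
        if len(chunk) >= min_words:  # assumption: a short tail is discarded
            yield " ".join(chunk)


def dedup_exact(segments: Iterable[str]) -> Iterator[str]:
    """Yield each distinct segment once, keyed by a SHA-256 digest (assumed hash)."""
    seen = set()
    for seg in segments:
        digest = hashlib.sha256(seg.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            yield seg


def build_records(texts: Iterable[str], source: str = "gutenberg") -> Iterator[dict]:
    """Stream raw texts through segmentation and deduplication into records."""
    all_segments = (seg for text in texts for seg in segment(text))
    for index, seg in enumerate(dedup_exact(all_segments)):
        yield {
            "id": f"twg_{index:012d}",  # mirrors the ID shape in the card's example
            "text": seg,
            "word_count": len(seg.split()),
            "source": source,
        }
```

Because every stage is a generator, only the set of digests grows with corpus size, which is consistent with the low-memory goal the card describes. Exact hashing catches only byte-identical segments, which is why the Limitations section notes that near-duplicates may remain.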

---

## Versioning & Roadmap

Planned future updates:

- Larger releases (target: 100M+ samples)
- Improved deduplication (near-duplicate filtering)
- Dataset statistics and analytics
- Additional language normalization

Each major release will be versioned clearly.

---

## Citation

If you use this dataset in research or publications, please cite:

```
TinyWay-Gutenberg-Clean
Shivam (NNEngine), 2026
```
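
When consuming shards, it can be useful to check records against the fields documented under Data Format before training on them. The following is a minimal sketch, not a tool shipped with the dataset. The helper names `record_problems` and `iter_invalid` are hypothetical, and the card does not state how `word_count` was computed, so the whitespace-split count used here is an assumption. A mismatch may reflect a different counting rule rather than a corrupt record.

```python
import gzip
import json
from pathlib import Path
from typing import Iterator, List, Tuple

# Field names and types taken from the card's example record.
REQUIRED_FIELDS = {"id": str, "text": str, "word_count": int, "source": str}


def record_problems(record: dict) -> List[str]:
    """Return human-readable problems; an empty list means the record looks valid."""
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if not isinstance(record.get(name), expected):
            problems.append(f"{name}: missing or not {expected.__name__}")
    if problems:
        return problems
    actual = len(record["text"].split())  # assumption: words are whitespace-separated
    if actual != record["word_count"]:
        problems.append(f"word_count is {record['word_count']} but text has {actual} words")
    if not 30 <= actual <= 60:
        problems.append(f"text has {actual} words, outside the documented 30-60 range")
    return problems


def iter_invalid(shard_path: Path) -> Iterator[Tuple[int, List[str]]]:
    """Stream a .jsonl.gz shard and yield (line_number, problems) for bad lines."""
    with gzip.open(shard_path, "rt", encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            try:
                record = json.loads(line)
            except json.JSONDecodeError as error:
                yield line_number, [f"invalid JSON: {error.msg}"]
                continue
            if not isinstance(record, dict):
                yield line_number, ["line is not a JSON object"]
                continue
            problems = record_problems(record)
            if problems:
                yield line_number, problems
```

A typical use would be `for line_number, problems in iter_invalid(Path("train-00000.jsonl.gz")): print(line_number, problems)`, using the shard filename from the card's manual-reading example.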

Provider:

NNEngine
