
NNEngine/Gutenberg-Clean-40M

Hugging Face | Updated 2026-01-24 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/NNEngine/Gutenberg-Clean-40M
Resource description:
---
license: mit
task_categories:
- text-classification
language:
- en
tags:
- text-generation
- causal
- training
- transformers
- pytorch
- jsonl
- segmentation
- validation
size_categories:
- 10M<n<100M
---

# 📚 TinyWay-Gutenberg-Clean-40M

A large-scale, high-quality English text dataset derived from Project Gutenberg, cleaned, normalized, deduplicated, and segmented into fixed-length samples for efficient language model pretraining.

This dataset is designed to support training small and medium language models such as **TinyWay**, tokenizer training, embedding models, and large-scale NLP experimentation.

---

## Dataset Overview

* **Name:** TinyWay-Gutenberg-Clean-40M
* **Samples:** ~40,000,000
* **Language:** English
* **Format:** JSONL (optionally gzip-compressed)
* **Source:** Project Gutenberg (public domain books)
* **License:** Public Domain
* **Intended Use:** Language model pretraining, tokenizer training, representation learning

Each line in the dataset contains a clean text segment between **30 and 60 words**.

---

## Data Format

Each record is stored as a JSON object:

```json
{
  "id": "twg_000000000123",
  "text": "Cleaned text segment of natural English language between thirty and sixty words.",
  "word_count": 42,
  "source": "gutenberg"
}
```

### Fields

| Field        | Description                   |
| ------------ | ----------------------------- |
| `id`         | Unique sample identifier      |
| `text`       | Clean English text segment    |
| `word_count` | Number of words in the sample |
| `source`     | Data source identifier        |

---

## Data Processing Pipeline

The dataset was generated using a fully streaming pipeline to ensure scalability and low memory usage.

### Steps

1. **Streaming Input**
   * Data loaded from a Project Gutenberg mirror using Hugging Face streaming APIs.
2. **Text Cleaning**
   * Removed Gutenberg headers and footers
   * Removed chapter titles and page numbers
   * Normalized whitespace and line breaks
   * Removed non-ASCII and control characters
   * Removed URLs and artifacts
3. **Segmentation**
   * Text split into fixed segments of **30–60 words**.
4. **Validation**
   * Enforced word count constraints
   * Filtered short or malformed segments
5. **Deduplication**
   * Exact hash-based deduplication applied during generation.
6. **Output**
   * Stored as JSONL files (optionally gzip-compressed).
   * Sharded for easier distribution and loading.

---

## How to Load the Dataset

### Using Hugging Face Datasets

```python
from datasets import load_dataset

# Repository ID as it appears in the dataset URL on the Hub.
dataset = load_dataset(
    "NNEngine/Gutenberg-Clean-40M",
    split="train",
    streaming=True,
)

# Streaming mode yields samples lazily; take(3) stops after three records.
for sample in dataset.take(3):
    print(sample)
```

---

### Reading JSONL Manually

```python
import json

# Each line of the shard is one JSON record.
with open("data/train-00000.jsonl", "r", encoding="utf-8") as f:
    for _ in range(3):
        print(json.loads(next(f)))
```

If files are compressed:

```python
import gzip
import json

# "rt" opens the gzip stream in text mode so lines decode as UTF-8.
with gzip.open("train-00000.jsonl.gz", "rt", encoding="utf-8") as f:
    for _ in range(3):
        print(json.loads(next(f)))
```

---

## Dataset Characteristics

Approximate properties:

* **Average words per sample:** ~45
* **Vocabulary:** Large natural English vocabulary
* **Style:** Literary and narrative English
* **Domain:** Fiction, non-fiction, historical texts

---

## Limitations

* Content is primarily literary and historical in nature.
* No conversational, chat, or code data.
* Some archaic vocabulary and sentence structure may appear.
* Deduplication is hash-based (near-duplicates may remain).

For conversational or modern web text, additional datasets should be mixed in.

---

## License

All source texts originate from Project Gutenberg and are in the **public domain**.
This processed dataset is released for unrestricted research and commercial use.

---

## Citation

If you use this dataset in research or publications, please cite:

```
TinyWay-Gutenberg-Clean-40M
NNEngine, 2026
```

---

## 🧠 Maintainer

Created and maintained by **Shivam Sharma**
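
As an illustration of the cleaning, segmentation, validation, and deduplication steps listed in the processing pipeline, here is a minimal Python sketch. It is not the maintainer's actual code: the function names (`clean`, `segment`, `build_records`), the choice of SHA-1 as the hash, and the strategy of cutting every 60 words are all assumptions made for the example. Only the 30 to 60 word bounds and the record fields come from the dataset card. Header and chapter-title removal are omitted.

```python
import hashlib
import re

MIN_WORDS, MAX_WORDS = 30, 60  # bounds stated in the dataset card


def clean(text):
    """Drop URLs, non-ASCII and control characters, then collapse whitespace."""
    text = re.sub(r"https?://\S+", " ", text)
    text = text.encode("ascii", errors="ignore").decode("ascii")
    text = re.sub(r"[\x00-\x1f\x7f]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def segment(text, size=MAX_WORDS):
    """Yield word chunks of at most `size` words (assumed splitting strategy).

    Chunks outside the 30-60 word range are dropped, which doubles as
    the validation step.
    """
    words = text.split()
    for start in range(0, len(words), size):
        chunk = words[start:start + size]
        if MIN_WORDS <= len(chunk) <= MAX_WORDS:
            yield " ".join(chunk)


def build_records(books):
    """Stream records, skipping segments whose exact hash was already seen."""
    seen = set()
    index = 0
    for book in books:
        for seg in segment(clean(book)):
            digest = hashlib.sha1(seg.encode("utf-8")).digest()
            if digest in seen:
                continue  # exact duplicate of an earlier segment
            seen.add(digest)
            yield {
                "id": f"twg_{index:012d}",  # mirrors the id shape in the card
                "text": seg,
                "word_count": len(seg.split()),
                "source": "gutenberg",
            }
            index += 1
```

Because every stage is a generator, only the set of seen hashes grows with the input, which matches the low-memory goal the card describes. Writing each yielded record with `json.dumps` on its own line would produce the JSONL layout shown in the Data Format section.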
Provider:
NNEngine