
kd13/bookcorpus-clean

Source: Hugging Face. Updated 2026-04-14. Indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/kd13/bookcorpus-clean
Resource description:
---
license: mit
task_categories:
- text-generation
- text-classification
- token-classification
- question-answering
- zero-shot-classification
- summarization
- feature-extraction
- fill-mask
- sentence-similarity
language:
- en
tags:
- NLP
pretty_name: bookcorpus
size_categories:
- 10M<n<100M
---

# BookCorpus: Cleaned for Pre-training LLMs

A cleaned, deduplicated, document-segmented version of [`SamuelYang/bookcorpus`](https://huggingface.co/datasets/SamuelYang/bookcorpus).

## TL;DR

| Property | Value |
|---|---|
| Rows (sentences) | **33,649,142** |
| Documents (books) | **4,086** |
| Format | CSV, 3 columns: `doc_id`, `sent_id`, `text` |
| Language | English (lowercased) |
| Source | `SamuelYang/bookcorpus` (74,004,228 raw rows) |

## Schema

| Column | Type | Description |
|---|---|---|
| `doc_id` | int | Inferred document/book identifier. Sentences sharing the same `doc_id` come from the same book. |
| `sent_id` | int | Sentence position within its document (0-indexed). Preserves original order. |
| `text` | string | Cleaned sentence text (lowercased, normalized). |

## How to use it

### Quick load

```python
from datasets import load_dataset

ds = load_dataset("kd13/bookcorpus-clean", split="train")
print(ds[0])
# {'doc_id': 0, 'sent_id': 0, 'text': 'i wish i had a better answer ...'}
```

## Cleaning pipeline

Applied in this order to the source dataset:

1. **Unicode + whitespace normalization**: NFKC normalization, collapse consecutive whitespace, strip.
2. **Document segmentation**: since the source is a flat stream of sentences without book IDs, document boundaries are inferred from telltale markers at the start of books:
   - ISBN lines (e.g. `isbn : 1492913731`)
   - Copyright declarations (`copyright 2013 ...`)
   - `all rights reserved`
   - `chapter 1`
3. **Line-level filters**: sentences are dropped if they:
   - have fewer than **20** or more than **1000** characters
   - match boilerplate patterns (copyright, ISBN, "all rights reserved")
   - have an alphabetic-character ratio below **0.6**
   - have a digit ratio above **0.3**
   - contain no alphabetic characters
4. **Language filter**: cheap English stop-word ratio check (≥ 5% of tokens must be in a small English stop-word set; short lines pass through).
5. **Within-document exact dedup**: SHA-1 hashing drops repeated sentences inside the same book (e.g. recurring chapter headers, section dividers). Note: dedup is *not* applied globally, because sentences like "he nodded." occur legitimately across many books.
6. **Document filter**: books with fewer than **8** surviving sentences are dropped (not enough context for NSP).
7. **Cross-document near-duplicate removal**: a SHA-1 fingerprint of each document's first 5 sentences identifies same-book re-uploads; duplicates are dropped.

## Cleaning statistics

| Metric | Value |
|---|---|
| Raw rows (sentences) in source | 74,004,228 |
| Documents detected | 6,779 |
| Documents kept | **4,086** |
| Documents dropped (< 8 sentences) | 973 |
| Documents dropped (near-duplicate) | 1,720 |
| Sentences kept | **33,649,142** |

Drop rate: ~40% of detected documents removed (mostly same-book re-uploads and too-short documents).

## Source & licensing

- **Source dataset:** [`SamuelYang/bookcorpus`](https://huggingface.co/datasets/SamuelYang/bookcorpus)
- **Original corpus:** BookCorpus (Zhu et al., 2015), originally scraped from Smashwords. The original BookCorpus has well-documented provenance and consent concerns; downstream users should review them before commercial use.
- This cleaned derivative is released under the **MIT License** for the cleaning code and structuring effort. The underlying text retains whatever rights apply to the upstream source.
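
Steps 1 and 3 of the cleaning pipeline can be sketched with the Python standard library. This is a minimal illustration, not the code used to build the dataset: the names `normalize` and `keep_sentence` are hypothetical, and the sketch assumes the alphabetic and digit ratios are computed over the full sentence length (whitespace and punctuation included), which the description does not specify. The boilerplate-pattern check is left out.

```python
import re
import unicodedata

# Thresholds taken from the line-level filter description.
MIN_CHARS, MAX_CHARS = 20, 1000
MIN_ALPHA_RATIO = 0.6
MAX_DIGIT_RATIO = 0.3


def normalize(text: str) -> str:
    """NFKC-normalize, collapse runs of whitespace, and strip (step 1)."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


def keep_sentence(text: str) -> bool:
    """Return True if a sentence passes the length and ratio filters (step 3).

    Assumption: ratios are taken over all characters in the sentence.
    """
    n = len(text)
    if n < MIN_CHARS or n > MAX_CHARS:
        return False
    alpha = sum(ch.isalpha() for ch in text)
    digits = sum(ch.isdigit() for ch in text)
    if alpha == 0:
        return False
    if alpha / n < MIN_ALPHA_RATIO:
        return False
    if digits / n > MAX_DIGIT_RATIO:
        return False
    return True
```

Under these assumptions a short line such as "he nodded." is rejected by the 20-character minimum, which may explain part of the gap between raw and kept sentence counts.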
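
The two SHA-1 based steps (within-document exact dedup in step 5 and the first-5-sentences fingerprint in step 7) might look roughly like the sketch below. The function names are hypothetical, and two details are assumptions rather than facts from the description: sentences are joined with a newline before fingerprinting, and the first document seen with a given fingerprint is the one kept.

```python
import hashlib


def sha1_hex(text: str) -> str:
    """Hex SHA-1 digest of a UTF-8 encoded string."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def dedup_within_document(sentences):
    """Drop exact repeats inside one document, keeping first occurrences in order."""
    seen = set()
    kept = []
    for sentence in sentences:
        digest = sha1_hex(sentence)
        if digest not in seen:
            seen.add(digest)
            kept.append(sentence)
    return kept


def document_fingerprint(sentences, n=5):
    """Fingerprint a document by hashing its first n sentences.

    Assumption: sentences are joined with a newline before hashing.
    """
    return sha1_hex("\n".join(sentences[:n]))


def drop_duplicate_documents(docs):
    """Keep the first document seen for each fingerprint (assumed policy)."""
    seen = set()
    kept = []
    for doc in docs:
        fingerprint = document_fingerprint(doc)
        if fingerprint not in seen:
            seen.add(fingerprint)
            kept.append(doc)
    return kept
```

Because the fingerprint only covers the opening sentences, two uploads of the same book that diverge later in the text still collapse to one document, which matches the stated goal of catching same-book re-uploads.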
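
Since each row is a single sentence, document-level work needs the rows regrouped by `doc_id` and ordered by `sent_id`, as the schema describes. A minimal sketch of that regrouping over plain dict rows follows; the helper name `rows_to_documents` is hypothetical.

```python
from collections import defaultdict


def rows_to_documents(rows):
    """Group sentence rows into documents, ordered by sent_id.

    `rows` is any iterable of dicts with the keys doc_id, sent_id and text.
    Returns a dict mapping doc_id to the list of sentence texts in order.
    """
    by_doc = defaultdict(list)
    for row in rows:
        by_doc[row["doc_id"]].append((row["sent_id"], row["text"]))
    return {
        doc_id: [text for _, text in sorted(pairs, key=lambda pair: pair[0])]
        for doc_id, pairs in by_doc.items()
    }
```

This version holds every sentence in memory, so for the full 33.6M-row split it is probably more practical to process one document at a time or to work on a subset.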
Provider:
kd13