
bookscorpus_en

ModelScope Community. Updated 2025-10-26; indexed 2025-09-13.
Download link:
https://modelscope.cn/datasets/AlenglengLLM/bookscorpus_en
Resource description:
# Dataset Card for "wikipedia-bookscorpus-en-preprocessed"

## Dataset Summary

A preprocessed and normalized combination of the English Wikipedia and BookCorpus datasets, optimized for BERT pretraining. The dataset is chunked into segments of ~820 characters to accommodate typical transformer architectures.

## Dataset Details

- **Number of Examples:** 29.4 million
- **Download Size:** 12.2 GB
- **Dataset Size:** 19.0 GB

### Features:

```python
{
    'text': string,           # The preprocessed text chunk
    'is_filtered_out': bool   # Filtering flag for data quality
}
```

## Processing Pipeline

1. **Language Filtering:**
   - Retains only English language samples
   - Uses langdetect for language detection
2. **Text Chunking:**
   - Chunks of ~820 characters (targeting ~128 tokens)
   - Preserves sentence boundaries where possible
   - Splits on sentence endings (., !, ?) or spaces
3. **Normalization:**
   - Converts to lowercase
   - Removes accents and non-English characters
   - Filters out chunks < 200 characters
   - Removes special characters
4. **Data Organization:**
   - Shuffled for efficient training
   - Distributed across multiple JSONL files
   - No need for additional `dataset.shuffle()` during training

## Usage

```python
from datasets import load_dataset

dataset = load_dataset("shahrukhx01/wikipedia-bookscorpus-en-preprocessed")
```

## Preprocessing Details

For detailed information about the preprocessing pipeline, see the [preprocessing documentation](https://github.com/shahrukhx01/minions/tree/main/scripts/data/bert_pretraining_data/README.md).
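
The chunking and normalization steps described in the processing pipeline can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the author's actual implementation (see the linked preprocessing documentation for that). The function names `normalize_text` and `chunk_text`, the exact set of characters kept, and the rule for picking a split point are all hypothetical; only the ~820-character target and the 200-character minimum come from the card. Language filtering with langdetect is omitted here.

```python
import re
import unicodedata

TARGET_CHARS = 820  # approximate chunk size stated in the card
MIN_CHARS = 200     # chunks shorter than this are dropped, per the card


def normalize_text(text: str) -> str:
    """Lowercase, strip accents, and drop characters outside a small allowed set."""
    text = text.lower()
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Assumption: "special characters" means anything other than ASCII letters,
    # digits, whitespace, and a little basic punctuation.
    text = re.sub(r"[^a-z0-9\s.,!?'-]", "", text)
    # Collapse runs of whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()


def chunk_text(text: str, target: int = TARGET_CHARS, minimum: int = MIN_CHARS) -> list[str]:
    """Split text into chunks of at most `target` characters.

    Prefers to cut after a sentence ending (., !, ?), falls back to a space,
    and hard-cuts only when neither is available. Chunks shorter than
    `minimum` characters are discarded.
    """
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + target, len(text))
        if end < len(text):
            window = text[start:end]
            # Look for the last sentence ending at or beyond `minimum`,
            # so that an early cut does not produce a chunk that gets dropped.
            cut = max(window.rfind(mark, minimum) for mark in (". ", "! ", "? "))
            if cut != -1:
                end = start + cut + 1  # keep the punctuation mark in the chunk
            else:
                space = window.rfind(" ", minimum)
                if space > 0:
                    end = start + space
        chunk = text[start:end].strip()
        if len(chunk) >= minimum:
            chunks.append(chunk)
        start = end
    return chunks
```

Calling `chunk_text(normalize_text(raw_document))` on a raw document string would then yield a list of lowercase chunks of at most 820 characters each. Whether the original pipeline normalizes before or after chunking is not stated in the card, so the order used here is also an assumption.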

## Limitations

- Some tokens may be lost due to chunk truncation
- Very long sentences might be split
- Some contextual information across chunk boundaries is lost

## Citation

If you use this dataset, please cite:

```
@misc{wikipedia-bookscorpus-en-preprocessed,
  author = {Shahrukh Khan},
  title = {Preprocessed Wikipedia and BookCorpus Dataset for Language Model Training},
  year = {2024},
  publisher = {GitHub & Hugging Face},
  howpublished = {\url{https://huggingface.co/datasets/shahrukhx01/wikipedia-bookscorpus-en-preprocessed}}
}
```

Provider:
maas
Created:
2025-09-12