bookscorpus_en
Source: ModelScope community · Updated 2025-10-26 · Indexed 2025-09-13
Download link:
https://modelscope.cn/datasets/AlenglengLLM/bookscorpus_en
# Dataset Card for "wikipedia-bookscorpus-en-preprocessed"
## Dataset Summary
A preprocessed and normalized combination of the English Wikipedia and BookCorpus datasets, prepared for BERT pretraining. The text is split into chunks of roughly 820 characters, which targets about 128 tokens per chunk.
## Dataset Details
- **Number of Examples:** 29.4 million
- **Download Size:** 12.2 GB
- **Dataset Size:** 19.0 GB
### Features:
```python
from datasets import Features, Value

features = Features({
    "text": Value("string"),           # The preprocessed text chunk
    "is_filtered_out": Value("bool"),  # Filtering flag for data quality
})
```
## Processing Pipeline
1. **Language Filtering:**
   - Retains only English-language samples
   - Uses `langdetect` for language detection
2. **Text Chunking:**
   - Chunks of ~820 characters (targeting ~128 tokens)
   - Preserves sentence boundaries where possible
   - Splits on sentence endings (., !, ?) or spaces
3. **Normalization:**
   - Converts to lowercase
   - Removes accents and non-English characters
   - Filters out chunks < 200 characters
   - Removes special characters
4. **Data Organization:**
   - Shuffled for efficient training
   - Distributed across multiple JSONL files
   - No need for additional `dataset.shuffle()` during training
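
The chunking and normalization steps above can be sketched roughly as follows. This is an illustrative approximation using only the standard library, not the author's actual script: the function names (`chunk_text`, `normalize`, `preprocess`) are hypothetical, and the set of characters treated as "special" is an assumption. Only the 820-character target and the 200-character minimum come from the card.

```python
import re
import unicodedata
from typing import List

CHUNK_SIZE = 820  # target chunk length in characters (from the card)
MIN_CHARS = 200   # chunks shorter than this are dropped (from the card)


def chunk_text(text: str, size: int = CHUNK_SIZE) -> List[str]:
    """Split text into chunks of at most `size` characters."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + size, len(text))
        if end < len(text):
            window = text[start:end]
            # Prefer the last sentence ending in the window, then the last space.
            cut = max(window.rfind("."), window.rfind("!"), window.rfind("?"))
            if cut == -1:
                cut = window.rfind(" ")
            if cut > 0:
                end = start + cut + 1  # keep the delimiter in this chunk
            # Otherwise fall back to a hard cut at `size` characters.
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        start = end
    return chunks


def normalize(chunk: str) -> str:
    """Strip accents, lowercase, and drop characters outside a small allowed set."""
    decomposed = unicodedata.normalize("NFKD", chunk)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    lowered = no_accents.lower()
    # Assumption: "special characters" means anything other than
    # a-z, 0-9, whitespace, and basic punctuation.
    cleaned = re.sub(r"[^a-z0-9\s.,!?;:'()-]", "", lowered)
    return re.sub(r"\s+", " ", cleaned).strip()


def preprocess(text: str) -> List[str]:
    """Chunk, normalize, and drop chunks shorter than MIN_CHARS."""
    normalized = (normalize(c) for c in chunk_text(text))
    return [c for c in normalized if len(c) >= MIN_CHARS]


sample_text = "This is a sentence. " * 100
sample_chunks = chunk_text(sample_text)
sample_processed = preprocess(sample_text)
```

Cutting at the last sentence ending inside each window keeps chunks under the character budget while avoiding mid-sentence splits whenever a delimiter is available; the hard cut only applies to text with no sentence ending or space in the window.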
## Usage
```python
from datasets import load_dataset
dataset = load_dataset("shahrukhx01/wikipedia-bookscorpus-en-preprocessed")
```
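
Because the download is about 12 GB, it may be convenient to stream the dataset and skip flagged records while iterating. The sketch below makes two assumptions not confirmed by the card: that the split is named `train`, and that `is_filtered_out == True` marks a chunk that should be excluded. The helper names `keep_unfiltered` and `stream_dataset` are hypothetical.

```python
from typing import Dict, Iterable, Iterator


def keep_unfiltered(records: Iterable[Dict]) -> Iterator[str]:
    """Yield the text of every record whose filtering flag is not set."""
    for record in records:
        # Assumption: True means the chunk was flagged for removal.
        if not record["is_filtered_out"]:
            yield record["text"]


def stream_dataset(split: str = "train"):
    """Open the dataset lazily instead of downloading it in full first."""
    # Imported here so the helper above works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        "shahrukhx01/wikipedia-bookscorpus-en-preprocessed",
        split=split,  # assumption: the split is called "train"
        streaming=True,
    )


# Small in-memory stand-in for the real records, with the same two fields.
sample_records = [
    {"text": "a clean chunk", "is_filtered_out": False},
    {"text": "a flagged chunk", "is_filtered_out": True},
]
kept = list(keep_unfiltered(sample_records))
```

With network access, `keep_unfiltered(stream_dataset())` would iterate over the real records in the same way.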
## Preprocessing Details
For detailed information about the preprocessing pipeline, see the [preprocessing documentation](https://github.com/shahrukhx01/minions/tree/main/scripts/data/bert_pretraining_data/README.md).
## Limitations
- Some tokens may be lost due to chunk truncation
- Very long sentences might be split
- Some contextual information across chunk boundaries is lost
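
To illustrate the first limitation: a chunk can fit the ~820-character budget and still exceed the ~128-token target, in which case the extra tokens are cut at training time. The sketch below uses a whitespace split as a stand-in tokenizer, which is an assumption made for the sake of a dependency-free example; a WordPiece tokenizer typically produces more tokens per chunk, so these counts are a lower bound. The function name `tokens_lost` is hypothetical.

```python
MAX_TOKENS = 128  # target sequence length stated in the card


def tokens_lost(chunk: str, max_tokens: int = MAX_TOKENS) -> int:
    """Count tokens that would be dropped if the chunk were truncated."""
    # Stand-in tokenizer: whitespace split (undercounts relative to WordPiece).
    n_tokens = len(chunk.split())
    return max(0, n_tokens - max_tokens)


short_chunk = "word " * 100  # 500 characters, 100 tokens
dense_chunk = "a " * 400     # 800 characters, 400 very short tokens

lost_short = tokens_lost(short_chunk)
lost_dense = tokens_lost(dense_chunk)
```

Here `dense_chunk` is under 820 characters but would lose 272 of its 400 whitespace tokens at a 128-token limit, while `short_chunk` loses none.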
## Citation
If you use this dataset, please cite:
```bibtex
@misc{wikipedia-bookscorpus-en-preprocessed,
  author = {Shahrukh Khan},
  title = {Preprocessed Wikipedia and BookCorpus Dataset for Language Model Training},
  year = {2024},
  publisher = {GitHub \& Hugging Face},
  howpublished = {\url{https://huggingface.co/datasets/shahrukhx01/wikipedia-bookscorpus-en-preprocessed}}
}
```
Provider: maas
Created: 2025-09-12



