JulianKrgd/Wikipedia_EN_6M
Source: Hugging Face · Updated 2025-12-09 · Indexed 2025-12-20
Download link: https://hf-mirror.com/datasets/JulianKrgd/Wikipedia_EN_6M
Description:
---
language:
- en
license: cc-by-sa-4.0
size_categories:
- 1M<n<10M
task_categories:
- text-generation
- feature-extraction
tags:
- wikipedia
- english
- pretrain
- llm
- scraped
pretty_name: Wikipedia EN
dataset_info:
  features:
  - name: id
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  - name: url
    dtype: string
  - name: source
    dtype: string
  - name: language
    dtype: string
  - name: char_count
    dtype: int64
  - name: word_count
    dtype: int64
  - name: scraped_at
    dtype: string
---
# Wikipedia EN Dataset
English Wikipedia articles scraped and cleaned for LLM pre-training.
## Dataset Description
| Property | Value |
|----------|-------|
| **Articles** | ~6,000,000 |
| **Language** | English |
| **Size** | ~25 GB |
| **Format** | JSONL |
| **Source** | Wikipedia (scraped December 2024) |
| **License** | CC BY-SA 4.0 |
## Scraping Method
This dataset was created by scraping and processing the English Wikipedia:
1. **Data Source**: Official Wikipedia XML dump
2. **Processing**: Custom Python scraper with multiprocessing (12 workers)
3. **Cleaning**: Wikitext markup removal (templates, references, HTML, categories)
4. **Filtering**: Removed redirects, short articles (< 200 characters or < 50 words), empty pages, and non-article pages
### Scraping Pipeline
```
Wikipedia XML → Streaming Parser → Wikitext Cleaner → JSONL Output
                  │                  │
                  └── lxml (fast)    └── Regex-based (compiled)
                                     └── Multiprocessing (12 cores)
```
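The streaming-parser stage of the pipeline above could be sketched as follows. This is a minimal illustration, not the author's scraper: it swaps in the standard library's `xml.etree.ElementTree.iterparse` for lxml so that it runs without extra dependencies, and `iter_pages`, `local_name`, and the `SAMPLE` document are hypothetical names introduced here.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that the XML export puts on every tag."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(xml_file):
    """Yield (title, namespace, is_redirect, wikitext) for each <page> element."""
    for _event, elem in ET.iterparse(xml_file, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title, ns, is_redirect, text = "", "0", False, ""
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text or ""
            elif name == "ns":
                ns = child.text or "0"
            elif name == "redirect":
                is_redirect = True
            elif name == "text":
                text = child.text or ""
        yield title, ns, is_redirect, text
        elem.clear()  # drop the page's children so memory use stays bounded


# Tiny stand-in for a real dump (illustrative content only).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/">
  <page>
    <title>Example</title>
    <ns>0</ns>
    <revision><text>'''Example''' text.</text></revision>
  </page>
  <page>
    <title>Old name</title>
    <ns>0</ns>
    <redirect title="Example" />
    <revision><text>#REDIRECT [[Example]]</text></revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
for title, ns, is_redirect, _text in pages:
    print(title, ns, is_redirect)
```

Because pages are yielded one at a time, the same generator could feed a worker pool for the cleaning stage; how the original scraper distributes work across its 12 workers is not described in this card.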
### Performance
- **Speed**: ~1,500 articles/second
- **Total Time**: ~1 hour for 6M articles
- **Hardware**: Apple M4 Pro (12 cores)
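As a quick sanity check, the reported speed and total time are roughly consistent with each other. The arithmetic below uses only the approximate figures listed above; the per-worker number assumes work is split evenly across 12 workers, which the card does not state.

```python
ARTICLES = 6_000_000         # approximate article count from this card
ARTICLES_PER_SECOND = 1_500  # approximate throughput from this card
WORKERS = 12                 # worker count from this card

seconds = ARTICLES / ARTICLES_PER_SECOND
minutes = seconds / 60
per_worker = ARTICLES_PER_SECOND / WORKERS  # assumes an even split

print(f"{seconds:.0f} s, about {minutes:.0f} min")  # prints: 4000 s, about 67 min
print(f"{per_worker:.0f} articles/s per worker")    # prints: 125 articles/s per worker
```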
## Data Format
Each line is a JSON object:
```json
{
"id": "wikipedia_en_12345",
"title": "Artificial intelligence",
"text": "Artificial intelligence (AI) is the intelligence of machines...",
"url": "https://en.wikipedia.org/wiki/Artificial_intelligence",
"source": "wikipedia_en",
"language": "en",
"char_count": 45230,
"word_count": 7523,
"scraped_at": "2024-12-08T22:15:00"
}
```
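A record can be checked against the field list in the metadata before use. The helper below is a minimal sketch: `EXPECTED_TYPES` mirrors the declared features (with `int64` mapped to Python `int`), while `schema_problems` and `SAMPLE_LINE` are hypothetical names introduced for illustration.

```python
import json

# Field names and types as declared in the dataset metadata.
EXPECTED_TYPES = {
    "id": str,
    "title": str,
    "text": str,
    "url": str,
    "source": str,
    "language": str,
    "char_count": int,
    "word_count": int,
    "scraped_at": str,
}


def schema_problems(record):
    """Return a list of human-readable problems; an empty list means the record is fine."""
    problems = []
    for field, expected in EXPECTED_TYPES.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif type(record[field]) is not expected:
            actual = type(record[field]).__name__
            problems.append(f"{field}: expected {expected.__name__}, got {actual}")
    return problems


# One JSONL line built from the example record shown above.
SAMPLE_LINE = json.dumps({
    "id": "wikipedia_en_12345",
    "title": "Artificial intelligence",
    "text": "Artificial intelligence (AI) is the intelligence of machines...",
    "url": "https://en.wikipedia.org/wiki/Artificial_intelligence",
    "source": "wikipedia_en",
    "language": "en",
    "char_count": 45230,
    "word_count": 7523,
    "scraped_at": "2024-12-08T22:15:00",
})

print(schema_problems(json.loads(SAMPLE_LINE)))  # prints: []
```

The sketch checks types only. It does not verify that `char_count` or `word_count` match the text, since the card does not say exactly how those counts were computed.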
## Usage
### With Hugging Face Datasets
```python
from datasets import load_dataset

# Repository ID as listed in this dataset's download link
dataset = load_dataset("JulianKrgd/Wikipedia_EN_6M")
```
### Direct JSONL Loading
```python
import json

# Read the dump line by line; each line holds one JSON-encoded article.
with open("wikipedia_en_dump.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        article = json.loads(line)
        print(article["title"], "-", article["word_count"], "words")
```
### Streaming (Memory Efficient)
```python
import json

def stream_articles(filepath):
    """Yield one parsed article at a time without loading the whole file."""
    with open(filepath, "r", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

for article in stream_articles("wikipedia_en_dump.jsonl"):
    # Process one article at a time
    pass
```
## Filtering Applied
| Filter | Removed |
|--------|---------|
| Redirects | ~8M pages |
| Namespace pages | Talk, User, Wikipedia, File, Template, etc. |
| Short articles | < 200 characters or < 50 words |
| Empty pages | No text content |
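The filters in the table could be expressed as a single predicate. This is a sketch under stated assumptions rather than the original code: it assumes namespace `"0"` identifies main-namespace articles (as in MediaWiki dumps), that words are counted by splitting on whitespace, and that the thresholds apply to the text passed in. `keep_article`, `MIN_CHARS`, and `MIN_WORDS` are hypothetical names.

```python
MIN_CHARS = 200  # threshold from the table above
MIN_WORDS = 50   # threshold from the table above


def keep_article(ns, is_redirect, text):
    """Return True if a page passes the filters listed in the table."""
    if ns != "0":       # Talk, User, Wikipedia, File, Template, etc.
        return False
    if is_redirect:     # redirect pages
        return False
    stripped = text.strip()
    if not stripped:    # empty pages
        return False
    # Short articles: too few characters or too few whitespace-separated words.
    if len(stripped) < MIN_CHARS or len(stripped.split()) < MIN_WORDS:
        return False
    return True


long_text = "word " * 60  # 60 words, 299 characters after stripping
print(keep_article("0", False, long_text))     # prints: True
print(keep_article("1", False, long_text))     # prints: False
print(keep_article("0", False, "Too short."))  # prints: False
```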
## Wikitext Cleaning
The following markup was removed:
- `{{templates}}` - Infoboxes, citations, etc.
- `[[Category:...]]` - Category links
- `[[File:...]]` - Image references
- `<ref>...</ref>` - References and footnotes
- `'''bold'''` / `''italic''` - Formatting
- `== Headings ==` - Section headers
- `* bullets` - List markers
- `{| tables |}` - Wiki tables
- `<!-- comments -->` - HTML comments
- `[http://...]` - External links
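A compiled-regex cleaner covering the markup listed above might look like the sketch below. The patterns are illustrative assumptions, not the dataset's actual rules: this version keeps heading text and external-link labels while dropping the markup around them, strips nested templates by repeatedly removing the innermost one, and does not handle `[[File:...]]` captions that contain nested links. `clean_wikitext` and all pattern names are hypothetical.

```python
import re

COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
REF = re.compile(r"<ref[^>/]*>.*?</ref>|<ref[^>]*/>", re.DOTALL | re.IGNORECASE)
TEMPLATE = re.compile(r"\{\{[^{}]*\}\}")  # matches innermost templates only
TABLE = re.compile(r"\{\|.*?\|\}", re.DOTALL)
FILE_OR_CATEGORY = re.compile(r"\[\[(?:File|Image|Category):[^\[\]]*\]\]", re.IGNORECASE)
EXTERNAL_LINK = re.compile(r"\[https?://[^\s\]]+(?: ([^\]]*))?\]")
HTML_TAG = re.compile(r"<[^>]+>")
BOLD_ITALIC = re.compile(r"'{2,}")
HEADING = re.compile(r"^=+[ \t]*(.*?)[ \t]*=+[ \t]*$", re.MULTILINE)
BULLET = re.compile(r"^[*#:;]+[ \t]*", re.MULTILINE)


def clean_wikitext(text):
    """Strip the markup listed above and return plain text."""
    text = COMMENT.sub("", text)
    text = REF.sub("", text)
    # Templates can nest, so remove innermost ones until nothing changes.
    previous = None
    while previous != text:
        previous = text
        text = TEMPLATE.sub("", text)
    text = TABLE.sub("", text)
    text = FILE_OR_CATEGORY.sub("", text)
    text = EXTERNAL_LINK.sub(lambda m: m.group(1) or "", text)  # keep the label, if any
    text = HTML_TAG.sub("", text)
    text = BOLD_ITALIC.sub("", text)
    text = HEADING.sub(r"\1", text)  # keep heading text, drop the '=' markers
    text = BULLET.sub("", text)
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


SAMPLE = """{{Infobox|name={{lang|en|Cat}}}}
'''Cats''' are ''small'' mammals.<ref name="a">Smith 2020</ref><!-- check -->
== Behaviour ==
* They purr [http://example.org source].
[[Category:Felines]]
"""

cleaned = clean_wikitext(SAMPLE)
print(cleaned)
```

Regex-based cleaning is fast but approximate. A full wikitext parser would handle edge cases such as deeply nested constructs more reliably, at the cost of throughput.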
## Intended Use
- LLM pre-training
- English NLP research
- Text generation fine-tuning
- Knowledge base construction
- Semantic search / embeddings
## Statistics
| Metric | Value |
|--------|-------|
| Total Articles | ~6,000,000 |
| Avg Words/Article | ~600 |
| Avg Chars/Article | ~4,000 |
| Total Words | ~3.6 billion |
| Total Tokens (est.) | ~4.5 billion |
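The totals in the table follow from the per-article averages (6,000,000 × 600 ≈ 3.6 billion words), and the token estimate implies about 1.25 tokens per word. The sketch below recomputes the same kind of summary from JSONL lines in a single streaming pass; `summarize` and the two-record sample are hypothetical, and the 1.25 ratio is taken from the table rather than from any tokenizer.

```python
import io
import json

TOKENS_PER_WORD = 1.25  # ratio implied by the table: ~4.5B tokens / ~3.6B words


def summarize(lines):
    """Aggregate counts from an iterable of JSONL lines without storing the records."""
    articles = words = chars = 0
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        articles += 1
        words += record["word_count"]
        chars += record["char_count"]
    return {
        "articles": articles,
        "total_words": words,
        "avg_words": words / articles if articles else 0.0,
        "avg_chars": chars / articles if articles else 0.0,
        "est_tokens": round(words * TOKENS_PER_WORD),
    }


# Two made-up records standing in for the real file.
sample_file = io.StringIO(
    '{"word_count": 600, "char_count": 4000}\n'
    '{"word_count": 200, "char_count": 1000}\n'
)
stats = summarize(sample_file)
print(stats)
```

For the real dataset, pass an open file handle instead of the `io.StringIO` sample, for example `summarize(open("wikipedia_en_dump.jsonl", encoding="utf-8"))`.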
## Citation
```bibtex
@dataset{wikipedia_en_2024,
title={Wikipedia EN Dataset},
author={JulianKrgd},
year={2024},
url={https://huggingface.co/datasets/JulianKrgd/Wikipedia-En}
}
```
## Acknowledgments
Data sourced from [Wikipedia](https://en.wikipedia.org/) under the CC BY-SA 4.0 license.
Provided by: JulianKrgd



