Tralalabs/simple-english-wikipedia
Hugging Face · Updated 2026-04-03 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/Tralalabs/simple-english-wikipedia
Resource description:
# 📚 SimpleWiki Parquet Dataset
A clean, structured dataset derived from the **Simple English Wikipedia** dump, converted into **Parquet format** for efficient machine learning and data processing workflows.
---
## 🧠 Overview
This dataset contains Wikipedia articles extracted from the official Wikimedia dump and processed into a clean, training-ready format.
- **Source:** https://dumps.wikimedia.org/simplewiki/latest/
- **Format:** Parquet
- **Language:** English (Simple English)
- **Size:** ~200k–300k articles
- **Structure:** `id`, `title`, `text`
---
## 📦 Dataset Structure
| Column | Type | Description |
|--------|--------|-----------------------|
| id | int64 | Unique page ID |
| title | string | Article title |
| text | string | Full article text |
---
## ⚙️ Processing Pipeline
The dataset was created using the following steps:
1. Download the official Wikipedia dump (`pages-articles.xml.bz2`)
2. Stream-parse the XML (no full memory load)
3. Extract:
   - Main namespace (articles only)
   - Latest revision only
4. Clean:
   - Remove empty pages
   - Filter short/low-quality text
5. Convert to Parquet using `pyarrow`
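
Steps 2 to 4 above can be sketched roughly as follows. This is a minimal illustration, not the pipeline that produced the dataset: it assumes the standard MediaWiki export layout (`<page>` elements with `<title>`, `<ns>`, `<id>`, and `<revision>`/`<text>` children) and uses the standard library's `xml.etree.ElementTree.iterparse` in place of `mwxml`. The names `iter_articles`, `write_parquet`, and the `min_chars` threshold are hypothetical; the actual short-text filter is not documented here. For a real dump, the stream would typically be `bz2.open(path, "rb")`. The optional `write_parquet` helper covers step 5 and needs `pyarrow` installed.

```python
import xml.etree.ElementTree as ET

MIN_CHARS = 200  # hypothetical threshold for "short" text


def _local(tag):
    """Strip the XML namespace prefix, e.g. '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_articles(stream, min_chars=MIN_CHARS):
    """Yield {'id', 'title', 'text'} dicts for main-namespace pages."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if _local(elem.tag) != "page":
            continue
        title = ns = page_id = None
        text = ""
        for child in elem:
            name = _local(child.tag)
            if name == "title":
                title = child.text
            elif name == "ns":
                ns = child.text
            elif name == "id":
                page_id = int(child.text)
            elif name == "revision":
                # If several revisions are present, the last one wins.
                for sub in child:
                    if _local(sub.tag) == "text":
                        text = sub.text or ""
        elem.clear()  # drop the page's children once they have been read
        # Namespace 0 is the article namespace; skip empty or short pages.
        if ns == "0" and len(text.strip()) >= min_chars:
            yield {"id": page_id, "title": title, "text": text}


def write_parquet(rows, path):
    """Write rows to a Parquet file (requires pyarrow)."""
    import pyarrow as pa
    import pyarrow.parquet as pq

    schema = pa.schema(
        [("id", pa.int64()), ("title", pa.string()), ("text", pa.string())]
    )
    # Materializes all rows in memory before writing.
    table = pa.Table.from_pylist(list(rows), schema=schema)
    pq.write_table(table, path)
```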
---
## 🚀 Usage
### Load with 🤗 Datasets
```python
from datasets import load_dataset
# Repository ID as listed for this dataset
dataset = load_dataset("Tralalabs/simple-english-wikipedia")
print(dataset["train"][0])  # first record, with id, title, and text fields
```
### Load with Pandas
```python
import pandas as pd
# Replace "wiki.parquet" with the path to a locally downloaded Parquet file
df = pd.read_parquet("wiki.parquet")
print(df.head())
```
---
## 📜 License
This dataset is derived from Wikipedia content and is licensed under:
👉 **Creative Commons Attribution-ShareAlike 4.0 (CC BY-SA 4.0)**
* You may share, modify, and use commercially
* You **must provide attribution**
* You **must use the same license for derivatives**
This is required because Wikipedia content uses a copyleft license that enforces share-alike redistribution.
---
## 🧾 Attribution
This dataset includes content from:
* **Wikipedia** — [https://www.wikipedia.org](https://www.wikipedia.org)
* **Authors:** Wikipedia contributors
**Proper attribution example:**
> Content sourced from Wikipedia, licensed under CC BY-SA 4.0.
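
When redistributing individual articles, a per-article attribution line can be more useful than a blanket statement. The sketch below builds one from a row's `title`. It assumes the title matches the live page title on `simple.wikipedia.org` and that page URLs follow the usual MediaWiki pattern (spaces replaced by underscores); the helper name `attribution_line` is hypothetical.

```python
from urllib.parse import quote

BASE_URL = "https://simple.wikipedia.org/wiki/"


def attribution_line(title):
    """Build an attribution sentence that links back to the source article."""
    # MediaWiki page URLs use underscores in place of spaces.
    url = BASE_URL + quote(title.replace(" ", "_"))
    return (
        f'Content sourced from the Wikipedia article "{title}" ({url}), '
        "licensed under CC BY-SA 4.0."
    )
```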
---
## ⚠️ Notes
* This dataset does **not include full edit history**
* Only **latest revisions** are included
* Some articles may still contain **markup or formatting artifacts**
* Text quality depends on the underlying Wikipedia articles
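
If leftover markup matters for a downstream task, a light regex pass can remove the most common wikitext constructs. This is a rough sketch that assumes the artifacts are ordinary wikitext (templates, internal links, bold/italic quotes, `<ref>` tags); it is not a full parser and will miss tables, files, and other syntax. The function name `strip_markup` is hypothetical.

```python
import re

_REF = re.compile(r"<ref[^>]*/>|<ref[^>]*>.*?</ref>", re.DOTALL)
_TEMPLATE = re.compile(r"\{\{[^{}]*\}\}")
_LINK = re.compile(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]")
_QUOTES = re.compile(r"'{2,}")


def strip_markup(text):
    """Remove common wikitext constructs, keeping the visible text."""
    text = _REF.sub("", text)
    # Repeat so nested templates are removed from the inside out.
    previous = None
    while previous != text:
        previous = text
        text = _TEMPLATE.sub("", text)
    # [[target|label]] -> label, [[target]] -> target
    text = _LINK.sub(r"\1", text)
    # Drop bold/italic quote runs ('' and ''').
    text = _QUOTES.sub("", text)
    # Collapse runs of spaces left behind by the removals.
    return re.sub(r"[ \t]{2,}", " ", text).strip()
```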
---
## 🔥 Use Cases
* LLM pretraining
* Text embeddings
* Semantic search
* NLP research
* Dataset experimentation (small-scale)
---
## 💀 Disclaimer
This dataset is provided **as-is**. Users are responsible for ensuring compliance with the **CC BY-SA 4.0** license when redistributing or using derived works.
---
## 🧠 Credits
* Wikimedia Foundation
* Wikipedia contributors
* Open-source tools: `mwxml`, `pyarrow`, `pandas`
Provider:
Tralalabs



