
Tralalabs/simple-english-wikipedia

Source: Hugging Face · Updated 2026-04-03 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/Tralalabs/simple-english-wikipedia
Resource description:
# 📚 SimpleWiki Parquet Dataset

A clean, structured dataset derived from the **Simple English Wikipedia** dump, converted into **Parquet format** for efficient machine learning and data processing workflows.

---

## 🧠 Overview

This dataset contains Wikipedia articles extracted from the official Wikimedia dump and processed into a clean, training-ready format.

- **Source:** https://dumps.wikimedia.org/simplewiki/latest/
- **Format:** Parquet
- **Language:** English (Simple English)
- **Size:** ~200k–300k articles
- **Structure:** `id`, `title`, `text`

---

## 📦 Dataset Structure

| Column | Type   | Description       |
|--------|--------|-------------------|
| id     | int64  | Unique page ID    |
| title  | string | Article title     |
| text   | string | Full article text |

---

## ⚙️ Processing Pipeline

The dataset was created using the following steps:

1. Download the official Wikipedia dump (`pages-articles.xml.bz2`)
2. Stream-parse the XML (without loading the full file into memory)
3. Extract:
   - Main namespace (articles only)
   - Latest revision only
4. Clean:
   - Remove empty pages
   - Filter short/low-quality text
5. Convert to Parquet using `pyarrow`

---

## 🚀 Usage

### Load with 🤗 Datasets

```python
from datasets import load_dataset

# Repository ID as listed on this page
dataset = load_dataset("Tralalabs/simple-english-wikipedia")
print(dataset["train"][0])  # first article record
```

### Load with Pandas

```python
import pandas as pd

# Read a locally downloaded Parquet file
df = pd.read_parquet("wiki.parquet")
print(df.head())
```

---

## 📜 License

This dataset is derived from Wikipedia content and is licensed under:

👉 **Creative Commons Attribution-ShareAlike 4.0 (CC BY-SA 4.0)**

* You may share, modify, and use commercially
* You **must provide attribution**
* You **must use the same license for derivatives**

This is required because Wikipedia content uses a copyleft license that enforces share-alike redistribution.

---

## 🧾 Attribution

This dataset includes content from:

* **Wikipedia:** [https://www.wikipedia.org](https://www.wikipedia.org)
* **Authors:** Wikipedia contributors

**Proper attribution example:**

> Content sourced from Wikipedia, licensed under CC BY-SA 4.0.

---

## ⚠️ Notes

* This dataset does **not include full edit history**
* Only **latest revisions** are included
* Some articles may still contain **markup or formatting artifacts**
* Dataset quality depends on Wikipedia content

---

## 🔥 Use Cases

* LLM pretraining
* Text embeddings
* Semantic search
* NLP research
* Dataset experimentation (small-scale)

---

## 💀 Disclaimer

This dataset is provided **as-is**. Users are responsible for ensuring compliance with the **CC BY-SA 4.0** license when redistributing or using derived works.

---

## 🧠 Credits

* Wikimedia Foundation
* Wikipedia contributors
* Open-source tools: `mwxml`, `pyarrow`, `pandas`
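
---

## 🛠️ Pipeline Sketch (Illustrative)

Steps 2 to 5 of the processing pipeline described above can be approximated with the short sketch below. It is not the dataset author's actual script: the credits list `mwxml` as the parser, whereas this sketch swaps in the standard library's `xml.etree.ElementTree.iterparse` so that the parsing part runs without extra dependencies. The function names `iter_articles` and `write_parquet` and the `MIN_CHARS` threshold are hypothetical; the source does not state the real cutoff for "short" text.

```python
import xml.etree.ElementTree as ET

MIN_CHARS = 200  # hypothetical cutoff; the real filter threshold is not documented


def local(tag):
    """Return an XML tag name without its '{namespace}' prefix."""
    return tag.rsplit("}", 1)[-1]


def iter_articles(xml_file, min_chars=MIN_CHARS):
    """Stream <page> elements and yield {id, title, text} for main-namespace pages."""
    for _event, elem in ET.iterparse(xml_file, events=("end",)):
        if local(elem.tag) != "page":
            continue
        # Direct children of <page>: title, ns, id, revision.
        # If a page had several revisions, the last one in file order wins.
        fields = {local(child.tag): child for child in elem}
        ns = fields["ns"].text if "ns" in fields else None
        revision = fields.get("revision")
        text = None
        if revision is not None:
            text = next((c.text for c in revision if local(c.tag) == "text"), None)
        record = None
        # Keep articles only (ns 0) and drop empty or very short pages.
        if ns == "0" and text and len(text.strip()) >= min_chars:
            record = {
                "id": int(fields["id"].text),
                "title": fields["title"].text,
                "text": text,
            }
        elem.clear()  # drop the page's children so parsed pages do not pile up
        if record is not None:
            yield record


def write_parquet(records, path):
    """Write records to a Parquet file; needs pyarrow, imported only when called."""
    import pyarrow as pa
    import pyarrow.parquet as pq

    schema = pa.schema(
        [("id", pa.int64()), ("title", pa.string()), ("text", pa.string())]
    )
    # Materializes all rows at once; a batched writer would suit larger dumps.
    table = pa.Table.from_pylist(list(records), schema=schema)
    pq.write_table(table, path)
```

To try it on a real dump, one option is to open the compressed file with `bz2.open(path, "rb")` and pass the resulting file object to `iter_articles`, then hand the generator to `write_parquet`. The output would follow the `id`, `title`, `text` schema from the table above, though the resulting row count may differ from the published dataset because the filtering threshold here is a guess.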
Provider:
Tralalabs