JulianKrgd/Wikipedia_EN_6M

Name: JulianKrgd/Wikipedia_EN_6M
Creator: JulianKrgd
Published: 2025-12-09 19:49:43
License: no description provided

Source: Hugging Face. Updated 2025-12-09; indexed 2025-12-20.

Download link:

https://hf-mirror.com/datasets/JulianKrgd/Wikipedia_EN_6M

Description:

---
language:
- en
license: cc-by-sa-4.0
size_categories:
- 1M<n<10M
task_categories:
- text-generation
- feature-extraction
tags:
- wikipedia
- english
- pretrain
- llm
- scraped
pretty_name: Wikipedia EN
dataset_info:
  features:
  - name: id
    dtype: string
  - name: title
    dtype: string
  - name: text
    dtype: string
  - name: url
    dtype: string
  - name: source
    dtype: string
  - name: language
    dtype: string
  - name: char_count
    dtype: int64
  - name: word_count
    dtype: int64
  - name: scraped_at
    dtype: string
---

# Wikipedia EN Dataset

English Wikipedia articles scraped and cleaned for LLM pre-training.

## Dataset Description

| Property | Value |
|----------|-------|
| **Articles** | ~6,000,000 |
| **Language** | English |
| **Size** | ~25 GB |
| **Format** | JSONL |
| **Source** | Wikipedia (scraped December 2024) |
| **License** | CC BY-SA 4.0 |

## Scraping Method

This dataset was created by parsing and cleaning the English Wikipedia:

1. **Data Source**: Official Wikipedia XML dump
2. **Processing**: Custom Python scraper with multiprocessing (12 workers)
3. **Cleaning**: Wikitext markup removal (templates, references, HTML, categories)
4. **Filtering**: Removed redirects, stubs (<50 words), and non-article pages

### Scraping Pipeline

```
Wikipedia XML → Streaming Parser → Wikitext Cleaner → JSONL Output
                      │                   │
                      └── lxml (fast)     └── Regex-based (compiled)
                                          └── Multiprocessing (12 cores)
```

### Performance

- **Speed**: ~1,500 articles/second
- **Total Time**: ~1 hour for 6M articles
- **Hardware**: Apple M4 Pro (12 cores)

## Data Format

Each line is a JSON object:

```json
{
  "id": "wikipedia_en_12345",
  "title": "Artificial intelligence",
  "text": "Artificial intelligence (AI) is the intelligence of machines...",
  "url": "https://en.wikipedia.org/wiki/Artificial_intelligence",
  "source": "wikipedia_en",
  "language": "en",
  "char_count": 45230,
  "word_count": 7523,
  "scraped_at": "2024-12-08T22:15:00"
}
```

## Usage

### With Hugging Face Datasets

```python
from datasets import load_dataset

# Repository ID as shown on this listing page
dataset = load_dataset("JulianKrgd/Wikipedia_EN_6M")
```

### Direct JSONL Loading

```python
import json

# Read the dump line by line; each line is one JSON-encoded article
with open("wikipedia_en_dump.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        article = json.loads(line)
        print(article["title"], "-", article["word_count"], "words")
```

### Streaming (Memory Efficient)

```python
import json

def stream_articles(filepath):
    """Yield one parsed article at a time without loading the whole file."""
    with open(filepath, "r", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

for article in stream_articles("wikipedia_en_dump.jsonl"):
    # Process one article at a time
    pass
```

## Filtering Applied

| Filter | Removed |
|--------|---------|
| Redirects | ~8M pages |
| Namespace pages | Talk, User, Wikipedia, File, Template, etc. |
| Short articles | < 200 characters or < 50 words |
| Empty pages | No text content |

## Wikitext Cleaning

The following markup was removed:

- `{{templates}}` - Infoboxes, citations, etc.
- `[[Category:...]]` - Category links
- `[[File:...]]` - Image references
- `<ref>...</ref>` - References and footnotes
- `'''bold'''` / `''italic''` - Formatting
- `== Headings ==` - Section headers
- `* bullets` - List markers
- `{| tables |}` - Wiki tables
- `<!-- ... -->` - HTML comments
- External links `[http://...]`

## Intended Use

- LLM pre-training
- English NLP research
- Text generation fine-tuning
- Knowledge base construction
- Semantic search / embeddings

## Statistics

| Metric | Value |
|--------|-------|
| Total Articles | ~6,000,000 |
| Avg Words/Article | ~600 |
| Avg Chars/Article | ~4,000 |
| Total Words | ~3.6 billion |
| Total Tokens (est.) | ~4.5 billion |

## Citation

```bibtex
@dataset{wikipedia_en_2024,
  title={Wikipedia EN Dataset},
  author={JulianKrgd},
  year={2024},
  url={https://huggingface.co/datasets/JulianKrgd/Wikipedia-En}
}
```

## Acknowledgments

Data sourced from [Wikipedia](https://en.wikipedia.org/) under the CC BY-SA 4.0 license.
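
## Example: Wikitext Cleaning (Sketch)

The card describes the cleaner only as "regex-based (compiled)" and lists the markup families it removes. The following is a minimal sketch of how such a cleaner might look; it is not the author's code. The pattern names, the order of the passes, and the choice to keep heading and link text while dropping only the markers are all assumptions made for illustration.

```python
import re

# Hypothetical patterns, compiled once. Each targets one markup family
# from the "Wikitext Cleaning" list; none are taken from the original scraper.
_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
_REF = re.compile(r"<ref[^>/]*>.*?</ref>|<ref[^>]*/>", re.DOTALL | re.IGNORECASE)
_TEMPLATE = re.compile(r"\{\{[^{}]*\}\}")  # innermost templates only
_TABLE = re.compile(r"\{\|.*?\|\}", re.DOTALL)
_INTERNAL_LINK = re.compile(
    r"\[\[(?!(?:File|Image|Category):)(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]",
    re.IGNORECASE,
)
_FILE_OR_CATEGORY = re.compile(
    r"\[\[(?:File|Image|Category):[^\[\]]*\]\]", re.IGNORECASE
)
_EXTERNAL_LINK = re.compile(r"\[https?://[^\s\]]+(?:\s+([^\]]*))?\]")
_HEADING = re.compile(r"^=+[ \t]*(.*?)[ \t]*=+[ \t]*$", re.MULTILINE)
_BULLET = re.compile(r"^[*#:;]+[ \t]*", re.MULTILINE)
_QUOTES = re.compile(r"'{2,}")
_HTML_TAG = re.compile(r"</?[a-zA-Z][^>]*>")


def clean_wikitext(text: str) -> str:
    """Strip common wikitext markup and return plain text (illustrative only)."""
    text = _COMMENT.sub("", text)
    text = _REF.sub("", text)
    # Templates can nest, so remove innermost ones until nothing changes.
    previous = None
    while previous != text:
        previous = text
        text = _TEMPLATE.sub("", text)
    text = _TABLE.sub("", text)
    # Resolve ordinary links first so file captions no longer contain brackets.
    text = _INTERNAL_LINK.sub(r"\1", text)
    text = _FILE_OR_CATEGORY.sub("", text)
    # Keep the label of an external link, drop the URL.
    text = _EXTERNAL_LINK.sub(lambda m: m.group(1) or "", text)
    text = _HEADING.sub(r"\1", text)   # assumption: keep heading text
    text = _BULLET.sub("", text)
    text = _QUOTES.sub("", text)
    text = _HTML_TAG.sub("", text)
    # Tidy whitespace left behind by the removals.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


sample = (
    "{{Infobox|name={{small|AI}}}}\n"
    "'''Artificial intelligence''' ('''AI''') is studied in "
    "[[computer science|CS]].<ref>Smith 2020</ref>\n"
    "== History ==\n"
    "* See [[Alan Turing]] and [http://example.org the site].\n"
    "<!-- hidden note -->[[Category:Artificial intelligence]]"
)
print(clean_wikitext(sample))
```

Regexes cannot parse arbitrarily nested wikitext, so a sketch like this will leave residue on unusual pages; a dedicated wikitext parser would be more robust at the cost of speed.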
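
## Example: Filtering and Record Construction (Sketch)

The thresholds in the "Filtering Applied" table and the field layout in "Data Format" can be combined into a small sketch. This is an illustration under stated assumptions, not the original pipeline: `keep_page` and `make_record` are hypothetical helpers, namespace detection by title prefix is a simplification, and computing `char_count` with `len()` and `word_count` by whitespace splitting is a guess at how the published counts were derived.

```python
import json
from urllib.parse import quote

# Thresholds taken from the "Filtering Applied" table.
MIN_CHARS = 200
MIN_WORDS = 50
# Assumed prefix check; the card lists these namespaces followed by "etc.".
NON_ARTICLE_PREFIXES = ("Talk:", "User:", "Wikipedia:", "File:", "Template:")


def keep_page(title: str, raw_wikitext: str, cleaned_text: str) -> bool:
    """Return True if a page passes the filters described in the card."""
    if raw_wikitext.lstrip().upper().startswith("#REDIRECT"):
        return False  # redirect page
    if title.startswith(NON_ARTICLE_PREFIXES):
        return False  # non-article namespace
    if not cleaned_text.strip():
        return False  # empty page
    if len(cleaned_text) < MIN_CHARS or len(cleaned_text.split()) < MIN_WORDS:
        return False  # short article
    return True


def make_record(page_id: int, title: str, cleaned_text: str, scraped_at: str) -> dict:
    """Build one JSONL record with the fields shown under "Data Format"."""
    return {
        "id": f"wikipedia_en_{page_id}",  # assumed ID scheme
        "title": title,
        "text": cleaned_text,
        "url": "https://en.wikipedia.org/wiki/" + quote(title.replace(" ", "_")),
        "source": "wikipedia_en",
        "language": "en",
        "char_count": len(cleaned_text),
        "word_count": len(cleaned_text.split()),
        "scraped_at": scraped_at,
    }


# Usage: a synthetic 60-word body passes the filters and serializes to one line.
body = ("word " * 60).strip()
if keep_page("Artificial intelligence", body, body):
    record = make_record(12345, "Artificial intelligence", body, "2024-12-08T22:15:00")
    print(json.dumps(record, ensure_ascii=False)[:80])
```

When reading the official XML dump directly, the `<ns>` element of each page (0 for articles) is a more reliable namespace signal than the title prefix used here.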

Provided by:

JulianKrgd
