
JulianKrgd/wikipedia-en-julian

Source: Hugging Face. Updated 2026-01-15; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/JulianKrgd/wikipedia-en-julian
Description:
---
language:
- en
license: cc-by-sa-3.0
task_categories:
- text-generation
- fill-mask
size_categories:
- 1M<n<10M
tags:
- wikipedia
- language-modeling
pretty_name: Wikipedia English for JULIAN
dataset_info:
  features:
  - name: title
    dtype: string
  - name: text
    dtype: string
  - name: url
    dtype: string
  - name: language
    dtype: string
  splits:
  - name: train
    num_examples: 3289977
---

# Wikipedia English - JULIAN Training Dataset

This dataset contains cleaned English Wikipedia articles used to train the **JULIAN-100M** language model.

## Dataset Description

- **Language**: English
- **Source**: Wikipedia dumps (latest available)
- **Size**: ~3.5 billion tokens (~9.8GB JSONL, ~2-3GB Parquet)
- **Format**: Cleaned articles with title, text, and URL
- **License**: Creative Commons Attribution-ShareAlike 3.0

## Dataset Structure

### Data Fields

- `title` (string): Article title
- `text` (string): Full article text (cleaned and formatted)
- `url` (string): Original Wikipedia URL
- `language` (string): Language code ("en")

### Data Example

```json
{
  "title": "Artificial Intelligence",
  "text": "Artificial intelligence (AI) is intelligence demonstrated by machines...",
  "url": "https://en.wikipedia.org/wiki/Artificial_Intelligence",
  "language": "en"
}
```

## Data Collection

### Source

Downloaded from [Wikimedia dumps](https://dumps.wikimedia.org/enwiki/) (English Wikipedia).

### Processing Pipeline

1. **Download**: Latest Wikipedia XML dump
2. **Extraction**: Parse XML, extract article text
3. **Cleaning**:
   - Remove Wiki markup and templates
   - Remove infoboxes and navigation elements
   - Clean HTML entities and special characters
   - Remove very short articles (<50 characters)
   - Remove duplicate content
4. **Filtering**:
   - Keep only main namespace articles
   - Remove disambiguation and redirect pages
   - Filter low-quality content
5. **Formatting**: Convert to JSONL with structured fields

### Statistics

| Metric | Value |
|--------|-------|
| Total Articles | ~6.5 million |
| Total Tokens | ~3.5 billion |
| Average Article Length | ~540 tokens |
| Total Characters | ~21 billion |

## Usage

### Loading with Datasets Library

```python
from datasets import load_dataset

# Load the full dataset (downloads everything; supports indexing)
dataset = load_dataset("juliankerignard/wikipedia-en-julian", split="train")

# Example: get the first article
print(dataset[0]["title"])
print(dataset[0]["text"][:200])

# Alternatively, stream to avoid downloading the full dataset.
# A streamed dataset is iterable only, so use next(iter(...)) instead of [0].
streamed = load_dataset("juliankerignard/wikipedia-en-julian", split="train", streaming=True)
first = next(iter(streamed))
print(first["title"])
```

### Training Example

```python
from datasets import load_dataset
import sentencepiece as spm

# Load dataset (streamed)
dataset = load_dataset("juliankerignard/wikipedia-en-julian", split="train", streaming=True)

# Load tokenizer
tokenizer = spm.SentencePieceProcessor()
tokenizer.Load("julian_24k.model")

# Tokenize and prepare for training.
# With batched=True, examples["text"] is a list of strings; encode() accepts a list.
def tokenize_function(examples):
    return {"input_ids": tokenizer.encode(examples["text"], out_type=int)}

tokenized_dataset = dataset.map(tokenize_function, batched=True)
```

## Limitations and Bias

### Limitations

1. **Wikipedia Bias**: Reflects Wikipedia's editorial policies and contributor demographics
2. **Coverage Gaps**: Some topics are over-represented (technology, Western culture), others under-represented
3. **Temporal Snapshot**: Knowledge is frozen at the time of the dump
4. **Style Homogeneity**: Encyclopedia writing style, not conversational or creative writing

### Potential Biases

- **Geographic**: English Wikipedia has more coverage of English-speaking countries
- **Demographic**: Reflects Wikipedia editor demographics (primarily male, Western)
- **Topic**: Technology and pop culture are over-represented vs. non-Western topics
- **Recency**: Recent events have more coverage than historical topics

### Ethical Considerations

- Contains encyclopedic content, which may include sensitive topics
- Not suitable for training models to be used in high-stakes decision making
- Users should be aware of Wikipedia's known biases when using this dataset
- Recommended for research and educational purposes

## License

This dataset is derived from Wikipedia content, which is licensed under:

- **Creative Commons Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)**
- **GNU Free Documentation License (GFDL)**

See [Wikipedia's copyright policy](https://en.wikipedia.org/wiki/Wikipedia:Copyrights) for details.

## Citation

If you use this dataset, please cite:

```bibtex
@misc{julian_wikipedia_en_2025,
  title={Wikipedia English - JULIAN Training Dataset},
  author={Julian Kerignard},
  year={2025},
  howpublished={\url{https://huggingface.co/datasets/juliankerignard/wikipedia-en-julian}},
  note={Derived from English Wikipedia dumps}
}
```

Also cite the original Wikipedia content:

```bibtex
@misc{wikipedia_en,
  author = "{Wikipedia contributors}",
  title = "English Wikipedia",
  year = "2025",
  howpublished = {\url{https://en.wikipedia.org/}},
  note = "[Online; accessed DATE]"
}
```

## Related Resources

- **Model**: [JULIAN-100M](https://huggingface.co/juliankerignard/JULIAN-100M) - Trained on this dataset
- **French Dataset**: [wikipedia-fr-julian](https://huggingface.co/datasets/juliankerignard/wikipedia-fr-julian)
- **Tokenizer**: Included in JULIAN-100M model repository

## Contact

- **Author**: Julian Kerignard
- **HuggingFace**: https://huggingface.co/juliankerignard

---

**Note**: This is a research dataset created for training the JULIAN-100M language model. For the latest Wikipedia content, please visit [wikipedia.org](https://wikipedia.org).
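
To illustrate the cleaning and formatting steps listed in the processing pipeline of the dataset card, here is a minimal sketch. It is not the author's actual pipeline: the function names `clean_articles` and `to_jsonl` are hypothetical, exact-match hashing is assumed as the deduplication method (the card does not say how duplicates are detected), and only the 50-character threshold and the four output fields come from the card itself.

```python
import hashlib
import html
import json

MIN_CHARS = 50  # "very short articles (<50 characters)" per the dataset card


def clean_articles(articles):
    """Yield cleaned records, skipping short and exact-duplicate articles.

    Hypothetical helper: assumes each input is a dict with
    "title", "text", and "url" keys.
    """
    seen = set()  # SHA-256 digests of article texts already emitted
    for article in articles:
        # Decode HTML entities such as &amp; and trim surrounding whitespace
        text = html.unescape(article["text"]).strip()
        if len(text) < MIN_CHARS:
            continue  # drop very short articles
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # drop exact duplicates (assumed dedup strategy)
        seen.add(digest)
        yield {
            "title": article["title"],
            "text": text,
            "url": article["url"],
            "language": "en",
        }


def to_jsonl(records):
    """Serialize records as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
```

Calling `to_jsonl(clean_articles(raw_articles))` on an iterable of parsed articles would produce text in the same four-field layout as the data example in the card. Markup removal, namespace filtering, and quality filtering are omitted here because the card does not describe them in enough detail to sketch.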
Provider:
JulianKrgd