JulianKrgd/wikipedia-en-julian
Hugging Face, updated 2026-01-15, indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/JulianKrgd/wikipedia-en-julian
---
language:
- en
license: cc-by-sa-3.0
task_categories:
- text-generation
- fill-mask
size_categories:
- 1M<n<10M
tags:
- wikipedia
- language-modeling
pretty_name: Wikipedia English for JULIAN
dataset_info:
  features:
  - name: title
    dtype: string
  - name: text
    dtype: string
  - name: url
    dtype: string
  - name: language
    dtype: string
  splits:
  - name: train
    num_examples: 3289977
---
# Wikipedia English - JULIAN Training Dataset
This dataset contains cleaned English Wikipedia articles used to train the **JULIAN-100M** language model.
## Dataset Description
- **Language**: English
- **Source**: Wikipedia dumps (latest available at collection time)
- **Size**: ~3.5 billion tokens (~9.8 GB as JSONL, ~2-3 GB as Parquet)
- **Format**: Cleaned articles with title, text, and URL
- **License**: Creative Commons Attribution-ShareAlike 3.0
## Dataset Structure
### Data Fields
- `title` (string): Article title
- `text` (string): Full article text (cleaned and formatted)
- `url` (string): Original Wikipedia URL
- `language` (string): Language code ("en")
### Data Example
```json
{
  "title": "Artificial Intelligence",
  "text": "Artificial intelligence (AI) is intelligence demonstrated by machines...",
  "url": "https://en.wikipedia.org/wiki/Artificial_Intelligence",
  "language": "en"
}
```
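A minimal sketch of how a record in this shape could be checked when reading the JSONL export. The `validate_record` helper and the `EXPECTED_FIELDS` mapping are hypothetical names introduced for illustration; only the four field names and their string types come from the list above.

```python
import json

# Field names and types taken from the "Data Fields" list above.
EXPECTED_FIELDS = {"title": str, "text": str, "url": str, "language": str}


def validate_record(line: str) -> dict:
    """Parse one JSONL line and check that the documented fields are present as strings."""
    record = json.loads(line)
    missing = EXPECTED_FIELDS.keys() - record.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    for name, expected_type in EXPECTED_FIELDS.items():
        if not isinstance(record[name], expected_type):
            raise TypeError(f"field {name!r} should be {expected_type.__name__}")
    return record


# Sample line built from the example record above.
sample_line = json.dumps({
    "title": "Artificial Intelligence",
    "text": "Artificial intelligence (AI) is intelligence demonstrated by machines...",
    "url": "https://en.wikipedia.org/wiki/Artificial_Intelligence",
    "language": "en",
})
record = validate_record(sample_line)
print(record["title"])
```

Extra fields are tolerated here; a stricter check could reject them if the schema is meant to be closed.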
## Data Collection
### Source
Downloaded from [Wikimedia dumps](https://dumps.wikimedia.org/enwiki/) (English Wikipedia).
### Processing Pipeline
1. **Download**: Latest Wikipedia XML dump
2. **Extraction**: Parse XML, extract article text
3. **Cleaning**:
   - Remove Wiki markup and templates
   - Remove infoboxes and navigation elements
   - Clean HTML entities and special characters
   - Remove very short articles (<50 characters)
   - Remove duplicate content
4. **Filtering**:
   - Keep only main namespace articles
   - Remove disambiguation and redirect pages
   - Filter low-quality content
5. **Formatting**: Convert to JSONL with structured fields
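Some of the steps above can be sketched with the standard library alone. This is an illustrative approximation, not the pipeline actually used to build the dataset: the helper names `clean_text` and `build_records` are hypothetical, duplicates are detected by exact SHA-256 match only, whitespace collapsing is a simplification, and the URL is derived from the title as an assumption. Markup removal, namespace filtering, and quality filtering are not covered.

```python
import hashlib
import html
import json
import re
from urllib.parse import quote

MIN_CHARS = 50  # threshold from the "Cleaning" step above


def clean_text(raw: str) -> str:
    """Decode HTML entities and collapse runs of whitespace (a simplification)."""
    text = html.unescape(raw)
    return re.sub(r"\s+", " ", text).strip()


def build_records(articles):
    """Yield cleaned records, skipping short articles and exact duplicates."""
    seen_hashes = set()
    for title, raw_text in articles:
        text = clean_text(raw_text)
        if len(text) < MIN_CHARS:
            continue  # drop very short articles
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # drop exact duplicate content
        seen_hashes.add(digest)
        yield {
            "title": title,
            "text": text,
            # Assumption: URL derived from the title, spaces replaced by underscores.
            "url": "https://en.wikipedia.org/wiki/" + quote(title.replace(" ", "_")),
            "language": "en",
        }


body = ("Alan Turing was an English mathematician &amp; computer scientist, "
        "widely regarded as a founder of the field.")
articles = [
    ("Alan Turing", body),
    ("Stub", "Too short."),
    ("Turing copy", body),  # same text as the first article
]

# One JSON object per line, as in the "Formatting" step.
jsonl_lines = [json.dumps(r, ensure_ascii=False) for r in build_records(articles)]
print(len(jsonl_lines))
```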
### Statistics
| Metric | Value |
|--------|-------|
| Total Articles | ~6.5 million |
| Total Tokens | ~3.5 billion |
| Average Article Length | ~540 tokens |
| Total Characters | ~21 billion |
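As a quick sanity check, the averages implied by the rounded totals in the table can be recomputed. The inputs are the approximate values stated above, so the results are approximate as well.

```python
# Rounded totals from the statistics table.
total_articles = 6.5e6
total_tokens = 3.5e9
total_chars = 21e9

avg_tokens_per_article = total_tokens / total_articles
avg_chars_per_token = total_chars / total_tokens
avg_chars_per_article = total_chars / total_articles

print(f"{avg_tokens_per_article:.0f} tokens per article")
print(f"{avg_chars_per_token:.1f} characters per token")
print(f"{avg_chars_per_article:.0f} characters per article")
```

The first figure comes out near 538, which matches the ~540 tokens listed as the average article length.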
## Usage
### Loading with Datasets Library
```python
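from datasets import load_dataset

# Load the full dataset (downloaded and cached locally, supports indexing)
dataset = load_dataset("juliankerignard/wikipedia-en-julian", split="train")

# Example: get the first article
print(dataset[0]["title"])
print(dataset[0]["text"][:200])

# Alternatively, stream the dataset without downloading it in full.
# A streamed dataset is iterable but not indexable, so take the first item with next(iter(...)).
streamed = load_dataset("juliankerignard/wikipedia-en-julian", split="train", streaming=True)
first_article = next(iter(streamed))
print(first_article["title"])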
```
### Training Example
```python
from datasets import load_dataset
import sentencepiece as spm

# Stream the dataset so it is not downloaded in full
dataset = load_dataset("juliankerignard/wikipedia-en-julian", split="train", streaming=True)

# Load the SentencePiece tokenizer
tokenizer = spm.SentencePieceProcessor()
tokenizer.Load("julian_24k.model")

# Tokenize a batch of articles; with batched=True, examples["text"] is a list of strings
def tokenize_function(examples):
    return {"input_ids": tokenizer.encode(examples["text"], out_type=int)}

tokenized_dataset = dataset.map(tokenize_function, batched=True)
```
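For language-model training, tokenized articles are often concatenated and cut into fixed-length blocks. The sketch below shows one common way to do this in plain Python. It is an assumption for illustration, not a description of how JULIAN-100M was trained: `pack_into_blocks`, `block_size`, and `eos_id` are hypothetical names and values.

```python
from typing import Iterable, Iterator, List


def pack_into_blocks(
    token_streams: Iterable[List[int]], block_size: int, eos_id: int
) -> Iterator[List[int]]:
    """Concatenate token ID lists, separated by eos_id, and yield full blocks."""
    buffer: List[int] = []
    for ids in token_streams:
        buffer.extend(ids)
        buffer.append(eos_id)  # mark the article boundary
        while len(buffer) >= block_size:
            yield buffer[:block_size]
            buffer = buffer[block_size:]
    # Any leftover tokens shorter than block_size are dropped.


# Toy token IDs standing in for tokenized articles.
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
blocks = list(pack_into_blocks(docs, block_size=4, eos_id=0))
print(blocks)
```

Because the function only consumes an iterable, it can be fed from a streamed dataset, for example with `(ex["input_ids"] for ex in tokenized_dataset)`.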
## Limitations and Bias
### Limitations
1. **Wikipedia Bias**: Reflects Wikipedia's editorial policies and contributor demographics
2. **Coverage Gaps**: Some topics are over-represented (technology, Western culture), others under-represented
3. **Temporal Snapshot**: Knowledge is frozen at the time of the dump
4. **Style Homogeneity**: Encyclopedia writing style, not conversational or creative writing
### Potential Biases
- **Geographic**: English Wikipedia has more coverage of English-speaking countries
- **Demographic**: Reflects Wikipedia editor demographics (primarily male, Western)
- **Topic**: Technology and pop culture are over-represented vs. non-Western topics
- **Recency**: Recent events have more coverage than historical topics
### Ethical Considerations
- Contains encyclopedic content, which may include sensitive topics
- Not suitable for training models intended for high-stakes decision-making
- Users should be aware of Wikipedia's known biases when using this dataset
- Recommended for research and educational purposes
## License
This dataset is derived from Wikipedia content, which is licensed under:
- **Creative Commons Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)**
- **GNU Free Documentation License (GFDL)**
See [Wikipedia's copyright policy](https://en.wikipedia.org/wiki/Wikipedia:Copyrights) for details.
## Citation
If you use this dataset, please cite:
```bibtex
@misc{julian_wikipedia_en_2025,
  title={Wikipedia English - JULIAN Training Dataset},
  author={Julian Kerignard},
  year={2025},
  howpublished={\url{https://huggingface.co/datasets/juliankerignard/wikipedia-en-julian}},
  note={Derived from English Wikipedia dumps}
}
```
Also cite the original Wikipedia content:
```bibtex
@misc{wikipedia_en,
  author = "{Wikipedia contributors}",
  title = "English Wikipedia",
  year = "2025",
  howpublished = {\url{https://en.wikipedia.org/}},
  note = "[Online; accessed DATE]"
}
```
## Related Resources
- **Model**: [JULIAN-100M](https://huggingface.co/juliankerignard/JULIAN-100M) - Trained on this dataset
- **French Dataset**: [wikipedia-fr-julian](https://huggingface.co/datasets/juliankerignard/wikipedia-fr-julian)
- **Tokenizer**: Included in JULIAN-100M model repository
## Contact
- **Author**: Julian Kerignard
- **HuggingFace**: https://huggingface.co/juliankerignard
---
**Note**: This is a research dataset created for training the JULIAN-100M language model. For the latest Wikipedia content, please visit [wikipedia.org](https://wikipedia.org).
Provider: JulianKrgd



