Cyro1/enwiki_pageviews_m
Source: Hugging Face. Updated 2026-01-25; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Cyro1/enwiki_pageviews_m
Resource description:
---
license: mit
task_categories:
- other
- text-classification
language:
- en
size_categories:
- 1M<n<10M
tags:
- wikipedia
- pageviews
- popularity
- knowledge-graph
- bias-detection
- multi-year
---
# English Wikipedia Pageviews (Multi-Year Average 2020-2023)
This dataset combines English Wikipedia article pageviews from 2020, 2021, 2022, and 2023 into a single multi-year average. Averaging across years gives a more stable popularity metric than any single year, because temporary spikes and short-lived trends carry less weight.
## Dataset Description
This dataset was created by:
1. Downloading individual yearly datasets (`Cyro1/enwiki_pageviews_YEAR_m`)
2. Combining all years for each article
3. Calculating the mean popularity and rank across all available years
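The steps above can be sketched with pandas. This is only an illustration, not the author's script: the toy frames stand in for the downloaded yearly datasets, and the per-year column names `popularity` and `rank` are assumptions, since this card does not list the yearly schema.

```python
import pandas as pd

# Toy stand-ins for two yearly datasets. The per-year column names
# `popularity` and `rank` are assumptions, not confirmed by this card.
y2020 = pd.DataFrame({
    "wikipedia_id": [1, 2],
    "wikipedia_title": ["A", "B"],
    "popularity": [100.0, 50.0],
    "rank": [1, 2],
})
y2021 = pd.DataFrame({
    "wikipedia_id": [1, 3],
    "wikipedia_title": ["A", "C"],
    "popularity": [300.0, 80.0],
    "rank": [1, 2],
})


def combine_years(frames):
    """Stack yearly frames, then average popularity and rank per article."""
    stacked = pd.concat(frames, ignore_index=True)
    return stacked.groupby("wikipedia_id", as_index=False).agg(
        wikipedia_title=("wikipedia_title", "first"),
        popularity_avg=("popularity", "mean"),
        rank_avg=("rank", "mean"),
        # Number of yearly frames in which the article appears.
        years_count=("popularity", "count"),
    )


combined = combine_years([y2020, y2021])
```

In this sketch an article that is missing from a year is simply left out of that year's contribution, which matches the "all available years" wording above; `years_count` records how many years went into each average.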
## Features
| Column | Type | Description |
|--------|------|-------------|
| `wikipedia_id` | int64 | Wikipedia article ID (page_id) |
| `wikipedia_title` | string | Wikipedia article title |
| `popularity_avg` | float64 | Average monthly pageviews, averaged over the years in which the article appears |
| `rank_avg` | float64 | Average rank over the years in which the article appears |
| `years_count` | int64 | Number of years this article appeared in (1-4) |
## Stats
- **Articles**: 5,903,530 Wikipedia articles
- **Years Combined**: 2020, 2021, 2022, 2023
- **Source**: Wikimedia pageview dumps, aggregated from yearly datasets
- **Split**: 90% train / 10% test (seed=42)
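Because the data ships as a 90/10 split, a lookup against the train split alone would miss roughly a tenth of the articles. A minimal sketch of recombining the two splits with pandas follows; the toy frames stand in for `ds["train"].to_pandas()` and `ds["test"].to_pandas()`, and it assumes each article appears in exactly one split.

```python
import pandas as pd

# Toy stand-ins for the train and test splits converted to DataFrames.
train_df = pd.DataFrame({
    "wikipedia_id": [1, 2, 3],
    "popularity_avg": [10.0, 20.0, 30.0],
})
test_df = pd.DataFrame({
    "wikipedia_id": [4],
    "popularity_avg": [40.0],
})

# Stack both splits so lookups by wikipedia_id cover every article.
full_df = pd.concat([train_df, test_df], ignore_index=True)

# Sanity check for the assumption that no article is in both splits.
assert full_df["wikipedia_id"].is_unique
```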
## Advantages of Multi-Year Averaging
- **Reduced variance**: Averages out seasonal variations and temporary events
- **More stable metrics**: Better represents long-term article importance
- **Robust to outliers**: Single-year spikes have less impact
- **Better for research**: A steadier signal for studying structural popularity bias than any single year
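A small worked example makes the damping effect concrete. The numbers are hypothetical, chosen only for illustration: one article with a one-off spike in 2021.

```python
from statistics import mean

# Hypothetical average monthly pageviews for one article, per year.
# The 2021 value simulates a one-off news spike.
yearly_views = {2020: 1_000, 2021: 9_000, 2022: 1_200, 2023: 800}

# Multi-year average, as in the `popularity_avg` column.
multi_year_avg = mean(yearly_views.values())

# Year with the highest value, and the mean of the remaining years.
spike_year = max(yearly_views, key=yearly_views.get)
typical = mean(v for y, v in yearly_views.items() if y != spike_year)
```

Here the multi-year average is 3,000, well below the 9,000 spike but still three times the 1,000 typical of the other years. Averaging dampens a spike rather than removing it, which is why filtering on `years_count` or comparing against a single-year dataset can still be worthwhile.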
## Usage
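```python
import pandas as pd
from datasets import load_dataset

# Load the multi-year dataset (it ships as train and test splits)
ds = load_dataset("Cyro1/enwiki_pageviews_m")
# Note: the train split holds about 90% of the articles
df = ds["train"].to_pandas()

# `your_df` is a placeholder for your own DataFrame, which is expected
# to store Wikipedia page IDs in a `document_id` column
your_df = pd.DataFrame({"document_id": [12, 25]})

# Merge with your dataset
merged = your_df.merge(
    df[["wikipedia_id", "popularity_avg", "rank_avg", "years_count"]],
    left_on="document_id",
    right_on="wikipedia_id",
    how="left",
)

# Filter to articles present in all years for the most stable metrics
stable_articles = df[df["years_count"] == 4]
```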
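After a left merge it is worth checking how many of your rows actually found a pageview record. A hedged sketch using the `indicator` option of `DataFrame.merge` follows; the toy frames and their values are made up for illustration, with `your_df` and `document_id` playing the same placeholder roles as in the usage example.

```python
import pandas as pd

# Toy stand-ins: `your_df` holds page IDs in `document_id`, and `df`
# plays the role of the loaded pageviews table.
your_df = pd.DataFrame({"document_id": [1, 2, 99]})
df = pd.DataFrame({
    "wikipedia_id": [1, 2, 3],
    "popularity_avg": [120.5, 40.0, 7.25],
})

# indicator=True adds a `_merge` column describing where each row matched.
merged = your_df.merge(
    df,
    left_on="document_id",
    right_on="wikipedia_id",
    how="left",
    indicator=True,
)

# Rows tagged "left_only" had no pageview record.
unmatched = merged[merged["_merge"] == "left_only"]
coverage = 1 - len(unmatched) / len(merged)
```

Unmatched rows carry `NaN` in `popularity_avg` and `rank_avg`, so decide explicitly whether to drop them or impute a value before using those columns.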
## Related Datasets
Individual yearly datasets are also available:
- [`Cyro1/enwiki_pageviews_2020_m`](https://huggingface.co/datasets/Cyro1/enwiki_pageviews_2020_m)
- [`Cyro1/enwiki_pageviews_2021_m`](https://huggingface.co/datasets/Cyro1/enwiki_pageviews_2021_m)
- [`Cyro1/enwiki_pageviews_2022_m`](https://huggingface.co/datasets/Cyro1/enwiki_pageviews_2022_m)
- [`Cyro1/enwiki_pageviews_2023_m`](https://huggingface.co/datasets/Cyro1/enwiki_pageviews_2023_m)
## Citation
If you use this dataset in your research, please cite the Wikimedia pageview dumps:
```bibtex
@misc{wikimedia_pageviews,
  title = {Wikimedia Downloads},
  author = {Wikimedia Foundation},
  year = {2024},
  url = {https://dumps.wikimedia.org/other/pageviews/}
}
```
## License
This dataset is released under the MIT license, following the terms of the Wikimedia pageview data.
Provider: Cyro1



