Cyro1/enwiki_pageviews_m

Hugging Face · Updated 2026-01-25 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/Cyro1/enwiki_pageviews_m
Resource description:
---
license: mit
task_categories:
- other
- text-classification
language:
- en
size_categories:
- 1M<n<10M
tags:
- wikipedia
- pageviews
- popularity
- knowledge-graph
- bias-detection
- multi-year
---

# English Wikipedia Pageviews (Multi-Year Average 2020-2023)

This dataset combines English Wikipedia article pageviews from 2020, 2021, 2022, and 2023 into a single multi-year average. Averaging across years gives more stable popularity metrics and reduces the impact of temporary spikes or trends.

## Dataset Description

This dataset was created by:

1. Downloading the individual yearly datasets (`Cyro1/enwiki_pageviews_YEAR_m`)
2. Combining all years for each article
3. Calculating the mean popularity and rank across all available years

## Features

| Column | Type | Description |
|--------|------|-------------|
| `wikipedia_id` | int64 | Wikipedia article ID (page_id) |
| `wikipedia_title` | string | Wikipedia article title |
| `popularity_avg` | float64 | Average monthly pageviews across all years |
| `rank_avg` | float64 | Average rank across all years |
| `years_count` | int64 | Number of years this article appeared in (1-4) |

## Stats

- **Articles**: 5,903,530 Wikipedia articles
- **Years Combined**: 2020, 2021, 2022, 2023
- **Source**: Wikimedia pageview dumps, aggregated from yearly datasets
- **Split**: 90% train / 10% test (seed=42)

## Advantages of Multi-Year Averaging

- **Reduced variance**: Averages out seasonal variations and temporary events
- **More stable metrics**: Better represents long-term article importance
- **Robust to outliers**: Single-year spikes have less impact
- **Better for research**: More suitable for studying structural popularity bias

## Usage

```python
from datasets import load_dataset

# Load the multi-year dataset
ds = load_dataset("Cyro1/enwiki_pageviews_m")
df = ds["train"].to_pandas()

# Merge with your dataset
# (`your_df` is your own pandas DataFrame; its `document_id` column
# must hold Wikipedia page IDs)
merged = your_df.merge(
    df[["wikipedia_id", "popularity_avg", "rank_avg", "years_count"]],
    left_on="document_id",
    right_on="wikipedia_id",
    how="left"
)

# Filter to articles present in all years for most stable metrics
stable_articles = df[df["years_count"] == 4]
```

## Related Datasets

Individual yearly datasets are also available:

- [`Cyro1/enwiki_pageviews_2020_m`](https://huggingface.co/datasets/Cyro1/enwiki_pageviews_2020_m)
- [`Cyro1/enwiki_pageviews_2021_m`](https://huggingface.co/datasets/Cyro1/enwiki_pageviews_2021_m)
- [`Cyro1/enwiki_pageviews_2022_m`](https://huggingface.co/datasets/Cyro1/enwiki_pageviews_2022_m)
- [`Cyro1/enwiki_pageviews_2023_m`](https://huggingface.co/datasets/Cyro1/enwiki_pageviews_2023_m)

## Citation

If you use this dataset in your research, please cite the Wikimedia pageview dumps:

```
@misc{wikimedia_pageviews,
  title = {Wikimedia Downloads},
  author = {Wikimedia Foundation},
  year = {2024},
  url = {https://dumps.wikimedia.org/other/pageviews/}
}
```

## License

This dataset is released under the MIT license, following the terms of the Wikimedia pageview data.
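The aggregation steps listed under Dataset Description (combine the yearly datasets per article, then average popularity and rank over the years in which the article appears) can be illustrated with a minimal sketch. This is not the author's actual pipeline. The per-year column names `popularity` and `rank`, the function `combine_years`, and the toy rows are assumptions made for illustration, and the sketch assumes that years in which an article is missing are simply left out of its average. It uses only the standard library so it runs without downloading anything.

```python
from collections import defaultdict


def combine_years(yearly):
    """Average per-article popularity and rank across years.

    `yearly` maps a year to a list of row dicts with the (assumed) keys
    wikipedia_id, wikipedia_title, popularity, rank.
    """
    # Group every yearly row under its article ID.
    grouped = defaultdict(list)
    for rows in yearly.values():
        for row in rows:
            grouped[row["wikipedia_id"]].append(row)

    combined = []
    for page_id, rows in grouped.items():
        n = len(rows)  # number of years the article appeared in
        combined.append({
            "wikipedia_id": page_id,
            # Keep the title from the last year seen (arbitrary choice here).
            "wikipedia_title": rows[-1]["wikipedia_title"],
            "popularity_avg": sum(r["popularity"] for r in rows) / n,
            "rank_avg": sum(r["rank"] for r in rows) / n,
            "years_count": n,
        })

    # Most popular articles first.
    combined.sort(key=lambda r: r["popularity_avg"], reverse=True)
    return combined


# Toy input: article 1 appears in both years, article 2 only in 2020.
toy_years = {
    2020: [
        {"wikipedia_id": 1, "wikipedia_title": "A", "popularity": 100.0, "rank": 1},
        {"wikipedia_id": 2, "wikipedia_title": "B", "popularity": 40.0, "rank": 2},
    ],
    2021: [
        {"wikipedia_id": 1, "wikipedia_title": "A", "popularity": 300.0, "rank": 1},
    ],
}

combined = combine_years(toy_years)
for row in combined:
    print(row)
```

In this toy input, article 1 averages its two yearly values and gets `years_count` 2, while article 2 keeps its single 2020 value with `years_count` 1. This is why filtering on `years_count == 4` in the real dataset selects the articles whose averages cover every year.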
Provider:
Cyro1