semvec/pypi-packages

Hugging Face · Updated 2026-02-23 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/semvec/pypi-packages
Resource description:
---
language:
- en
license: apache-2.0
size_categories:
- 100K<n<1M
task_categories:
- tabular-classification
- text-classification
tags:
- pypi
- python
- software-engineering
- metadata
pretty_name: PyPI Download and Package Analysis
configs:
- config_name: default
  data_files:
  - split: train
    path: packages_data/*.parquet
dataset_info:
  features:
  - name: metadata_version
    dtype: string
  - name: name
    dtype: string
  - name: version
    dtype: string
  - name: summary
    dtype: string
  - name: description
    dtype: string
  - name: description_content_type
    dtype: string
  - name: author
    dtype: string
  - name: author_email
    dtype: string
  - name: maintainer
    dtype: string
  - name: maintainer_email
    dtype: string
  - name: license
    dtype: string
  - name: keywords
    dtype: string
  - name: classifiers
    sequence: string
  - name: platform
    sequence: string
  - name: home_page
    dtype: string
  - name: download_url
    dtype: string
  - name: requires_python
    dtype: string
  - name: requires
    sequence: string
  - name: provides
    sequence: string
  - name: obsoletes
    sequence: string
  - name: requires_dist
    sequence: string
  - name: provides_dist
    sequence: string
  - name: obsoletes_dist
    sequence: string
  - name: requires_external
    sequence: string
  - name: project_urls
    sequence: string
  - name: uploaded_via
    dtype: string
  - name: upload_time
    dtype: timestamp[us]
  - name: filename
    dtype: string
  - name: size
    dtype: int64
  - name: path
    dtype: string
  - name: python_version
    dtype: string
  - name: packagetype
    dtype: string
  - name: comment_text
    dtype: string
  - name: has_signature
    dtype: bool
  - name: md5_digest
    dtype: string
  - name: sha256_digest
    dtype: string
  - name: blake2_256_digest
    dtype: string
  - name: license_expression
    dtype: string
  - name: license_files
    sequence: string
  - name: recent_7d_downloads
    dtype: int64
---

# PyPI Download and Package Analysis

A comprehensive snapshot of the Python Package Index (PyPI), covering **690,775 packages** published from April 2005 through February 2026.
Each row represents a published package release, enriched with full metadata from the PyPI API and recent download statistics from the PyPI BigQuery public dataset.

## Dataset at a Glance

| Stat | Value |
|---|---|
| Total packages | 690,775 |
| Date range | April 2005 – February 2026 |
| Total 7-day downloads | ~13.3 billion |
| Format | Parquet (15 shards) |
| License | Apache-2.0 |

## Data Fields

| Field | Type | Description |
|---|---|---|
| `name` | string | Package name on PyPI (unique identifier) |
| `version` | string | Release version string (PEP 440) |
| `summary` | string | One-line description of the package |
| `description` | string | Full project description / README text |
| `description_content_type` | string | MIME type of the description (e.g. `text/markdown`) |
| `author` | string | Primary author name |
| `author_email` | string | Primary author email |
| `maintainer` | string | Maintainer name (if different from author) |
| `maintainer_email` | string | Maintainer email |
| `license` | string | License string as declared by the author |
| `keywords` | string | Space- or comma-separated keywords |
| `classifiers` | list[string] | PyPI trove classifiers (e.g. `Programming Language :: Python :: 3`) |
| `platform` | list[string] | Target platforms declared by the author |
| `home_page` | string | Project homepage URL |
| `download_url` | string | Direct download URL (if provided) |
| `requires_python` | string | Python version constraint (e.g. `>=3.8`) |
| `requires` | list[string] | Runtime dependencies |
| `project_urls` | list[string] | Additional URLs (source, docs, tracker, etc.) |
| `upload_time` | timestamp | UTC timestamp of when this release was uploaded |
| `size` | int64 | Size of the distribution file in bytes |
| `packagetype` | string | Distribution type: `sdist`, `bdist_wheel`, etc. |
| `metadata_version` | string | Metadata specification version |
| `recent_7d_downloads` | int64 | Total downloads in the most recent 7-day window |

## Usage

```python
from datasets import load_dataset

# Load the "train" split and convert it to a pandas DataFrame
ds = load_dataset("semvec/pypi-packages")
df = ds["train"].to_pandas()

# Top 10 most downloaded packages
top10 = df.sort_values("recent_7d_downloads", ascending=False).head(10)
print(top10[["name", "summary", "recent_7d_downloads"]])
```

## Example Use Cases

- **Trend Analysis**: Track adoption of ecosystems (AI/ML, web frameworks, DevOps tooling) by filtering classifiers and plotting `upload_time` vs. cumulative package count.
- **Package Classification / NLP**: Use `summary` and `description` to train text classifiers or summarization models that categorize packages by domain.
- **Dependency Graph Research**: Parse `requires` to construct a directed dependency graph of the entire Python ecosystem.
- **Popularity Modeling**: Predict `recent_7d_downloads` from metadata features like `requires_python`, `classifiers`, description length, and age.
- **License Compliance**: Audit license diversity across the ecosystem and identify packages with missing or ambiguous license declarations.
- **Author & Maintainer Analysis**: Study open-source contribution patterns, prolific authors, and package maintainer turnover over time.

## Data Collection

Metadata was fetched from the [PyPI JSON API](https://pypi.org/pypi/{package}/json) for every package listed in the PyPI simple index. Download counts were sourced from the [PyPI public BigQuery dataset](https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=pypi&page=dataset) (`bigquery-public-data.pypi.file_downloads`), aggregated over the 7 days preceding the collection date (February 2026).
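
As a rough illustration of the Dependency Graph Research use case listed above, the sketch below turns dependency strings into a package-to-dependencies mapping. It assumes that entries in `requires` resemble PEP 508-style requirement strings (for example `requests (>=2.0)`); this is an assumption about the data, not something the dataset card states. The helper names (`normalize`, `requirement_name`, `build_dependency_graph`) and the sample rows are hypothetical and exist only for this example.

```python
import re
from collections import defaultdict

# Leading project name of a requirement string (letters, digits, ".", "_", "-").
# Assumption: dependency entries start with the project name, as in PEP 508.
_NAME_RE = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)")


def normalize(name):
    """Normalize a project name following the PEP 503 rule."""
    return re.sub(r"[-_.]+", "-", name).lower()


def requirement_name(requirement):
    """Return the normalized project name of one requirement string, or None."""
    match = _NAME_RE.match(requirement)
    return normalize(match.group(1)) if match else None


def build_dependency_graph(rows):
    """Map each package name to the set of package names it depends on."""
    graph = defaultdict(set)
    for row in rows:
        package = normalize(row["name"])
        graph[package]  # register the node even if it has no dependencies
        requirements = row.get("requires")
        for requirement in (requirements if requirements is not None else []):
            dependency = requirement_name(requirement)
            if dependency and dependency != package:
                graph[package].add(dependency)
    return dict(graph)


# Hypothetical rows for illustration only; not taken from the dataset.
sample_rows = [
    {"name": "Example_App",
     "requires": ["requests (>=2.0)", "NumPy>=1.20; python_version >= '3.8'"]},
    {"name": "requests", "requires": ["urllib3<3,>=1.21.1"]},
    {"name": "numpy", "requires": None},
]

graph = build_dependency_graph(sample_rows)
for package, dependencies in sorted(graph.items()):
    print(package, "->", sorted(dependencies))
```

With the dataset loaded as in the Usage section, passing `ds["train"]` to `build_dependency_graph` should work in principle, since iterating a split yields one dict per row. If `requires` turns out to be sparsely populated, the `requires_dist` field from the schema may be a better input; which of the two carries the dependencies for a given release is not specified in the card.
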
## Citation

If you use this dataset in your research, please cite it as:

```
@dataset{pypi_packages_2026,
  title   = {PyPI Download and Package Analysis},
  author  = {semvec},
  year    = {2026},
  url     = {https://huggingface.co/datasets/semvec/pypi-packages},
  license = {Apache-2.0}
}
```
Provider:
semvec