open-index/open-npm

Hugging Face | Updated 2026-04-11 | Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/open-index/open-npm
Resource description:
---
license: odc-by
task_categories:
- text-classification
- feature-extraction
language:
- en
pretty_name: npm Registry - Complete Package Archive
size_categories:
- 1M<n<10M
tags:
- npm
- javascript
- typescript
- nodejs
- packages
- registry
- parquet
- open-source
configs:
- config_name: packages
  data_files:
  - split: train
    path: data/packages/*.parquet
- config_name: versions
  data_files:
  - split: train
    path: data/versions/*.parquet
- config_name: maintainers
  data_files:
  - split: train
    path: data/maintainers/*.parquet
- config_name: keywords
  data_files:
  - split: train
    path: data/keywords/*.parquet
- config_name: dependencies
  data_files:
  - split: train
    path: data/dependencies/*.parquet
- config_name: downloads
  data_files:
  - split: train
    path: data/downloads/*.parquet
- config_name: download_days
  data_files:
  - split: train
    path: data/download_days/*.parquet
- config_name: version_downloads
  data_files:
  - split: train
    path: data/version_downloads/*.parquet
---

# npm Registry - Complete Package Archive

> Every npm package with full metadata, versions, dependencies, and download stats

## Table of Contents

- [What is it?](#what-is-it)
- [What is being released?](#what-is-being-released)
- [Dataset statistics](#dataset-statistics)
- [Ecosystem snapshot](#ecosystem-snapshot)
- [How to download and use this dataset](#how-to-download-and-use-this-dataset)
- [Dataset card](#dataset-card-for-npm-registry---complete-package-archive)
- [Dataset summary](#dataset-summary)
- [Dataset structure](#dataset-structure)
- [Dataset creation](#dataset-creation)
- [Considerations for using the data](#considerations-for-using-the-data)
- [Additional information](#additional-information)

## What is it?

This dataset contains a comprehensive snapshot of the [npm registry](https://www.npmjs.com), the registry behind npm (the default package manager for Node.js) and the largest software package registry in the world. npm hosts millions of packages and serves billions of downloads every week.
If you have ever run `npm install`, you have used the registry that this dataset mirrors.

The archive currently contains **763,956 packages** with **21,903,256 published versions**, maintained by **232,552 unique maintainers**. Every field needed to recreate the npmjs.com package detail page is included: core metadata, all historical versions, maintainer lists, keyword tags, dependency graphs, download statistics, and search quality scores.

We built this dataset because npm registry data is scattered across four separate APIs (registry, replication, downloads, search), each with different rate limits, pagination schemes, and response formats. Researchers studying the JavaScript ecosystem typically need to build their own crawlers and deal with these complexities from scratch. Having everything in a single, queryable Parquet archive makes it straightforward to analyze the JavaScript ecosystem at scale. No API keys, no rate limits, no pagination.

## What is being released?

The dataset is organized as 8 tables, each split into numbered Parquet shards for incremental updates. All files use Zstandard compression and are sorted by primary key for efficient range scans.
```
data/
  packages/0000.parquet            core metadata (35+ columns per package)
  packages/0001.parquet            100k rows per shard, sorted by name
  versions/0000.parquet            every published version (35+ columns)
  versions/0001.parquet            500k rows per shard
  maintainers/0000.parquet         package maintainers
  keywords/0000.parquet            keyword tags
  dependencies/0000.parquet        all dependency types
  downloads/0000.parquet           point download totals
  download_days/0000.parquet       daily breakdown (sparkline data)
  version_downloads/0000.parquet   per-version download counts
```

## Dataset statistics

| Table | Rows | Description |
|-------|-----:|-------------|
| packages | 763,956 | One row per package: name, description, license, readme, author, types, scores |
| versions | 21,903,256 | Every published version: entry points, dist info, engine requirements |
| maintainers | 1,707,762 | Package-maintainer relationships |
| keywords | 2,458,359 | Package-keyword relationships |
| dependencies | 8,486,201 | Runtime, dev, peer, optional, and bundled dependencies |
| downloads | 0 | Point download totals (last-day, last-week, last-month) |
| download_days | 0 | Daily download counts for sparkline charts |
| version_downloads | 0 | Per-version download breakdown |

**Versions per package:** median 4, p90 39, p99 456, max 31,168 (average 28.9). Most packages ship only a handful of releases, but some heavily maintained libraries have hundreds.

**ESM adoption:** 26.3% of packages set `"type": "module"` in their package.json.

**TypeScript:** 47.7% of packages bundle their own type declarations.

**README coverage:** 87.8% of packages include a README.

## Crawl in progress

This dataset is being actively built. Our crawler is working its way through the entire npm registry (all 3,914,826 known packages), fetching full metadata, every published version, maintainer lists, keywords, and dependency graphs for each one.

So far we have crawled **1,145,551** of **3,914,826** packages (29.3%), with **2,769,275** remaining.
The crawler is running at roughly **35 packages per second**. At the current pace, we expect the initial crawl to finish around **Apr 12, 2026 10:31 UTC**. Once complete, this notice will disappear and the dataset will reflect the full registry.

We publish incremental snapshots every 30 minutes, so the data you see here is already usable; it just isn't the whole picture yet.

## Ecosystem snapshot

### License distribution

The npm ecosystem leans heavily on permissive licenses. MIT dominates by a wide margin, followed by ISC and Apache-2.0. A small but notable number of packages ship without any license at all, which can create compliance headaches for downstream consumers.

```
MIT                      ██████████████████████████████ 407,023
ISC                      ██████████ 137,358
Apache-2.0               ████ 52,162
UNLICENSED               █ 13,278
BSD-3-Clause             4,940
GPL-3.0                  4,447
SEE LICENSE IN LICENSE   4,038
GPL-3.0-or-later         2,996
MPL-2.0                  2,529
OFL-1.1                  2,436
```

### Most used keywords

Keywords give a rough map of what people are building. `react` and `typescript` lead the tag cloud, in line with the frameworks and tools that drive the most npm traffic. They are followed by `hfc` and `hyper-function-component`, which have identical counts, and then by general-purpose tags such as `cli`, `mcp`, and `ai`.

```
react                      ██████████████████████████████ 34,260
typescript                 ███████████████████████████ 30,420
hfc                        ███████████████████████████ 30,273
hyper-function-component   ███████████████████████████ 30,273
cli                        ████████████ 13,751
mcp                        ███████████ 12,609
icon                       ██████████ 11,608
material                   ██████████ 11,568
fluentui                   ██████████ 11,230
ai                         ██████████ 11,201
javascript                 █████████ 9,993
plugin                     ████████ 8,721
api                        ████████ 8,567
emoji                      ███████ 8,385
react-native               ███████ 8,334
```

### Dependency type breakdown

The `dependencies` table covers all five npm dependency types. devDependencies make up the largest share, followed by runtime dependencies. Counts are per version, since the crawl includes every version of every package.
```
dev        ██████████████████████████████ 4,706,294
runtime    ████████████████████ 3,189,438
peer       ███ 536,043
optional   49,630
bundled    4,796
```

## How to download and use this dataset

The dataset uses the standard Hugging Face Parquet layout with one config per table. You can query it remotely with DuckDB, stream it with the `datasets` library, or download files individually.

### Using DuckDB

DuckDB can read Parquet files directly from Hugging Face without downloading anything first. This is the fastest way to explore the data.

```sql
-- Most popular packages by weekly downloads
SELECT name, description, latest_version, weekly_downloads, license
FROM read_parquet('hf://datasets/open-index/open-npm/data/packages/*.parquet')
WHERE weekly_downloads > 0
ORDER BY weekly_downloads DESC
LIMIT 20;
```

```sql
-- Packages with bundled TypeScript declarations
SELECT name, latest_version, weekly_downloads,
       CASE WHEN has_types THEN 'bundled' ELSE 'no' END AS typescript
FROM read_parquet('hf://datasets/open-index/open-npm/data/packages/*.parquet')
WHERE has_types = true
ORDER BY weekly_downloads DESC
LIMIT 20;
```

```sql
-- Most depended-upon packages (highest dependents count)
SELECT name, dependents_count, weekly_downloads, score_final
FROM read_parquet('hf://datasets/open-index/open-npm/data/packages/*.parquet')
WHERE dependents_count > 0
ORDER BY dependents_count DESC
LIMIT 20;
```

```sql
-- License distribution across the ecosystem
SELECT license, count(*) AS packages,
       round(count(*) * 100.0 / sum(count(*)) OVER (), 1) AS pct
FROM read_parquet('hf://datasets/open-index/open-npm/data/packages/*.parquet')
WHERE license IS NOT NULL AND license != ''
GROUP BY license
ORDER BY packages DESC
LIMIT 15;
```

```sql
-- Most prolific maintainers
SELECT username, count(*) AS packages
FROM read_parquet('hf://datasets/open-index/open-npm/data/maintainers/*.parquet')
GROUP BY username
ORDER BY packages DESC
LIMIT 20;
```

```sql
-- Distribution of versions per package (median, p90, p99, max)
SELECT
    percentile_disc(0.50) WITHIN GROUP (ORDER BY version_count) AS p50_versions,
    percentile_disc(0.90) WITHIN GROUP (ORDER BY version_count) AS p90_versions,
    percentile_disc(0.99) WITHIN GROUP (ORDER BY version_count) AS p99_versions,
    max(version_count) AS max_versions
FROM read_parquet('hf://datasets/open-index/open-npm/data/packages/*.parquet');
```

```sql
-- Most common runtime dependencies
SELECT dep_name, count(*) AS used_by
FROM read_parquet('hf://datasets/open-index/open-npm/data/dependencies/*.parquet')
WHERE dep_type = 'runtime'
GROUP BY dep_name
ORDER BY used_by DESC
LIMIT 20;
```

```sql
-- Join packages with downloads for a complete view
SELECT p.name, p.latest_version, p.license, d.count AS monthly_downloads
FROM read_parquet('hf://datasets/open-index/open-npm/data/packages/*.parquet') p
JOIN read_parquet('hf://datasets/open-index/open-npm/data/downloads/*.parquet') d
  ON p.name = d.name AND d.period = 'last-month'
ORDER BY d.count DESC
LIMIT 20;
```

### Using `datasets`

```python
from datasets import load_dataset

# Load the packages table
ds = load_dataset("open-index/open-npm", "packages", split="train")
print(f"{len(ds):,} packages")

# Stream all versions without downloading everything
ds = load_dataset("open-index/open-npm", "versions", split="train", streaming=True)
for row in ds:
    print(row["name"], row["version"])
```

### Using `huggingface_hub`

```python
from huggingface_hub import snapshot_download

# Download everything
snapshot_download(
    "open-index/open-npm",
    repo_type="dataset",
    local_dir="./npm/",
)

# Download only the packages table
snapshot_download(
    "open-index/open-npm",
    repo_type="dataset",
    local_dir="./npm/",
    allow_patterns="data/packages/*.parquet",
)
```

For faster downloads, run `pip install huggingface_hub[hf_transfer]` and set `HF_HUB_ENABLE_HF_TRANSFER=1`.
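To fetch several tables in one call, `allow_patterns` also accepts a list of globs. The sketch below builds that list from table names and checks it locally against paths taken from the layout shown earlier. The helper name `table_patterns` is hypothetical (it is not part of `huggingface_hub`), and the `fnmatch` check only approximates how the library filters files.

```python
from fnmatch import fnmatch

# Table names as listed in the dataset card's configs.
TABLES = {
    "packages", "versions", "maintainers", "keywords",
    "dependencies", "downloads", "download_days", "version_downloads",
}


def table_patterns(*tables: str) -> list[str]:
    """Return one glob per requested table (hypothetical helper)."""
    unknown = set(tables) - TABLES
    if unknown:
        raise ValueError(f"unknown table(s): {sorted(unknown)}")
    return [f"data/{table}/*.parquet" for table in tables]


patterns = table_patterns("packages", "dependencies")

# Local sanity check against paths from the documented layout; no network needed.
sample_paths = [
    "data/packages/0000.parquet",
    "data/dependencies/0000.parquet",
    "data/versions/0001.parquet",
]
selected = [p for p in sample_paths if any(fnmatch(p, pat) for pat in patterns)]
print(selected)

# To download, pass the list to snapshot_download (requires network access):
# from huggingface_hub import snapshot_download
# snapshot_download("open-index/open-npm", repo_type="dataset",
#                   local_dir="./npm/", allow_patterns=patterns)
```

Validating names up front means a typo raises an error instead of silently matching no files.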
### Using the CLI

```bash
# Download just the packages table
huggingface-cli download open-index/open-npm \
    --include "data/packages/*" \
    --repo-type dataset --local-dir ./npm/
```

### Using pandas + DuckDB

```python
import duckdb

conn = duckdb.connect()

# What percentage of packages are ESM modules?
df = conn.sql("""
    SELECT
        count(*) FILTER (WHERE type_module) AS esm_packages,
        count(*) AS total,
        round(count(*) FILTER (WHERE type_module) * 100.0 / count(*), 1) AS esm_pct
    FROM read_parquet('hf://datasets/open-index/open-npm/data/packages/*.parquet')
""").df()
print(df)
```

# Dataset card for npm Registry - Complete Package Archive

## Dataset summary

This dataset is a comprehensive snapshot of the [npm registry](https://www.npmjs.com), the registry behind npm (the default package manager for Node.js) and the largest software registry in the world. When the crawl is complete, it will cover every package currently published on npm, with full metadata, all historical versions, maintainer relationships, keyword tags, dependency graphs, and download statistics.

The dataset is built for research, analysis, and tooling.
Some things you can do with it:

- **Ecosystem analysis** of the JavaScript/TypeScript package landscape
- **Dependency graph research** for supply chain analysis and vulnerability propagation modeling
- **Popularity and adoption studies** using download statistics and quality scores
- **License compliance** auditing across dependency trees
- **Package discovery** and recommendation systems
- **Software engineering research** on versioning practices, maintenance patterns, and ecosystem health

## Dataset structure

### Data instances

Here is an example row from the `packages` table:

```json
{
  "name": "express",
  "description": "Fast, unopinionated, minimalist web framework",
  "license": "MIT",
  "latest_version": "4.21.2",
  "version_count": 276,
  "has_types": false,
  "type_module": false,
  "weekly_downloads": 35000000,
  "dependents_count": 72843,
  "score_final": 0.94,
  "repository_url": "https://github.com/expressjs/express",
  "homepage": "http://expressjs.com/",
  "created_at": "2010-12-29T19:38:25.450Z",
  "modified_at": "2025-03-26T02:42:37.718Z"
}
```

And a row from the `versions` table:

```json
{
  "name": "express",
  "version": "4.21.2",
  "main": "./index.js",
  "license": "MIT",
  "node_engine": ">= 0.10.0",
  "dist_unpack_size": 220983,
  "dist_file_count": 44,
  "dist_integrity": "sha512-28HqgMZAmih1Czt9ny7qr6ek2qddF4FclbMzwhCREB6OFfH+rXAnuNCiz1PcSezN6cfMi0+X2HoH6ETjPJOQQ==",
  "published_at": "2025-01-12T05:27:06.584Z"
}
```

### Data fields

#### packages (35+ columns)

| Column | Type | Description |
|--------|------|-------------|
| `name` | VARCHAR | Package name (primary key) |
| `description` | VARCHAR | Package description |
| `dist_tags_json` | VARCHAR | All dist-tags as JSON `{"latest":"5.2.1","next":"6.0.0-beta"}` |
| `license` | VARCHAR | SPDX license identifier |
| `homepage` | VARCHAR | Project homepage URL |
| `repository_type` | VARCHAR | Repository type (usually "git") |
| `repository_url` | VARCHAR | Repository URL |
| `repository_directory` | VARCHAR | Monorepo subdirectory path |
| `bugs_url` | VARCHAR | Bug tracker URL |
| `funding_json` | VARCHAR | Funding information as JSON |
| `readme` | VARCHAR | Full README content |
| `readme_filename` | VARCHAR | README filename (e.g. "README.md") |
| `author_name` | VARCHAR | Package author name |
| `author_url` | VARCHAR | Package author URL |
| `contributors_json` | VARCHAR | Contributors list as JSON |
| `publisher_username` | VARCHAR | Last publisher's npm username |
| `has_types` | BOOLEAN | Whether the package bundles TypeScript declarations |
| `type_module` | BOOLEAN | Whether package.json has `"type": "module"` (ESM) |
| `deprecated` | VARCHAR | Deprecation message (if deprecated) |
| `version_count` | INTEGER | Total number of published versions |
| `created_at` | TIMESTAMP | When the package was first published |
| `modified_at` | TIMESTAMP | When the package was last modified |
| `latest_version` | VARCHAR | Latest dist-tag version string |
| `latest_published_at` | TIMESTAMP | When the latest version was published |
| `latest_unpack_size` | BIGINT | Unpacked size of the latest version in bytes |
| `latest_file_count` | INTEGER | Number of files in the latest version |
| `latest_node_engine` | VARCHAR | Node.js engine requirement for latest version |
| `weekly_downloads` | BIGINT | Weekly download count (from search API) |
| `monthly_downloads` | BIGINT | Monthly download count (from search API) |
| `dependents_count` | BIGINT | Number of packages that depend on this one |
| `score_final` | DOUBLE | Overall quality score (0-1, from search API) |
| `score_quality` | DOUBLE | Code quality score component |
| `score_popularity` | DOUBLE | Popularity score component |
| `score_maintenance` | DOUBLE | Maintenance score component |
| `fetched_at` | TIMESTAMP | When this data was crawled |

#### versions (35+ columns)

| Column | Type | Description |
|--------|------|-------------|
| `name` | VARCHAR | Package name |
| `version` | VARCHAR | Semver version string |
| `description` | VARCHAR | Version-specific description |
| `main` | VARCHAR | CJS entry point |
| `module` | VARCHAR | ESM entry point |
| `types` | VARCHAR | TypeScript declarations entry |
| `exports_json` | VARCHAR | Conditional exports map as JSON |
| `bin_json` | VARCHAR | CLI binary mappings as JSON |
| `browser_json` | VARCHAR | Browser field overrides as JSON |
| `files_json` | VARCHAR | Included files list as JSON |
| `scripts_json` | VARCHAR | npm scripts as JSON |
| `os_json` | VARCHAR | Supported operating systems as JSON |
| `cpu_json` | VARCHAR | Supported CPU architectures as JSON |
| `workspaces_json` | VARCHAR | Monorepo workspaces as JSON |
| `peer_deps_meta_json` | VARCHAR | Peer dependency metadata as JSON |
| `funding_json` | VARCHAR | Per-version funding info as JSON |
| `license` | VARCHAR | Version-specific license |
| `deprecated` | VARCHAR | Deprecation message |
| `has_install_script` | BOOLEAN | Whether version has install scripts |
| `side_effects` | VARCHAR | Tree-shaking hint: "true", "false", or JSON array |
| `dist_tarball` | VARCHAR | Tarball download URL |
| `dist_shasum` | VARCHAR | SHA-1 hash of the tarball |
| `dist_integrity` | VARCHAR | Subresource Integrity hash |
| `dist_unpack_size` | BIGINT | Unpacked size in bytes |
| `dist_file_count` | INTEGER | Number of files in the tarball |
| `dist_signatures_json` | VARCHAR | ECDSA signatures as JSON |
| `dist_attest_url` | VARCHAR | Provenance attestation URL |
| `node_engine` | VARCHAR | Required Node.js version |
| `npm_version` | VARCHAR | npm CLI version used to publish |
| `node_version` | VARCHAR | Node.js version used to publish |
| `publisher_username` | VARCHAR | npm username of the publisher |
| `git_head` | VARCHAR | Git commit SHA at publish time |
| `has_shrinkwrap` | BOOLEAN | Whether npm-shrinkwrap.json is included |
| `published_at` | TIMESTAMP | When this version was published |

#### maintainers

| Column | Type | Description |
|--------|------|-------------|
| `name` | VARCHAR | Package name |
| `username` | VARCHAR | npm username |

#### keywords

| Column | Type | Description |
|--------|------|-------------|
| `name` | VARCHAR | Package name |
| `keyword` | VARCHAR | Keyword tag |

#### dependencies

| Column | Type | Description |
|--------|------|-------------|
| `name` | VARCHAR | Package name |
| `version` | VARCHAR | Version these deps belong to |
| `dep_type` | VARCHAR | One of: `runtime`, `dev`, `peer`, `optional`, `bundled` |
| `dep_name` | VARCHAR | Dependency package name |
| `dep_range` | VARCHAR | Semver range constraint |

#### downloads

| Column | Type | Description |
|--------|------|-------------|
| `name` | VARCHAR | Package name |
| `period` | VARCHAR | `last-day`, `last-week`, or `last-month` |
| `count` | BIGINT | Total downloads in the period |
| `start_date` | VARCHAR | Period start date (YYYY-MM-DD) |
| `end_date` | VARCHAR | Period end date (YYYY-MM-DD) |
| `fetched_at` | TIMESTAMP | When this was fetched |

#### download_days

| Column | Type | Description |
|--------|------|-------------|
| `name` | VARCHAR | Package name |
| `day` | VARCHAR | Date (YYYY-MM-DD) |
| `count` | BIGINT | Downloads on this day |

#### version_downloads

| Column | Type | Description |
|--------|------|-------------|
| `name` | VARCHAR | Package name |
| `version` | VARCHAR | Semver version string |
| `count` | BIGINT | Downloads for this version |
| `period` | VARCHAR | Time period |

### Data splits

Each table is a separate dataset configuration.
Load them by name:

```python
from datasets import load_dataset

# Load the packages table
packages = load_dataset("open-index/open-npm", "packages", split="train")

# Load the versions table
versions = load_dataset("open-index/open-npm", "versions", split="train")

# Load dependencies
dependencies = load_dataset("open-index/open-npm", "dependencies", split="train")
```

## Dataset creation

### Curation rationale

The npm registry is the backbone of the JavaScript ecosystem, but its data is fragmented across four different APIs with varying rate limits, pagination schemes, and response formats. Researchers studying the JavaScript ecosystem, dependency supply chains, or software engineering practices typically need to build their own crawlers and deal with all of this complexity themselves.

By publishing a complete, pre-crawled snapshot in Parquet format on Hugging Face, we make the entire registry immediately queryable with DuckDB (via `hf://` paths), streamable with the `datasets` library, and downloadable in bulk. No API keys, no rate limits, no pagination.

### Source data

All data is sourced from the official npm APIs:

- **Registry API** (`registry.npmjs.org`): Full packument (package document) for every package, containing all versions, maintainers, keywords, and dependencies
- **Replication API** (`replicate.npmjs.com`): `_all_docs` endpoint for complete package enumeration
- **Downloads API** (`api.npmjs.org`): Point download totals, daily breakdowns, and per-version download counts
- **Search API** (`registry.npmjs.org/-/v1/search`): Quality, popularity, and maintenance scores; dependents count; weekly/monthly download aggregates

### Data processing steps

The pipeline runs in five phases:

1. **Enumerate.** Page through `replicate.npmjs.com/registry/_all_docs` to collect all package names (about 320 requests for 3.2M+ packages). Names are stored in a local queue with resume support.
2. **Crawl.** Pop package names from the queue and fetch full packuments from `registry.npmjs.org/{name}`. Parse all versions (not just latest) and store metadata, versions, maintainers, keywords, and dependencies. Uses 20 concurrent workers with a single DB writer goroutine.
3. **Download stats.** Batch package names 128 at a time for point download totals. Separate pass for daily range data. Per-version downloads fetched individually.
4. **Search scores.** Fetch quality, popularity, and maintenance scores plus dependents count from the search API.
5. **Export and publish.** Export all 8 DuckDB tables to Parquet with Zstandard compression, generate this README with live statistics, and commit to Hugging Face.

No filtering or transformation is applied to the data beyond what the source APIs provide, other than removing email addresses during export (see below). All other fields are preserved as-is.

### Personal and sensitive information

This dataset contains npm usernames and user-generated text content (package descriptions, READMEs) as they appear in the public npm registry. Email addresses are stripped during export and are not included in the published Parquet files.

The data reflects what is publicly visible on [npmjs.com](https://www.npmjs.com). If you find content in this dataset that you believe should be removed, please open a discussion on the Community tab.

## Considerations for using the data

### Social impact

By providing a complete npm registry snapshot in an accessible format, we hope to enable research into software supply chain security, ecosystem health, and the dynamics of open source communities. The dataset can support tools for license compliance, vulnerability tracking, and dependency management.

### Discussion of biases

The npm registry reflects the practices and preferences of the JavaScript/TypeScript community. Packages that appear on npm represent only one segment of the broader software ecosystem. Download counts can be inflated by CI/CD pipelines and automated builds, and may not accurately reflect human usage.
Quality scores from the search API are computed by npm's own algorithms and reflect their specific weighting of factors. Deprecated packages, malicious packages that were unpublished, and packages with zero downloads are all included in the dataset as they appear in the registry.

### Known limitations

- **Snapshot, not live.** This is a point-in-time snapshot. Packages published or updated after the crawl date are not included.
- **README content is HTML/Markdown.** The `readme` field contains raw content as stored by npm, which may be Markdown, HTML, or plain text.
- **JSON columns are strings.** Fields like `dist_tags_json`, `exports_json`, `bin_json` etc. are stored as VARCHAR containing JSON. Parse them with `json_extract` in DuckDB or `json.loads` in Python.
- **Download counts are approximate.** npm's download stats API notes that counts may include automated traffic and are not deduplicated.
- **Scoped packages may have incomplete download data.** The bulk downloads API does not support scoped packages (`@scope/name`). These are fetched individually and may have been rate-limited.
- **Scores may be zero.** Packages that don't appear in the search API results will have zero values for score and dependents fields.

## Additional information

### Licensing

The dataset is released under the **Open Data Commons Attribution License (ODC-By) v1.0**. The original content is subject to the rights of its respective authors and package maintainers.

This is an independent community mirror. It is not affiliated with or endorsed by npm, Inc. or GitHub.

### Contact

For questions, feedback, or issues, please open a discussion on the [Community tab](https://huggingface.co/datasets/open-index/open-npm/discussions).

*Last updated: 2026-04-11 12:37 UTC*
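As noted under Known limitations, the `*_json` columns hold JSON encoded as strings. Here is a minimal sketch of decoding one in Python with `json.loads`. The helper name `parse_json_column` is hypothetical, the row is an illustrative stand-in (only the dist-tags string comes from the field table), and treating NULL or empty cells as `None` is an assumption about how missing values are stored.

```python
import json


def parse_json_column(value):
    """Decode a *_json VARCHAR cell (hypothetical helper).

    Assumption: missing values arrive as None or an empty string.
    """
    if value is None or value == "":
        return None
    return json.loads(value)


# Illustrative row; the dist-tags JSON is the example from the packages field table.
row = {
    "name": "example-package",
    "dist_tags_json": '{"latest":"5.2.1","next":"6.0.0-beta"}',
    "funding_json": None,
}

dist_tags = parse_json_column(row["dist_tags_json"])
funding = parse_json_column(row["funding_json"])

print(dist_tags["latest"])  # prints 5.2.1
print(funding)              # prints None
```

The same helper works on rows yielded by `load_dataset(..., streaming=True)`, since each row is a plain dictionary.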
Provider:
open-index