A Multi-Scale U.S. Patent Dataset for Technological Innovation Systems (2000–2020)

Name: A Multi-Scale U.S. Patent Dataset for Technological Innovation Systems (2000–2020)
Creator: Science Data Bank
Published: 2026-03-16 02:04:02
License: No description available

DataCite Commons. Updated 2026-03-16; indexed 2026-05-05.

Download link:

https://www.scidb.cn/detail?dataSetId=5086d2b576b64783b4e0d56f7eee32e8

Description:

This dataset is built from granted utility patent data from the United States Patent and Trademark Office (USPTO), as made available by PatentsView. It is designed to support cross-scale analysis of technological innovation systems. The data are drawn from multiple PatentsView tables, including basic patent information, CPC classification codes, assignee information and geographical locations, and patent citation relationships. A reproducible screening and processing pipeline was applied: only U.S. granted utility patents with application years between 2000 and 2020 and with at least one assignee located in one of the 50 U.S. states were retained. This yielded a final set of 1,225,373 valid patents.

Technologically, the dataset uses the first four characters (subclass level) of the CPC classification codes as the granularity for technology demarcation. Only "invention information" category codes are retained, to represent the core technological content of each patent. After processing, the dataset covers 624 distinct technology categories.

Spatially, assignee address information is standardized and mapped to three levels: state, county, and city (recording the corresponding FIPS codes or city names).

For the firm dimension, publicly available patent-CRSP firm matching data are incorporated to map assignees to unique firm identifiers (permco). Together, these steps establish a unified "patent-entity-technology" association structure across four entity scales: state, county, city, and firm.

For patents involving multiple technologies, a weighted allocation method based on co-citation information distributes a patent's contribution across its technologies. Contributions are normalized within each patent to obtain the share corresponding to each technology, which allows technological innovation intensity to be quantified more reasonably.
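The within-patent normalization step can be illustrated with a minimal sketch. It assumes each (patent, technology) pair already carries a raw, non-negative weight; the `weight` column, the toy values, and the `normalize_shares` helper are hypothetical and only stand in for the co-citation weights, whose calculation is not reproduced here.

```python
import pandas as pd


def normalize_shares(weights: pd.DataFrame) -> pd.DataFrame:
    """Rescale raw per-technology weights so they sum to 1 within each patent."""
    out = weights.copy()
    # Sum of raw weights per patent, broadcast back to every row of that patent.
    totals = out.groupby("patnum")["weight"].transform("sum")
    out["share"] = out["weight"] / totals
    return out


# Made-up toy weights (hypothetical; not taken from the dataset).
toy = pd.DataFrame(
    {
        "patnum": ["A", "A", "B"],
        "code": ["G06F", "H04L", "A61K"],
        "weight": [3.0, 1.0, 2.0],
    }
)
result = normalize_shares(toy)
```

A patent with a single technology always receives a share of 1, whatever its raw weight.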
The detailed calculation method is described in the associated data paper.

The dataset is stored in CSV format (filename: A multi-scale patent dataset for technological innovation systems.csv), using a comma (",") as the field delimiter. Each record corresponds to one patent, for a total of 1,225,373 rows. The fields are as follows:

fyear: Filing year of the patent.
patnum: Unique patent identifier in PatentsView.
code: List of technologies associated with the patent, represented by 4-character CPC subclass codes (only invention information categories are included). If multiple technologies are involved, codes are concatenated using a semicolon (";") separator.
share: The patent's contribution share for each technology listed in the code field, in the same order. The shares of each patent sum to 1 (for multi-technology patents, shares are calculated using the co-citation weighted method; see the associated data paper for details).
state: FIPS code of the state where the assignee is located (2-digit code).
county: FIPS code of the county where the assignee is located (5-digit code: the first 2 digits are the state code, the last 3 digits are the county code).
city: Name of the city where the assignee is located (prefixed with the two-digit state code to disambiguate cities with the same name).
title: Text of the patent title.
abstract: Text of the patent abstract.

For patents corresponding to multiple entities (e.g., in the state/county/city fields), the entity values are likewise concatenated using a semicolon (";") separator.

The city names in the city field are validated and filtered using the Python package geonamescache, retaining only records that can be matched to the U.S. city gazetteer; the state code is prepended to each city name to disambiguate cities with identical names.

The public version of this dataset does not directly include the firm-scale entity identifier field.
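The code and share fields described above can be unpacked into one row per patent-technology pair. The sketch below uses made-up rows limited to the columns it needs. It assumes that share uses the same ";" separator as code, and it reads every column as text so that FIPS codes keep their leading zeros. The `to_long` helper is hypothetical.

```python
import io

import pandas as pd

# Made-up sample rows (hypothetical), restricted to the columns used here.
SAMPLE = io.StringIO(
    "fyear,patnum,code,share,state\n"
    "2005,1000001,G06F;H04L,0.75;0.25,06\n"
    "2005,1000002,G06F,1,06;48\n"
    "2006,1000003,A61K,1,36\n"
)


def to_long(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per (patent, technology) with a numeric share."""
    out = df.copy()
    out["code"] = out["code"].str.split(";")
    out["share"] = out["share"].str.split(";")
    # Explode both list columns together so codes stay aligned with shares.
    out = out.explode(["code", "share"], ignore_index=True)
    out["share"] = out["share"].astype(float)
    return out


# dtype=str keeps FIPS codes such as "06" from being parsed as the integer 6.
patents = pd.read_csv(SAMPLE, dtype=str)
long = to_long(patents)

# Fractional patent counts per filing year and technology.
intensity = long.groupby(["fyear", "code"])["share"].sum()
```

Multi-valued state, county, and city fields can be split the same way; how a patent's share should be divided among several entities is a modelling choice that this sketch leaves open.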
Users wishing to construct firm-scale data can download the third-party patent-CRSP firm match table patnum_permco_1976_2024.tsv (download link: https://github.com/mwoeppel/patent-crsp-permco-match). This table provides the correspondence between patents and permco, where permco is the unique permanent identifier for firms in the CRSP database. Users can merge the permco information from this table into the present dataset via the patnum field; a new firm field is then added to record the assignee firm's unique identifier. Tests indicate that approximately 97% of patents can be successfully matched to firms; unmatched patents will have a blank firm field.

The code for performing the firm data merge is provided in the script Merge_firm_data.py. Before running it, place both this dataset and the downloaded firm match table in the same directory, and set that directory path in the script (file_path). After merging, if a patent corresponds to multiple firms, their permco identifiers are concatenated in the firm field using a semicolon (";") as the separator.
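The merge logic described above could look roughly like the following sketch. It is not the Merge_firm_data.py script. It assumes the match table exposes columns named patnum and permco (not confirmed here), and it uses made-up in-memory rows in place of the real files. The `add_firm_column` helper is hypothetical.

```python
import pandas as pd


def add_firm_column(patents: pd.DataFrame, match: pd.DataFrame) -> pd.DataFrame:
    """Left-join permco values onto patents as a ';'-joined firm column."""
    firms = (
        match.dropna(subset=["permco"])
        .groupby("patnum")["permco"]
        # dict.fromkeys removes duplicate permco values while keeping order.
        .agg(lambda s: ";".join(dict.fromkeys(s)))
        .rename("firm")
        .reset_index()
    )
    merged = patents.merge(firms, on="patnum", how="left")
    # Unmatched patents get a blank firm field.
    merged["firm"] = merged["firm"].fillna("")
    return merged


# Made-up rows (hypothetical). With the real files, both tables could be read
# as text, e.g. pd.read_csv(path, sep="\t", dtype=str) for the TSV match table.
patents = pd.DataFrame({"patnum": ["1000001", "1000002", "1000003"]})
match = pd.DataFrame(
    {
        "patnum": ["1000001", "1000001", "1000002"],
        "permco": ["111", "222", "333"],
    }
)
merged = add_firm_column(patents, match)
```

Reading patnum as text in both tables avoids silent mismatches when one file stores the identifier as a number and the other as a string.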

Provider:

Science Data Bank

Created:

2026-03-16
