five

egm517/hupd_augmented

收藏
Hugging Face2022-12-10 更新2024-03-04 收录
下载链接:
https://hf-mirror.com/datasets/egm517/hupd_augmented
下载链接
链接失效反馈
官方服务:
资源简介:
--- language: - en license: - cc-by-sa-4.0 task_categories: - fill-mask - summarization - text-classification - token-classification task_ids: - masked-language-modeling - multi-class-classification - topic-classification - named-entity-recognition pretty_name: "HUPD" tags: - patents --- # Dataset Card for The Harvard USPTO Patent Dataset (HUPD) ![HUPD-Diagram](https://huggingface.co/datasets/HUPD/hupd/resolve/main/HUPD-Logo.png) ## Dataset Description - **Homepage:** [https://patentdataset.org/](https://patentdataset.org/) - **Repository:** [HUPD GitHub repository](https://github.com/suzgunmirac/hupd) - **Paper:** [HUPD arXiv Submission](https://arxiv.org/abs/2207.04043) - **Point of Contact:** Mirac Suzgun ### Dataset Summary The Harvard USPTO Dataset (HUPD) is a large-scale, well-structured, and multi-purpose corpus of English-language utility patent applications filed to the United States Patent and Trademark Office (USPTO) between January 2004 and December 2018. ### Experiments and Tasks Considered in the Paper - **Patent Acceptance Prediction**: Given a section of a patent application (in particular, the abstract, claims, or description), predict whether the application will be accepted by the USPTO. - **Automated Subject (IPC/CPC) Classification**: Predict the primary IPC or CPC code of a patent application given (some subset of) the text of the application. - **Language Modeling**: Masked/autoregressive language modeling on the claims and description sections of patent applications. - **Abstractive Summarization**: Given the claims or claims section of a patent application, generate the abstract. ### Languages The dataset contains English text only. ### Domain Patents (intellectual property). ### Dataset Curators The dataset was created by Mirac Suzgun, Luke Melas-Kyriazi, Suproteem K. Sarkar, Scott Duke Kominers, and Stuart M. Shieber. ## Dataset Structure Each patent application is defined by a distinct JSON file, named after its application number, and includes information about the application and publication numbers, title, decision status, filing and publication dates, primary and secondary classification codes, inventor(s), examiner, attorney, abstract, claims, background, summary, and full description of the proposed invention, among other fields. There are also supplementary variables, such as the small-entity indicator (which denotes whether the applicant is considered to be a small entity by the USPTO) and the foreign-filing indicator (which denotes whether the application was originally filed in a foreign country). In total, there are 34 data fields for each application. A full list of data fields used in the dataset is listed in the next section. ### Data Instances Each patent application in our patent dataset is defined by a distinct JSON file (e.g., ``8914308.json``), named after its unique application number. The format of the JSON files is as follows: ```python { "application_number": "...", "publication_number": "...", "title": "...", "decision": "...", "date_produced": "...", "date_published": "...", "main_cpc_label": "...", "cpc_labels": ["...", "...", "..."], "main_ipcr_label": "...", "ipcr_labels": ["...", "...", "..."], "patent_number": "...", "filing_date": "...", "patent_issue_date": "...", "abandon_date": "...", "uspc_class": "...", "uspc_subclass": "...", "examiner_id": "...", "examiner_name_last": "...", "examiner_name_first": "...", "examiner_name_middle": "...", "inventor_list": [ { "inventor_name_last": "...", "inventor_name_first": "...", "inventor_city": "...", "inventor_state": "...", "inventor_country": "..." } ], "abstract": "...", "claims": "...", "background": "...", "summary": "...", "full_description": "..." } ``` ## Usage ### Loading the Dataset #### Sample (January 2016 Subset) The following command can be used to load the `sample` version of the dataset, which contains all the patent applications that were filed to the USPTO during the month of January in 2016. This small subset of the dataset can be used for debugging and exploration purposes. ```python from datasets import load_dataset dataset_dict = load_dataset('HUPD/hupd', name='sample', data_files="https://huggingface.co/datasets/HUPD/hupd/blob/main/hupd_metadata_2022-02-22.feather", icpr_label=None, train_filing_start_date='2016-01-01', train_filing_end_date='2016-01-21', val_filing_start_date='2016-01-22', val_filing_end_date='2016-01-31', ) ``` #### Full Dataset If you would like to use the **full** version of the dataset, please make sure that change the `name` field from `sample` to `all`, specify the training and validation start and end dates carefully, and set `force_extract` to be `True` (so that you would only untar the files that you are interested in and not squander your disk storage space). In the following example, for instance, we set the training set year range to be [2011, 2016] (inclusive) and the validation set year range to be 2017. ```python from datasets import load_dataset dataset_dict = load_dataset('HUPD/hupd', name='all', data_files="https://huggingface.co/datasets/HUPD/hupd/blob/main/hupd_metadata_2022-02-22.feather", icpr_label=None, force_extract=True, train_filing_start_date='2011-01-01', train_filing_end_date='2016-12-31', val_filing_start_date='2017-01-01', val_filing_end_date='2017-12-31', ) ``` ### Google Colab Notebook You can also use the following Google Colab notebooks to explore HUPD. - [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1_ZsI7WFTsEO0iu_0g3BLTkIkOUqPzCET?usp=sharing)[ HUPD Examples: Loading the Dataset](https://colab.research.google.com/drive/1_ZsI7WFTsEO0iu_0g3BLTkIkOUqPzCET?usp=sharing) - [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1TzDDCDt368cUErH86Zc_P2aw9bXaaZy1?usp=sharing)[ HUPD Examples: Loading HUPD By Using HuggingFace's Libraries](https://colab.research.google.com/drive/1TzDDCDt368cUErH86Zc_P2aw9bXaaZy1?usp=sharing) - [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1TzDDCDt368cUErH86Zc_P2aw9bXaaZy1?usp=sharing)[ HUPD Examples: Using the HUPD DistilRoBERTa Model](https://colab.research.google.com/drive/11t69BWcAVXndQxAOCpKaGkKkEYJSfydT?usp=sharing) - [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1TzDDCDt368cUErH86Zc_P2aw9bXaaZy1?usp=sharing)[ HUPD Examples: Using the HUPD T5-Small Summarization Model](https://colab.research.google.com/drive/1VkCtrRIryzev_ixDjmJcfJNK-q6Vx24y?usp=sharing) ## Dataset Creation ### Source Data HUPD synthesizes multiple data sources from the USPTO: While the full patent application texts were obtained from the USPTO Bulk Data Storage System (Patent Application Data/XML Versions 4.0, 4.1, 4.2, 4.3, 4.4 ICE, as well as Version 1.5) as XML files, the bibliographic filing metadata were obtained from the USPTO Patent Examination Research Dataset (in February, 2021). ### Annotations Beyond our patent decision label, for which construction details are provided in the paper, the dataset does not contain any human-written or computer-generated annotations beyond those produced by patent applicants or the USPTO. ### Data Shift A major feature of HUPD is its structure, which allows it to demonstrate the evolution of concepts over time. As we illustrate in the paper, the criteria for patent acceptance evolve over time at different rates, depending on category. We believe this is an important feature of the dataset, not only because of the social scientific questions it raises, but also because it facilitates research on models that can accommodate concept shift in a real-world setting. ### Personal and Sensitive Information The dataset contains information about the inventor(s) and examiner of each patent application. These details are, however, already in the public domain and available on the USPTO's Patent Application Information Retrieval (PAIR) system, as well as on Google Patents and PatentsView. ### Social Impact of the Dataset The authors of the dataset hope that HUPD will have a positive social impact on the ML/NLP and Econ/IP communities. They discuss these considerations in more detail in [the paper](https://arxiv.org/abs/2207.04043). ### Impact on Underserved Communities and Discussion of Biases The dataset contains patent applications in English, a language with heavy attention from the NLP community. However, innovation is spread across many languages, cultures, and communities that are not reflected in this dataset. HUPD is thus not representative of all kinds of innovation. Furthermore, patent applications require a fixed cost to draft and file and are not accessible to everyone. One goal of this dataset is to spur research that reduces the cost of drafting applications, potentially allowing for more people to seek intellectual property protection for their innovations. ### Discussion of Biases Section 4 of [the HUPD paper](https://arxiv.org/abs/2207.04043) provides an examination of the dataset for potential biases. It shows, among other things, that female inventors are notably underrepresented in the U.S. patenting system, that small and micro entities (e.g., independent inventors, small companies, non-profit organizations) are less likely to have positive outcomes in patent obtaining than large entities (e.g., companies with more than 500 employees), and that patent filing and acceptance rates are not uniformly distributed across the US. Our empirical findings suggest that any study focusing on the acceptance prediction task, especially if it is using the inventor information or the small-entity indicator as part of the input, should be aware of the the potential biases present in the dataset and interpret their results carefully in light of those biases. - Please refer to Section 4 and Section D for an in-depth discussion of potential biases embedded in the dataset. ### Licensing Information HUPD is released under the CreativeCommons Attribution-NonCommercial-ShareAlike 4.0 International. ### Citation Information ``` @article{suzgun2022hupd, title={The Harvard USPTO Patent Dataset: A Large-Scale, Well-Structured, and Multi-Purpose Corpus of Patent Applications}, author={Suzgun, Mirac and Melas-Kyriazi, Luke and Sarkar, Suproteem K. and Kominers, Scott Duke and Shieber, Stuart M.}, year={2022}, publisher={arXiv preprint arXiv:2207.04043}, url={https://arxiv.org/abs/2207.04043}, ```
提供机构:
egm517
原始信息汇总

数据集概述

数据集名称

  • 名称:The Harvard USPTO Patent Dataset (HUPD)
  • 别名:HUPD

数据集描述

  • 摘要:HUPD是一个大规模、结构良好的多用途英语专利申请语料库,涵盖了2004年1月至2018年12月期间向美国专利商标局(USPTO)提交的实用新型专利申请。
  • 领域:专利(知识产权)
  • 语言:仅包含英语文本

数据集内容

  • 数据结构:每个专利申请由一个独立的JSON文件定义,文件名基于其申请号,包含34个数据字段,如申请和出版号、标题、决策状态、提交和出版日期、主要和次要分类代码、发明者、审查员、律师、摘要、声明、背景、摘要和完整发明描述等。

数据集任务

  • 专利接受预测:根据专利申请的摘要、声明或描述部分,预测申请是否会被USPTO接受。
  • 自动主题(IPC/CPC)分类:根据申请文本预测专利申请的主要IPC或CPC代码。
  • 语言建模:在专利申请的声明和描述部分进行掩码/自回归语言建模。
  • 摘要生成:根据专利申请的声明或声明部分生成摘要。

数据集创建

  • 源数据:数据来自USPTO的多个来源,包括全文本从USPTO批量数据存储系统获取,而文献提交元数据从USPTO专利审查研究数据集获取。
  • 注释:除了专利决策标签外,数据集不包含任何人工编写或计算机生成的注释,这些注释由专利申请人或USPTO产生。

数据集使用

  • 加载数据集:数据集可以通过指定不同的参数(如样本版本或全量版本)来加载,用于调试和探索目的或进行全面的研究。

数据集影响

  • 社会影响:数据集旨在对ML/NLP和Econ/IP社区产生积极的社会影响,详细讨论见论文。
  • 偏见讨论:数据集包含对潜在偏见的讨论,包括女性发明者的代表性不足、小型和微型实体在专利获取中的不利地位,以及专利提交和接受率在美国的不均匀分布。

许可证信息

  • 许可证:CreativeCommons Attribution-NonCommercial-ShareAlike 4.0 International
搜集汇总
数据集介绍
main_image_url
构建方式
在知识产权领域,专利文献的数字化与结构化处理对于推动自然语言处理研究至关重要。哈佛大学美国专利商标局数据集(HUPD)的构建,源于对2004年至2018年间提交至美国专利商标局的英文实用专利申请的系统性整合。该数据集通过融合多个权威数据源实现:专利全文文本源自美国专利商标局批量数据存储系统,以XML格式获取;而申请元数据则提取自专利审查研究数据集。每个专利申请均被编码为独立的JSON文件,涵盖申请号、标题、审查决定、分类代码、发明人、摘要、权利要求及详细描述等34个字段,形成了一套高度结构化、机器可读的语料库。
特点
该数据集在专利文本分析领域展现出多重显著特征。其规模宏大,时间跨度长达十五年,为研究技术演进与概念漂移提供了丰富时序数据。结构上,每个样本以标准化JSON格式封装,确保了字段的一致性与可扩展性。内容层面,不仅包含专利文本的核心组成部分如摘要与权利要求,还整合了审查结果、分类标签及实体信息,支持从语言建模到分类预测的多元任务。尤为重要的是,数据集揭示了专利体系中的社会性偏差,如女性发明者代表性不足、小型实体获准率较低等,促使研究者在模型开发中关注公平性与代表性。
使用方法
为便利学术探索与模型开发,该数据集提供了灵活的加载方式。研究者可通过Hugging Face的datasets库,使用load_dataset函数进行调用。数据集提供样本版与完整版两种配置:样本版聚焦2016年1月的申请,适用于初步调试与探索;完整版则覆盖全部年份,用户可通过设定训练集与验证集的申请日期范围来提取特定时段数据,并结合force_extract参数优化存储效率。此外,项目附带的Google Colab笔记本进一步演示了数据加载、预训练模型应用及摘要生成等典型工作流程,为跨任务的研究实践提供了即用型范例。
背景与挑战
背景概述
在知识产权与自然语言处理交叉领域,专利文本的自动化分析长期面临数据稀缺与结构混乱的挑战。哈佛大学USPTO专利数据集(HUPD)由Mirac Suzgun、Luke Melas-Kyriazi等学者于2022年构建,旨在提供一个大规模、结构化的英文专利语料库,涵盖2004年至2018年间美国专利商标局受理的专利申请。该数据集的核心研究问题聚焦于专利接受预测、自动化主题分类及专利文本的语言建模,为机器学习与经济学研究提供了关键基础设施,显著推动了智能专利审查与创新趋势分析领域的发展。
当前挑战
该数据集致力于解决专利文本多任务处理的复杂挑战,包括专利接受预测中的类别不平衡与时间偏移问题,以及自动化分类中细粒度技术领域的精确划分难题。在构建过程中,研究者需整合多版本USPTO异构数据源,确保数万份专利申请文本的结构化解析与字段对齐,同时应对专利法律术语的领域特异性与文本长度变异带来的处理困难。此外,数据集中隐含的性别、实体规模与地域分布偏差,亦对模型的公平性与泛化能力提出了严峻考验。
常用场景
经典使用场景
在知识产权与自然语言处理交叉领域,哈佛USPTO专利数据集(HUPD)为研究者提供了大规模、结构化的专利文本资源。其经典使用场景集中于专利接受预测任务,通过分析专利申请的摘要、权利要求或描述部分,构建机器学习模型以预测美国专利商标局(USPTO)的审批结果。这一场景不仅深化了对专利审查机制的理解,还为自动化决策系统提供了实证基础,推动了智能知识产权分析的发展。
解决学术问题
该数据集有效解决了专利文本分析中的若干核心学术问题。在分类任务中,它支持自动化主题分类(IPC/CPC代码预测),帮助研究者探索技术领域的语义结构;在语言建模方面,通过掩码或自回归模型对权利要求和描述进行训练,提升了专业领域文本的表示能力。此外,数据集的时间跨度揭示了概念漂移现象,为动态环境下的模型适应性研究提供了独特视角,促进了跨学科的知识发现。
衍生相关工作
围绕HUPD衍生的经典工作丰富多样。原始论文中提出的DistilRoBERTa模型和T5-small摘要模型,已成为专利文本处理的基准工具。后续研究进一步拓展至偏见分析领域,探讨女性发明者代表性不足等社会议题;亦有工作利用时间序列特性,开发适应概念漂移的预测算法。这些成果不仅巩固了数据集在学术界的地位,还激发了经济学、法学与人工智能的跨学科对话。
以上内容由遇见数据集搜集并总结生成
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作