allenai/us-patents
Source: Hugging Face. Updated: 2025-12-18. Indexed: 2025-12-20.
Download link:
https://hf-mirror.com/datasets/allenai/us-patents
Description:
The us-patents dataset is a collection of approximately 8 million US patent grants and applications from 1976 to 2025, cleaned, filtered, and formatted for pre-training language models. Each record includes a unique identifier, a filing date, a patent type, and text content concatenated from the title, abstract, and specification. Processing steps include deduplication and the removal of legal process content, non-English content, unlikely prose, low-information text, copyright assertions, and design patents. The dataset is divided into a training set of 7,820,247 documents and a validation set of 79,836 documents. It is suitable for research in fields including biology, chemistry, engineering, computer science, materials science, economics, and business.
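The record layout and split sizes described above can be sketched in Python. This is a minimal illustration, not the dataset's actual schema: the field names (`id`, `filing_date`, `patent_type`, `text`), the blank-line separator used when joining title, abstract, and specification, and the sample values are all assumptions. Only the two document counts come from the description.

```python
from dataclasses import dataclass

# Split sizes as stated in the dataset description.
TRAIN_DOCS = 7_820_247
VALIDATION_DOCS = 79_836


@dataclass
class PatentRecord:
    """Hypothetical record layout; the real field names may differ."""
    id: str
    filing_date: str
    patent_type: str
    text: str


def build_text(title: str, abstract: str, specification: str) -> str:
    """Join title, abstract, and specification into one text field.

    Empty parts are skipped. The blank-line separator is an assumption,
    not the dataset's documented format.
    """
    parts = [p.strip() for p in (title, abstract, specification)]
    return "\n\n".join(p for p in parts if p)


def validation_fraction(train: int = TRAIN_DOCS,
                        validation: int = VALIDATION_DOCS) -> float:
    """Share of all documents that fall in the validation split."""
    return validation / (train + validation)


total_docs = TRAIN_DOCS + VALIDATION_DOCS

# Made-up sample values, for illustration only.
record = PatentRecord(
    id="example-0001",
    filing_date="1999-01-01",
    patent_type="utility",
    text=build_text("A widget", "A widget is disclosed.", "The widget comprises..."),
)

print(f"total documents: {total_docs:,}")
print(f"validation share: {validation_fraction():.2%}")
print(record.text.splitlines()[0])
```

The stated counts sum to 7,900,083 documents, with the validation split making up roughly 1% of the total. To work with the real data, the Hugging Face `datasets` library's `load_dataset("allenai/us-patents")` is the likely entry point, though the actual column names should be checked against the dataset card before relying on the layout above.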
Provider:
allenai



