
rhahn/patentbench

Hugging Face · Updated 2026-04-16 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/rhahn/patentbench
Resource description:
---
license: apache-2.0
task_categories:
- text-generation
- question-answering
language:
- en
tags:
- legal
- patent
- benchmark
- prosecution
- evaluation
size_categories:
- 1K<n<10K
configs:
- config_name: full
  data_files:
  - split: train
    path: data/full/all_cases.jsonl
- config_name: mini
  data_files:
  - split: train
    path: data/mini/tier_1_2_cases.jsonl
---

# PatentBench

**The First Reproducible Benchmark for Patent Prosecution AI**

## Overview

PatentBench evaluates AI systems on real patent prosecution tasks, from parsing USPTO Office Actions to drafting legally sound arguments under 35 U.S.C. sections 101, 102, 103, and 112.

Every test case derives from actual USPTO proceedings. Tasks map to billable activities at patent law firms.

## Dataset Structure

### Splits

| Split | Cases | Purpose |
|-------|-------|---------|
| `full` | 7,200 | Complete evaluation across all tiers and domains |
| `mini` | 300 | Stratified sample for rapid iteration |

### Schema

| Field | Type | Description |
|-------|------|-------------|
| `id` | string | Unique case identifier |
| `domain` | string | `administration`, `prosecution`, `drafting`, or `analytics` |
| `tier` | int | Difficulty 1-5 (paralegal to senior partner) |
| `task_type` | string | e.g. `deadline_calculation`, `103_argument`, `fee_computation` |
| `prompt` | string | The task prompt given to the model |
| `reference_answer` | string | Ground truth (JSON string for structured answers) |
| `evaluation_layers` | list[str] | Which evaluation layers apply |
| `metadata` | dict | Application number, technology center, etc. |

### Task Types (7,200 total)

| Task Type | Domain | Count |
|-----------|--------|-------|
| `fee_computation` | administration | 2,050 |
| `deadline_calculation` | administration | 2,049 |
| `action_classification` | administration | 954 |
| `examiner_extraction` | prosecution | 418 |
| `prosecution_history_parsing` | prosecution | 368 |
| `timeline_analysis` | administration | 347 |
| `prosecution_strategy` | prosecution | 346 |
| `technology_center_classification` | prosecution | 321 |
| `filing_date_extraction` | administration | 321 |
| `103_argument` | prosecution | 12 |
| `102_argument` | prosecution | 5 |
| `101_argument` | prosecution | 4 |
| `112_argument` | prosecution | 3 |
| `oa_parsing` | prosecution | 2 |

### Difficulty Distribution

| Tier | Level | Count |
|------|-------|-------|
| 1 | Paralegal | 6,015 |
| 2 | Junior Associate | 1,080 |
| 3 | Senior Associate | 105 |

## Data Sources

All cases are derived from real USPTO data:

- **321 USPTO applications** from Patent Examination Data System (PEDS)
- **1,103 prosecution events** (Office Actions, allowances, etc.)
- **437 Office Actions** (311 Non-Final, 126 Final) across these applications

Test cases include generated variants covering all combinations of:

- Entity status (micro, small, large)
- Extension duration (1, 2, 3 months)
- Fee type (filing, search, examination)

## Usage

### With the `datasets` library

```python
from datasets import load_dataset

ds_full = load_dataset("rhahn/patentbench", "full", split="train")
ds_mini = load_dataset("rhahn/patentbench", "mini", split="train")

# Filter by task type
deadlines = ds_full.filter(lambda x: x["task_type"] == "deadline_calculation")
```

### With the `patentbench` Python package

```bash
pip install patentbench
patentbench --model openai:gpt-4o --subset mini
```

```python
from patentbench import DataLoader, BenchmarkRunner

loader = DataLoader("data/mini")
cases = loader.load_all()
```

## Evaluation

PatentBench uses a 4-layer evaluation framework:

1. **Deterministic**. Binary correctness for objective tasks (deadlines, fees)
2. **LLM-as-Judge**. Calibrated rubric-based scoring (legal accuracy, argument strength)
3. **Comparative**. Blind side-by-side ranking
4. **Human Calibration**. Expert attorney scores

## Links

- **GitHub:** https://github.com/rhahn28/patentbench
- **PyPI:** https://pypi.org/project/patentbench/
- **Leaderboard:** https://abigail.app/patentbench

## License

Apache 2.0
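
As a rough illustration of the deterministic layer described under Evaluation, the sketch below scores a model output against a `reference_answer` by exact match after light normalization. It is not taken from the `patentbench` package: the helper names `normalize` and `deterministic_score`, the `fee_usd` key, and the sample values are all hypothetical, and the real scorer may normalize answers differently.

```python
import json


def normalize(value):
    """Decode a JSON string if possible; otherwise trim and lowercase it.

    Hypothetical helper: the schema says structured reference answers are
    stored as JSON strings, so both sides are decoded before comparison.
    """
    if isinstance(value, str):
        try:
            return json.loads(value)
        except json.JSONDecodeError:
            return value.strip().lower()
    return value


def deterministic_score(prediction, reference_answer):
    """Return 1 for an exact match after normalization, else 0."""
    return int(normalize(prediction) == normalize(reference_answer))


# Hypothetical structured answer; the key and amount are made up.
reference = '{"fee_usd": 128}'
print(deterministic_score(' {"fee_usd":128} ', reference))  # whitespace differs
print(deterministic_score('{"fee_usd": 320}', reference))   # wrong amount

# Hypothetical plain-text answer, such as a deadline date.
print(deterministic_score("2024-03-15 ", "2024-03-15"))
```

Comparing decoded JSON rather than raw strings keeps key order and whitespace from affecting the score, which suits a binary check on fees and deadlines.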

Provider:
rhahn