swedish-patent-cpc-group
Dataset Overview
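Basic Information
- Dataset name: Historical Swedish Patent CPC Group Classification
- License: MIT
- Language: Swedish (sv)
- Task type: multi-label text classification
- Tags: patent, CPC, classification, legal, multi-label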
Dataset Size
- Total samples: 27,339 documents
- Training samples: 27,339 (single split)
- Download size: 42,432,639 bytes
- Dataset size: 87,590,812 bytes
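Data Features
Field Descriptions
- text: patent claim text (Swedish), extracted from OCR-processed historical documents
- label: list of CPC group classification codes (multi-label format)
- dokID: unique document identifier
- dokNr: patent document number
- dokKod: document code classification
- full_cpc_codes: complete list of all CPC codes assigned to the document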
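Because label is a list of codes rather than a single class, most classifiers need it converted to a fixed-length multi-hot vector. The following is a minimal sketch of that conversion in plain Python. The rows and CPC-like codes are made-up placeholders, not actual dataset records, and the helper names build_label_index and to_multi_hot are hypothetical.

```python
def build_label_index(label_lists):
    """Map every distinct label to a stable column index (sorted for reproducibility)."""
    vocab = sorted({lab for labels in label_lists for lab in labels})
    return {lab: i for i, lab in enumerate(vocab)}


def to_multi_hot(labels, label_index):
    """Return a 0/1 vector with a 1 at the column of each label in `labels`."""
    vector = [0] * len(label_index)
    for lab in labels:
        vector[label_index[lab]] = 1
    return vector


# Placeholder rows that mimic the text/label fields; the codes are illustrative only.
toy_rows = [
    {"text": "...", "label": ["A01B1/00"]},
    {"text": "...", "label": ["A01B1/00", "C07D2/00"]},
    {"text": "...", "label": ["B60K6/00"]},
]

label_index = build_label_index(row["label"] for row in toy_rows)
vectors = [to_multi_hot(row["label"], label_index) for row in toy_rows]
```

On the real dataset the same index would have one column per unique label, so each vector would be long and mostly zeros.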
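Label Distribution
- Unique labels: 1,781 CPC group labels
- Single-label documents: 16,110 (58.9%)
- Multi-label documents: 11,229 (41.1%)
- Average labels per document: 1.65
- Maximum labels per document: 15
- Average documents per label: 25.37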
Label Count Distribution
- 1 label: 16,111 documents (58.9%)
- 2 labels: 7,083 documents (25.9%)
- 3 labels: 2,658 documents (9.7%)
- 4 labels: 935 documents (3.4%)
- 5 or more labels: 552 documents (2.0%)
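A distribution like the one above can be recomputed from the label column. The sketch below is one way to do it, assuming the labels are available as a list of lists. The toy input and the function name label_count_distribution are illustrative, not part of the dataset or its tooling.

```python
from collections import Counter


def label_count_distribution(label_lists, cap=5):
    """Count documents by number of labels, pooling counts >= cap into one 'cap+' bucket."""
    label_lists = list(label_lists)
    counts = Counter(min(len(labels), cap) for labels in label_lists)
    total = len(label_lists)
    return {
        (f"{n}+" if n == cap else str(n)): (counts[n], round(100 * counts[n] / total, 1))
        for n in sorted(counts)
    }


# Made-up label lists, for illustration only.
toy_labels = [
    ["A01B1/00"],
    ["A01B1/00", "B60K6/00"],
    ["C07D2/00"],
    ["A01B1/00", "B60K6/00", "C07D2/00"],
]

distribution = label_count_distribution(toy_labels)
mean_labels = sum(len(labels) for labels in toy_labels) / len(toy_labels)
```

Each entry maps a label count to a pair of (number of documents, percentage of all documents).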
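Data Source and Processing
Data Source
- Historical Swedish patent documents (1885-1972)
- Text was produced from the original paper documents by optical character recognition (OCR)
- All CPC labels were assigned manually by patent engineers at PRV (the Swedish Patent and Registration Office)
Preprocessing
- Patent claims were extracted automatically with spaCy's "sv_core_news_sm" pipeline
- CPC labels belonging to indexing codes and to section Y were removed
- Labels associated with fewer than 10 documents were filtered out
- Labels are at the CPC group level, covering both main groups and subgroups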
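The label-filtering steps above can be sketched as follows. This is not the author's pipeline. It assumes that section Y labels can be recognised by a leading "Y", that the frequency threshold is applied after the section Y removal, and that documents left without any label are dropped. None of these details is stated in the source. Removal of indexing codes is not reproduced here, and the function name filter_labels is hypothetical.

```python
from collections import Counter


def filter_labels(label_lists, min_docs=10):
    """Drop section-Y labels, then labels attached to fewer than `min_docs` documents."""
    # Step 1: remove labels in CPC section Y (assumed to start with "Y").
    no_y = [[lab for lab in labels if not lab.startswith("Y")] for labels in label_lists]
    # Step 2: count the number of documents in which each remaining label occurs.
    doc_freq = Counter(lab for labels in no_y for lab in set(labels))
    # Step 3: keep only labels that meet the document-frequency threshold.
    kept = [[lab for lab in labels if doc_freq[lab] >= min_docs] for labels in no_y]
    # Assumption: documents that end up with no label are discarded.
    return [labels for labels in kept if labels]


# Made-up codes; a threshold of 2 is used so the toy example shows the effect.
toy = [["A01B1/00", "Y02E10/00"], ["A01B1/00"], ["B60K6/00"]]
filtered = filter_labels(toy, min_docs=2)
```

With the dataset's stated threshold, min_docs would stay at its default of 10.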
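Classification Challenges
- Extreme class imbalance
- Data sparsity
- Need for fine-grained classification
- Multi-label complexity
- Variable quality of the historical text
- Large label vocabulary (1,781 possible labels)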
Recommended Evaluation Metrics
- Precision@k, Recall@k, F1@k (k = 1, 3, 5)
- nDCG@k (k = 1, 3, 5)
- Micro-F1
- Macro-F1
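For reference, the ranking metrics in this list can be computed per document as in the sketch below and then averaged over documents. It assumes binary relevance, a ranked list of predicted labels, and the common convention of dividing Precision@k by k. The source does not specify which variant was used, so treat these definitions as assumptions.

```python
import math


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k predicted labels that are true labels."""
    hits = sum(1 for lab in ranked[:k] if lab in relevant)
    return hits / k


def recall_at_k(ranked, relevant, k):
    """Fraction of the true labels that appear among the top-k predictions."""
    hits = sum(1 for lab in ranked[:k] if lab in relevant)
    return hits / len(relevant) if relevant else 0.0


def f1_at_k(ranked, relevant, k):
    """Harmonic mean of Precision@k and Recall@k."""
    p = precision_at_k(ranked, relevant, k)
    r = recall_at_k(ranked, relevant, k)
    return 2 * p * r / (p + r) if p + r else 0.0


def ndcg_at_k(ranked, relevant, k):
    """DCG of the top-k predictions divided by the DCG of an ideal ranking."""
    dcg = sum(1.0 / math.log2(i + 2) for i, lab in enumerate(ranked[:k]) if lab in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0


# Toy example: five ranked predictions, two true labels (placeholder names).
ranked = ["A", "B", "C", "D", "E"]
relevant = {"A", "C"}
scores = {k: (precision_at_k(ranked, relevant, k), ndcg_at_k(ranked, relevant, k)) for k in (1, 3, 5)}
```

Micro-F1 and Macro-F1 operate on thresholded predictions rather than rankings, so they are not covered by this sketch.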
Usage
```python
from datasets import load_dataset

dataset = load_dataset("atheer2104/swedish-patent-cpc-group")
```
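Related Research
The dataset was created as part of research on classifying historical Swedish patents, which examines whether adding legacy classification information (DPK) improves fine-grained CPC classification performance.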
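Limitations
- Severe class imbalance
- Variable OCR quality
- Limited historical scope (Swedish patents from 1885-1972)
- Language specificity (Swedish)
- Filtering effects (threshold of at least 10 documents per label)
- Complexity of fine-grained classification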
Citation
```bibtex
@mastersthesis{Salim2025PatentClassification,
  author = {Salim, Atheer},
  institution = {KTH, School of Electrical Engineering and Computer Science (EECS)},
  pages = {70},
  school = {KTH, School of Electrical Engineering and Computer Science (EECS)},
  title = {Machine Learning for Classifying Historical Swedish Patents: A Comparison of Textual and Combined Data Approaches},
  series = {TRITA-EECS-EX},
  number = {2025:571},
  keywords = {Multi-label Text Classification, Machine Learning, Patent Classification, Deep Learning, Natural Language Processing, Textklassificering med flera Klasser, Maskininlärning, Patentklassificering, Djupinlärning, Språkteknologi},
  abstract = {Patents are essential for protecting intellectual property and advancing innovation, but the accessibility of historical patents is often limited by outdated classification systems. The thesis investigates whether incorporating DPK information alongside patent claims improves the performance of machine learning models in classifying historical Swedish patents into the CPC system. While group-level predictions were more challenging due to label imbalance and fewer samples per label, the research provides valuable insights into fine-grained patent classification challenges.},
  year = {2025},
  url = "https://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-368254"
}
```




