CinPatent-EN, CinPatent-JA
Source: arXiv. Updated: 2024-03-16. Indexed: 2024-06-21.
Download link:
https://github.com/Cinnamon/CinPatent
Description:
This study introduces two patent classification datasets, CinPatent-EN and CinPatent-JA, containing 45,131 English patents and 54,657 Japanese patents respectively. The patents were collected from the Google Patents database using CPC codes, and the datasets are intended to address two problems in patent classification: limited access to data and the difficulty of comparing model performance. Each record includes the patent's title, abstract, description, and claims. Patents carry 1 to 3 labels on average, and the number of labels is balanced. The datasets are suited to evaluating multi-label text classification models and are particularly valuable for low-resource language processing.
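Since the datasets target multi-label classification, a model's predictions for each patent form a set of labels rather than a single class. The sketch below shows one common way to score such predictions, micro-averaged F1, using only the Python standard library. It is an illustration under assumptions: the source does not state which metric the authors report, the function name `micro_f1` is hypothetical, and the label strings are placeholder CPC-style codes, not values taken from the datasets.

```python
from typing import List, Set


def micro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Micro-averaged F1 over all (patent, label) pairs.

    gold[i] and pred[i] are the true and predicted label sets
    for the i-th patent.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    # Count true positives, false positives, and false negatives
    # across every patent before computing the score.
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Placeholder label sets (assumed for illustration only).
gold = [{"G06F"}, {"H04L", "G06N"}, {"A61K"}]
pred = [{"G06F"}, {"H04L"}, {"A61K", "C07D"}]
score = micro_f1(gold, pred)
print(f"micro-F1: {score:.2f}")
```

Micro-averaging pools the counts over all patents, so frequent labels weigh more than rare ones; macro-averaging per label would be the alternative when every label should count equally.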
Provider:
Cinnamon AI
Created:
2022-12-23



