BoostCLIR: JP-EN Relevance Marked Patent Corpus
Source: DataCite Commons. Updated 2025-01-28. Indexed 2025-04-17.
Download link:
https://heidata.uni-heidelberg.de/citation?persistentId=doi:10.11588/DATA/10001
Description:
BoostCLIR is a bilingual (Japanese-English) corpus of patent abstracts, extracted from the MAREC patent data (http://www.ifs.tuwien.ac.at/imp/marec.shtml) and from the NTCIR PatentMT workshop collections (http://research.nii.ac.jp/ntcir/data/data-en.html), accompanied by relevance judgements for the task of patent prior-art search.

Important: The English side of the corpus contains patent IDs as well as the text of the abstracts. The Japanese side contains only patent IDs because of NTCIR copyright restrictions. The Japanese patent abstracts can be extracted from the full-text Japanese patent documents, which are available from the organizers of the NTCIR workshop.

The corpus contains training, development and testing subsets sampled from non-intersecting time periods.

Relevance judgements for patent retrieval are constructed from patent citations by assigning three integer levels to three categories of relationships: the highest relevance (3) for family patents, intermediate relevance (2) for patents cited in search reports by patent examiners, and the lowest relevance (1) for applicants' citations.

For a detailed description of the corpus construction process, please see the above publication.
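The three-level relevance scheme described above can be illustrated with a minimal Python sketch. The category names, the patent IDs, the `build_judgements` helper, and the rule of keeping the highest level when a pair falls into several categories are all assumptions made for illustration; they are not taken from the corpus files or its documentation.

```python
from typing import Dict, Iterable, Tuple

# Relevance levels as stated in the description; the category
# names themselves are hypothetical labels chosen for this sketch.
RELEVANCE_LEVELS: Dict[str, int] = {
    "family": 3,              # family patents
    "examiner_citation": 2,   # cited in a search report by a patent examiner
    "applicant_citation": 1,  # cited by the applicant
}


def build_judgements(
    relations: Iterable[Tuple[str, str, str]],
) -> Dict[Tuple[str, str], int]:
    """Map (query_id, doc_id, category) triples to integer relevance levels."""
    judgements: Dict[Tuple[str, str], int] = {}
    for query_id, doc_id, category in relations:
        level = RELEVANCE_LEVELS[category]
        key = (query_id, doc_id)
        # Assumption: if a pair appears under several categories,
        # keep the highest level.
        judgements[key] = max(level, judgements.get(key, 0))
    return judgements


# Hypothetical patent IDs, for illustration only.
example_relations = [
    ("JP-0001", "EN-0001", "applicant_citation"),
    ("JP-0001", "EN-0001", "family"),
    ("JP-0001", "EN-0002", "examiner_citation"),
]
example_judgements = build_judgements(example_relations)
```

In this sketch a Japanese query patent is paired with English candidate documents, and each pair receives a single graded label suitable for graded-relevance evaluation.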
Provider:
heiDATA
Created:
2014-06-16



