
The MAREC/IREC data set

Source: DataCite Commons. Updated 2024-08-27; indexed 2024-07-13.
Download link:
https://researchdata.tuwien.ac.at/records/2zx6e-5pr64
Description:
MAREC/IREC: The MAtrixware REsearch Collection / The Information retrieval facility Research Collection

MAREC/IREC is a static collection of over 19 million patent applications and granted patents in a unified file format normalized from EP, WO, US, and JP sources, spanning a range from 1976 to June 2008. MAREC/IREC is intended as raw material for research and evaluation in areas such as information retrieval, natural language processing or machine translation, which require large amounts of complex documents. It allows experiments with real data on a realistic scale.

The collection contains documents in several languages, the majority being English, German and French, and about half of the documents include full text. In MAREC/IREC, the documents from different countries and sources are normalized to a common XML format with a uniform patent numbering scheme and citation format. The standardized fields include dates, countries, languages, references, person names, and companies as well as rich subject classifications. It is a comparable corpus, where many documents are available in similar versions in other languages. The 19,386,697 XML files measure a total of 621 GB.

IREC - Information retrieval facility Research Collection

The MAREC original collection was missing parts of the European Granted Patents claim section (EP-B documents). An EPB_Bugfix folder existed to provide those files corrected. The IREC simply merges the original EPB folder with the EPB_Bugfix in order to provide a uniform representation. The CLEF-IP collections have never been affected by this issue, as they were specially curated.

License Information

MAREC by IRF is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. Permissions beyond the scope of this license may be available at mailto:marec@fandan.net.
Provider:
TU Wien
Created:
2021-11-30
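
For planning storage or batch processing, the two figures quoted in the description (19,386,697 XML files, 621 GB in total) imply a rough average file size. The sketch below is only a back-of-envelope calculation: the description does not say whether "GB" means 10^9 or 2^30 bytes, so both readings are computed, and the function and constant names are illustrative rather than part of the dataset.

```python
# Figures quoted in the dataset description.
FILE_COUNT = 19_386_697
TOTAL_SIZE_GB = 621


def average_file_size_bytes(total_gb, file_count, bytes_per_gb=10**9):
    """Return the mean file size in bytes for a collection of the given size.

    bytes_per_gb is an assumption: the description does not state whether
    "GB" is decimal (10**9) or binary (2**30).
    """
    return total_gb * bytes_per_gb / file_count


# Decimal reading: 1 GB = 10**9 bytes.
avg_decimal = average_file_size_bytes(TOTAL_SIZE_GB, FILE_COUNT)

# Binary reading: 1 GB = 2**30 bytes.
avg_binary = average_file_size_bytes(TOTAL_SIZE_GB, FILE_COUNT, bytes_per_gb=2**30)

print(f"Average file size (decimal GB): {avg_decimal / 1000:.1f} kB")
print(f"Average file size (binary GB):  {avg_binary / 1024:.1f} KiB")
```

Under either reading the average comes out to a few tens of kilobytes per document. Since only about half of the documents include full text, individual file sizes likely vary widely around that mean.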