Timbrt/SciOL-text
Scientific Openly-Licensed Publications Dataset (SciOL)
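Overview
Scientific Openly-Licensed Publications (SciOL) is the largest openly licensed corpus for pre-training multimodal models in the scientific domain, covering multiple disciplines including materials science, physics, and computer science. The dataset comprises more than 2.7 million scientific publications converted into semi-structured data, and contains more than 14 billion tokens of extracted and structured text.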
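Data Format
The dataset's annotations are provided in JSON format, and the files are compressed in groups into zip archives. A basic index is provided so that annotations can be looked up by DOI, PMID, or DOAJ ID, as well as by keyword.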
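As a rough illustration of working with grouped zip archives of JSON annotations, the following Python sketch iterates over the JSON members of one archive and builds a simple DOI lookup. The layout of the shipped index is not described here, so this builds its own mapping rather than reading the provided one. The member names, the one-record-per-file layout, and the helper names (iter_annotations, build_doi_index, make_demo_archive) are assumptions made for the example, and the demo records are made up.

```python
import io
import json
import zipfile


def iter_annotations(archive):
    """Yield (member_name, record) for every .json member of a zip archive.

    `archive` may be a filesystem path or a binary file-like object.
    Assumes one JSON annotation per member (an assumption, not confirmed).
    """
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            if name.endswith(".json"):
                with zf.open(name) as fh:
                    yield name, json.load(fh)


def build_doi_index(archive):
    """Map each DOI to the archive member that holds its annotation."""
    return {
        record["doi"]: name
        for name, record in iter_annotations(archive)
        if "doi" in record
    }


def make_demo_archive():
    """Build a tiny in-memory archive with made-up records for demonstration."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("a.json", json.dumps({"doi": "10.0000/demo.1", "keywords": ["steel"]}))
        zf.writestr("b.json", json.dumps({"doi": "10.0000/demo.2", "keywords": ["optics"]}))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for doi, member in build_doi_index(make_demo_archive()).items():
        print(doi, "->", member)
```

To try this on a real archive, pass its path to build_doi_index in place of the in-memory demo archive. Reading members directly from the zip avoids unpacking the whole group to disk.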
Annotation Structure
The annotations are structured as follows (JSON Schema):

    {
      "$schema": "http://json-schema.org/draft-07/schema#",
      "type": "object",
      "properties": {
        "doi": { "type": "string" },
        "keywords": { "type": "array", "items": { "type": "string" } },
        "license": { "type": "string" },
        "article": {
          "type": "object",
          "properties": {
            "title": { "type": "string" },
            "authors": { "type": "array", "items": { "type": "string" } },
            "abstract": { "type": "string" },
            "body_text": { "type": "string" },
            "bibliography": { "type": "string" }
          }
        }
      }
    }
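The schema above only constrains types and declares no required properties, so a record conforms as long as every property that is present has the declared type. The sketch below hand-codes that check using only the Python standard library. It is not a general JSON Schema validator, and the function names (schema_errors, is_string_list) are hypothetical. A full validator such as the third-party jsonschema package would be the more robust choice if it is available.

```python
# Property names taken from the schema; everything else here is illustrative.
TOP_LEVEL_STRINGS = ("doi", "license")
ARTICLE_STRINGS = ("title", "abstract", "body_text", "bibliography")


def is_string_list(value):
    """True if value is a list whose items are all strings."""
    return isinstance(value, list) and all(isinstance(v, str) for v in value)


def schema_errors(record):
    """Return a list of type mismatches; an empty list means the record conforms."""
    if not isinstance(record, dict):
        return ["record: expected object"]
    errors = []
    # Missing properties are allowed: the schema has no "required" list.
    for key in TOP_LEVEL_STRINGS:
        if key in record and not isinstance(record[key], str):
            errors.append(f"{key}: expected string")
    if "keywords" in record and not is_string_list(record["keywords"]):
        errors.append("keywords: expected array of strings")
    if "article" in record:
        article = record["article"]
        if not isinstance(article, dict):
            errors.append("article: expected object")
        else:
            for key in ARTICLE_STRINGS:
                if key in article and not isinstance(article[key], str):
                    errors.append(f"article.{key}: expected string")
            if "authors" in article and not is_string_list(article["authors"]):
                errors.append("article.authors: expected array of strings")
    return errors


if __name__ == "__main__":
    # Made-up record, used only to exercise the check.
    demo = {
        "doi": "10.0000/demo.1",
        "keywords": ["steel"],
        "license": "CC BY 4.0",
        "article": {"title": "Demo", "authors": ["A. Author"], "body_text": "..."},
    }
    print(schema_errors(demo))
```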
Citation
If you use this dataset in your scientific research, please cite the following paper:

    @InProceedings{Tarsi_2024_WACV,
      author    = {Tarsi, Tim and Adel, Heike and Metzen, Jan Hendrik and Zhang, Dan and Finco, Matteo and Friedrich, Annemarie},
      title     = {SciOL and MuLMS-Img: Introducing a Large-Scale Multimodal Scientific Dataset and Models for Image-Text Tasks in the Scientific Domain},
      booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
      month     = {January},
      year      = {2024},
      pages     = {4560-4571}
    }
License
The SciOL corpus is released under the CC BY 4.0 license.




