Timbrt/SciOL-CI

Name: Timbrt/SciOL-CI
Creator: Timbrt
Published: 2024-04-17 18:47:42
License: CC BY 4.0

Source: Hugging Face. Updated 2024-04-17; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/Timbrt/SciOL-CI

Description:

---
license: cc-by-4.0
language:
- en
size_categories:
- 10M<n<100M
pretty_name: Scientific Openly-Licensed Publications - Caption Images
configs:
- config_name: default
  data_files:
  - split: train
    path: train*/*.tar
  - split: validation
    path: dev/*.tar
  - split: test
    path: test/*.tar
---

# Scientific Openly-Licensed Publications

This repository contains companion material for the following [publication](https://openaccess.thecvf.com/content/WACV2024/papers/Tarsi_SciOL_and_MuLMS-Img_Introducing_a_Large-Scale_Multimodal_Scientific_Dataset_and_WACV_2024_paper.pdf):

> Tim Tarsi, Heike Adel, Jan Hendrik Metzen, Dan Zhang, Matteo Finco, Annemarie Friedrich. **SciOL and MuLMS-Img: Introducing A Large-Scale Multimodal Scientific Dataset and Models for Image-Text Tasks in the Scientific Domain.** WACV 2024.

Please cite this paper if you use the dataset, and direct any questions about the dataset to [Tim Tarsi](mailto:tim.tarsi@gmail.com).

## Summary

Scientific Openly-Licensed Publications (SciOL) is the largest openly-licensed pre-training corpus for multimodal models in the scientific domain, covering multiple sciences including materials science, physics, and computer science. It consists of over 2.7M scientific publications converted into semi-structured data. SciOL contains over 18 million figure-caption pairs.

**Note: This repository only contains the figures and captions of SciOL. For the textual data see:** [SciOL-text](https://huggingface.co/datasets/Timbrt/SciOL-text)

## Data Format

We provide the data in the webdataset format: captions are stored as plain text files, which are grouped and compressed together with the images into tar files. Each tar file contains 1000 images and their captions. A figure and its caption share the same filename (excluding the extension). The data is split into train, dev (validation), and test sets.
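The filename convention described above can be used to pair figures with captions when reading a shard directly. The following is a minimal sketch using only the Python standard library, not the authors' loader. The function name `iter_pairs` is hypothetical, and the `.txt` caption extension is an assumption that should be checked against the actual shards.

```python
import io
import os
import tarfile
from collections import defaultdict


def iter_pairs(fileobj, caption_ext=".txt"):
    """Yield (key, image_bytes, caption) for each figure-caption pair in one shard.

    `caption_ext` is an assumption: the card only says captions are plain text files.
    Any other file sharing the same base name is treated as the figure.
    """
    groups = defaultdict(dict)
    with tarfile.open(fileobj=fileobj, mode="r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            # Files that belong together share the name without its extension.
            key, ext = os.path.splitext(member.name)
            groups[key][ext.lower()] = tar.extractfile(member).read()

    for key, files in groups.items():
        caption = files.pop(caption_ext, None)
        if caption is None or not files:
            continue  # skip entries that lack either the caption or the figure
        image_bytes = next(iter(files.values()))
        yield key, image_bytes, caption.decode("utf-8")
```

To use it, open a downloaded shard in binary mode (for example `open(path, "rb")`, where `path` points to one of the tar files) and iterate over `iter_pairs(f)`. The sketch holds one whole shard in memory, which is acceptable for 1000 pairs but is not a streaming reader.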
## Citation

If you use our dataset in your work, please cite our paper:

```
@InProceedings{Tarsi_2024_WACV,
    author    = {Tarsi, Tim and Adel, Heike and Metzen, Jan Hendrik and Zhang, Dan and Finco, Matteo and Friedrich, Annemarie},
    title     = {SciOL and MuLMS-Img: Introducing a Large-Scale Multimodal Scientific Dataset and Models for Image-Text Tasks in the Scientific Domain},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    month     = {January},
    year      = {2024},
    pages     = {4560-4571}
}
```

## License

The SciOL corpus is released under the [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) license.

Provider:

Timbrt
