
axiong/pmc_oa

Source: Hugging Face · Updated 2023-08-22 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/axiong/pmc_oa
Resource description:
# PMC-OA Dataset

**News: We have released the PMC-OA dataset. You can choose a specific subset.**

**P.S.** The Hugging Face dataset viewer has problems when a dataset gets large, so we sampled a subset that can be viewed directly on the web. Click [PMC-OA-Demo](https://huggingface.co/datasets/axiong/pmc_oa_demo) to view it.

[中文文档 (Chinese documentation)](./README.zh.md)

- [PMC-OA Dataset](#pmc-oa-dataset)
- [Model Zoo](#model-zoo)
- [Dataset Structure](#dataset-structure)
- [Sample](#sample)

## Model Zoo

Check it out if you want to load a model pretrained on PMC-OA directly.

We plan to release more models pretrained on PMC-OA. Feel free to reach out to us if the model you want is not yet included in the model zoo.

We also thank the community for its help.

| Model | Link | Provider |
| --- | --- | --- |
| ViT-L-14 | https://huggingface.co/ryanyip7777/pmc_vit_l_14 | @ryanyip7777 |

## Dataset Structure

**PMC-OA** (separated images, separated captions).

- `images.zip`: images folder
- `pmc_oa.jsonl`: dataset file of pmc-oa
- `pmc_oa_beta.jsonl`: dataset file of pmc-oa-beta

~~- `train.jsonl`: metafile of train set~~
~~- `valid.jsonl`: metafile of valid set~~
~~- `test.jsonl`: metafile of test set~~

The difference between PMC-OA and PMC-OA-Beta lies in how captions are processed. In PMC-OA, we use ChatGPT to split compound captions into separate ones, while PMC-OA-Beta keeps all compound captions undivided.

## Sample

A row in `pmc_oa.jsonl` is shown below:

```python
{
    "image": "PMC212319_Fig3_4.jpg",
    "caption": "A. Real time image of the translocation of ARF1-GFP to the plasma membrane ...",
}
```

Explanation of each key:

- image: path to the image
- caption: the caption corresponding to the image
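
The row format above can be read with the standard library alone. The following is a minimal sketch, assuming that each line of `pmc_oa.jsonl` is a valid JSON object with the `image` and `caption` keys (the sample above is written as a Python literal with a trailing comma, which strict JSON would reject) and that `images.zip` has been unpacked into a local folder. The function name `iter_pmc_oa` and the `image_root` parameter are hypothetical, introduced only for illustration.

```python
import json
from pathlib import Path
from typing import Iterator, Tuple


def iter_pmc_oa(jsonl_path: str, image_root: str) -> Iterator[Tuple[Path, str]]:
    """Yield (image_path, caption) pairs from a PMC-OA style JSONL file.

    Assumes one JSON object per line with "image" and "caption" keys.
    """
    root = Path(image_root)
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # skip blank lines
                continue
            row = json.loads(line)
            # "image" is treated as a path relative to the extracted folder
            yield root / row["image"], row["caption"]
```

For example, `next(iter_pmc_oa("pmc_oa.jsonl", "images"))` would give the first image path and its caption; the folder name `images` is an assumption about where the archive was unpacked. The same function should work for `pmc_oa_beta.jsonl` if that file follows the same row layout.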
Provider:
axiong
Summary of original information

PMC-OA Dataset overview

Dataset structure

  • PMC-OA (separated images, separated captions)
    • images.zip: image folder
    • pmc_oa.jsonl: PMC-OA dataset file
    • pmc_oa_beta.jsonl: PMC-OA-Beta dataset file
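
Before unpacking the archive, it can be useful to see which image files it holds, for instance to compare them against the `image` values in the metafile. The sketch below uses Python's standard `zipfile` module. The internal layout of `images.zip` is not documented here, so the code makes no assumption about folder nesting and simply returns base file names; the `.jpg` suffix is inferred from the sample file name, and the function name `list_zip_images` is hypothetical.

```python
import zipfile
from pathlib import PurePosixPath
from typing import List


def list_zip_images(zip_path: str, suffix: str = ".jpg") -> List[str]:
    """Return the sorted base names of archive members ending with `suffix`."""
    with zipfile.ZipFile(zip_path) as zf:
        return sorted(
            PurePosixPath(name).name  # drop any folder prefix inside the archive
            for name in zf.namelist()
            if name.lower().endswith(suffix)
        )
```

Calling `list_zip_images("images.zip")` reads only the archive index, so it does not extract any files.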

Dataset sample

  • Example row from pmc_oa.jsonl: { "image": "PMC212319_Fig3_4.jpg", "caption": "A. Real time image of the translocation of ARF1-GFP to the plasma membrane ...", }

  • Key explanations:

    • image: path to the image
    • caption: the caption corresponding to the image
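
Since every row is expected to carry exactly these two keys, a small check can flag malformed rows before training. This is a minimal sketch under that assumption; `REQUIRED_KEYS` and `row_problems` are hypothetical names, not part of the dataset's tooling.

```python
from typing import Any, Dict, List

# Keys described in the key explanations above
REQUIRED_KEYS = ("image", "caption")


def row_problems(row: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems found in one parsed row."""
    problems = []
    for key in REQUIRED_KEYS:
        value = row.get(key)
        # Each key must be present and hold a non-empty string
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty '{key}'")
    return problems
```

An empty list means the row has both keys with non-empty string values.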
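Collected summary
Dataset introduction
Background and challenges
Background overview
PMC-OA is a large dataset of biomedical images and their captions, 28.8 GB in total, offered in two versions that differ in how captions are processed. The dataset is mainly used for text-to-image tasks and has been used to pretrain vision-language models.
The above content was collected and summarized by 遇见数据集.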