axiong/pmc_oa
Source: Hugging Face; updated 2023-08-22; indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/axiong/pmc_oa
Resource description:
# PMC-OA Dataset
**News: We have released the PMC-OA dataset. You can choose a specific subset.**
**P.S.** The Hugging Face dataset viewer does not work properly when a dataset gets large,
so we sampled a subset that can be viewed directly on the web. Click [PMC-OA-Demo](https://huggingface.co/datasets/axiong/pmc_oa_demo) to view it.
[Chinese documentation](./README.zh.md)
- [PMC-OA Dataset](#pmc-oa-dataset)
- [Model Zoo](#model-zoo)
- [Dataset Structure](#dataset-structure)
- [Sample](#sample)
## Model Zoo
Check out the models below if you want to load a model pretrained on PMC-OA directly.
We plan to release more models pretrained on PMC-OA. Feel free to reach out if the model you want is not yet in the model zoo.
We also thank the community for its help.
| Model | Link | Provider |
| --- | --- | --- |
| ViT-L-14 | https://huggingface.co/ryanyip7777/pmc_vit_l_14 | @ryanyip7777 |
## Dataset Structure
**PMC-OA** (separated images, separated captions).
- `images.zip`: images folder
- `pmc_oa.jsonl`: dataset file of pmc-oa
- `pmc_oa_beta.jsonl`: dataset file of pmc-oa-beta
~~- `train.jsonl`: metafile of train set~~
~~- `valid.jsonl`: metafile of valid set~~
~~- `test.jsonl`: metafile of test set~~
PMC-OA and PMC-OA-Beta differ in how captions are processed.
In PMC-OA, we use ChatGPT to split compound captions into separate ones,
whereas PMC-OA-Beta keeps the compound captions undivided.
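Because the images ship as a single `images.zip`, it can be convenient to read individual images straight from the archive instead of extracting everything first. The sketch below is one possible approach using only the Python standard library. The internal folder layout of `images.zip` is not documented here, so the code indexes members by file name rather than assuming a fixed path. The helper names `build_image_index` and `read_image_bytes` are hypothetical, and the tiny in-memory archive only stands in for the real file.

```python
import io
import posixpath
import zipfile
from typing import Dict


def build_image_index(archive: zipfile.ZipFile) -> Dict[str, str]:
    """Map each file name to its full member path inside the archive.

    Assumption: file names are unique across the archive, so the folder
    nesting inside images.zip does not matter for lookups.
    """
    index = {}
    for member in archive.namelist():
        if member.endswith("/"):  # skip directory entries
            continue
        index[posixpath.basename(member)] = member
    return index


def read_image_bytes(archive: zipfile.ZipFile, index: Dict[str, str], image_name: str) -> bytes:
    """Return the raw bytes of one image, looked up by its `image` value."""
    return archive.read(index[image_name])


# Build a tiny in-memory stand-in for images.zip (hypothetical content).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("images/PMC212319_Fig3_4.jpg", b"fake-jpeg-bytes")

# Read it back the same way one would open the real archive.
with zipfile.ZipFile(buffer) as zf:
    index = build_image_index(zf)
    data = read_image_bytes(zf, index, "PMC212319_Fig3_4.jpg")
```

With the real dataset, `zipfile.ZipFile("images.zip")` would replace the in-memory buffer, and the `image` value from each row of `pmc_oa.jsonl` or `pmc_oa_beta.jsonl` would serve as the lookup key, assuming those values match the file names in the archive.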
## Sample
A row in `pmc_oa.jsonl` is shown below:
```python
{
"image": "PMC212319_Fig3_4.jpg",
"caption": "A. Real time image of the translocation of ARF1-GFP to the plasma membrane ...",
}
```
Explanation of each key:
- image: path to the image
- caption: the caption corresponding to the image
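As a rough illustration of how these two keys might be consumed, the sketch below parses JSONL rows with Python's standard `json` module. It assumes each line of the file is one valid JSON object with `image` and `caption` keys (the trailing comma in the sample above would not parse as strict JSON, so the sample line here omits it). The helper name `iter_records` is hypothetical.

```python
import json
from typing import Dict, Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one {"image", "caption"} dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if not line:  # tolerate blank lines, e.g. at the end of the file
            continue
        record = json.loads(line)
        yield {"image": record["image"], "caption": record["caption"]}


# A single sample row, copied from the example above without the trailing comma.
sample_lines = [
    '{"image": "PMC212319_Fig3_4.jpg", "caption": "A. Real time image of the '
    'translocation of ARF1-GFP to the plasma membrane ..."}',
    "",
]
records = list(iter_records(sample_lines))
```

For the real file, an open file handle can be passed in directly, for example `with open("pmc_oa.jsonl", encoding="utf-8") as f:` followed by a loop over `iter_records(f)`, which keeps memory use low because rows are read one at a time.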
Provider:
axiong
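Summary
Background overview
PMC-OA is a large dataset of biomedical images and their captions, with a total size of 28.8 GB, offered in two versions that differ in how captions are processed. The dataset is mainly used for text-to-image tasks and has been used to pretrain vision-language models.
The summary above was collected and generated by 遇见数据集.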



