
chaoyi-wu/PMC-Inline

Source: Hugging Face. Updated 2023-08-06. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/chaoyi-wu/PMC-Inline
Resource description:
---
license: apache-2.0
task_categories:
- text-generation
tags:
- biology
---

# PMC-Inline Dataset

- [PMC-Inline Dataset](#pmc-inline-dataset)
  - [Dataset Structure](#dataset-structure)
  - [Sample](#sample)

This repository holds the text part of the dataset. The figure part can be downloaded from https://pan.baidu.com/s/1Src_rhXsaOFp8zJ_3zMFsQ?pwd=p3ne.

## Dataset Structure

**PMC-Inline** (PMC papers with inline figures). We collect the CC-licensed papers from PubMed Central and remove the bibliography, author information, tables, and image captions from the original paper XML files. Based on the inline figure references, we link 11M images back into the paper contexts.

Each paper is stored as `PMCxxxxxxx.json`, where `xxxxxxx` is the paper's unique PMC ID.

## Sample

A JSON file in the dataset is organized as below:

| Key | Value |
| ------- | ------- |
| info | {"article-type": "research-article", "pmid": "17925856", "pmc": "PMC1999654", "publisher-id": "07-PONE-RA-01026R1", "doi": "10.1371/journal.pone.0001008"} |
| text | \nPredicting Spatial Patterns of Plant Recruitment Using Animal-Displacement Kernels\nFor plants ... |
| img_ref | [{"id": "pone-0001008-g001", "start": 9177, "end": 9185}, {"id": "pone-0001008-g001", "start": 10715, "end": 10723}, ...] |

Explanation of each key:

- info: metadata about the paper, such as the article type, PMID, and PMC ID.
- text: a string holding the paper content.
- img_ref: a list recording which images are referenced and where each reference occurs in the original paper. For example, {"id": "pone-0001008-g001", "start": 9177, "end": 9185} denotes that the figure pone-0001008-g001 is mentioned in the text string at index 9177-9185.

You can get the images from our PMC figure part. Figures are named uniformly as `PMCxxxxxxx_figid.jpg`, for example `PMC1999654_pone-0001008-g001.jpg`.

Note that our PMC figures were collected before PMC-Inline, and some papers were updated during that time window. As a result, some figures may be missing from our figure base.
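The `img_ref` offsets described above can be used to pull each figure mention out of the `text` string. The following is a minimal sketch, not the authors' tooling: the helper names `load_paper` and `resolve_image_refs` are hypothetical, the demo record is synthetic, and it assumes `end` is exclusive in the Python slicing sense, which the dataset card does not state.

```python
import json


def load_paper(path):
    """Load one PMCxxxxxxx.json file into a dict (hypothetical helper)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def resolve_image_refs(record, context=40):
    """Return one dict per img_ref entry with the mention and nearby text."""
    text = record["text"]
    mentions = []
    for ref in record["img_ref"]:
        start, end = ref["start"], ref["end"]
        mentions.append({
            "id": ref["id"],
            # Assumption: `end` is exclusive; adjust if mentions look truncated.
            "mention": text[start:end],
            "context": text[max(0, start - context):end + context],
        })
    return mentions


# Synthetic record that mimics the documented keys; not real dataset content.
sample_text = "Intro. As shown in Figure 1, seeds disperse."
start = sample_text.index("Figure 1")
record = {
    "info": {"pmc": "PMC0000000"},
    "text": sample_text,
    "img_ref": [{"id": "fig-001", "start": start, "end": start + len("Figure 1")}],
}

for m in resolve_image_refs(record):
    print(m["id"], repr(m["mention"]))
```

For a real file, `resolve_image_refs(load_paper("PMC1999654.json"))` would be the equivalent call, assuming the file has been downloaded locally.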
Provided by:
chaoyi-wu
Summary of original information

PMC-Inline Dataset

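Dataset structure

  • The PMC-Inline dataset contains CC-licensed papers collected from PubMed Central, with the bibliography, author information, tables, and image captions removed from the original paper XML files.
  • Based on the inline figure references, 11M images are linked back into the paper contexts.
  • Each paper is stored as a file named PMCxxxxxxx.json, where xxxxxxx is the paper's unique PMC ID.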

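Sample

  • A JSON file in the dataset is organized as follows:
    • info: the article type, PMID, PMC ID, and similar metadata.
    • text: the paper content.
    • img_ref: image reference information, indicating where each image is referenced in the original text.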

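Obtaining images

  • Images are named uniformly as PMCxxxxxxx_figid.jpg, for example PMC1999654_pone-0001008-g001.jpg
  • Note: because the PMC figures were collected before PMC-Inline, images may be missing for papers that were updated in the meantime.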
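Because some figures may be absent, it can help to check which referenced images actually exist before using them. The sketch below builds file names from the `PMCxxxxxxx_figid.jpg` pattern and splits figure ids into found and missing. The helper names, the flat directory layout, and the second figure id in the demo are assumptions made for illustration; the prefix is taken to match the `pmc` field in `info`, as it does in the sample.

```python
import tempfile
from pathlib import Path


def figure_filename(pmc_id, fig_id):
    """Build a file name following the PMCxxxxxxx_figid.jpg pattern."""
    return f"{pmc_id}_{fig_id}.jpg"


def split_by_availability(pmc_id, img_refs, figure_dir):
    """Return (found, missing) figure ids for one paper.

    Assumes all figures sit directly inside `figure_dir` (hypothetical layout).
    """
    figure_dir = Path(figure_dir)
    found, missing = [], []
    # A figure can be referenced several times, so deduplicate the ids first.
    for fig_id in sorted({ref["id"] for ref in img_refs}):
        path = figure_dir / figure_filename(pmc_id, fig_id)
        (found if path.is_file() else missing).append(fig_id)
    return found, missing


# Demo with a temporary directory standing in for the downloaded figure part.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "PMC1999654_pone-0001008-g001.jpg").touch()
    refs = [
        {"id": "pone-0001008-g001", "start": 9177, "end": 9185},
        # Hypothetical second figure with made-up offsets, to show a miss.
        {"id": "pone-0001008-g002", "start": 100, "end": 108},
    ]
    found, missing = split_by_availability("PMC1999654", refs, tmp)
    print("found:", found)
    print("missing:", missing)
```

Skipping or logging the ids in `missing` keeps downstream code from failing on papers whose figures were not captured.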