introvoyz041/Medex
Source: Hugging Face. Updated: 2025-12-25. Indexed: 2026-03-29.
Download link:
https://hf-mirror.com/datasets/introvoyz041/Medex
Description:
---
dataset_info:
  features:
  - name: PMID
    dtype: large_string
  - name: DOI
    dtype: large_string
  - name: entity
    dtype: large_string
  - name: fact
    dtype: large_string
  - name: MolInfo
    struct:
    - name: SMILES
      dtype: large_string
  - name: GeneInfo
    struct:
    - name: NCBI_Gene_ID
      dtype: int64
    - name: protein_refseq_id
      dtype: large_string
    - name: gene_refseq_id
      dtype: large_string
  - name: ISSN
    dtype: large_string
  - name: eISSN
    dtype: large_string
  - name: Journal
    dtype: large_string
  splits:
  - name: train
    num_bytes: 12887091678
    num_examples: 36308777
  download_size: 3490707811
  dataset_size: 12887091678
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
tags:
- biology
- chemistry
- medical
- synthetic
---
This is the initial release of the `Medex` dataset, which contains facts about small molecules and genes/proteins extracted from a large number of PubMed articles. Each fact is paired with an identifier for the entity it describes: the SMILES string for a small molecule, or the NCBI Gene ID for a gene/protein.
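As a rough illustration of how the two identifier types can sit side by side in one record, the sketch below returns whichever identifier is present. The field names come from the schema above. The helper name `primary_identifier`, the sample records, and the assumption that the inapplicable struct is `None` (or holds only `None` values) are illustrative guesses, not documented behavior of the dataset.

```python
from typing import Any, Mapping, Optional, Tuple


def primary_identifier(record: Mapping[str, Any]) -> Optional[Tuple[str, Any]]:
    """Return ("SMILES", value) or ("NCBI_Gene_ID", value), or None if neither is set.

    Assumption: a gene/protein record has MolInfo missing or None, and a
    small-molecule record has GeneInfo missing or None.
    """
    mol = record.get("MolInfo") or {}
    gene = record.get("GeneInfo") or {}
    if mol.get("SMILES"):
        return ("SMILES", mol["SMILES"])
    if gene.get("NCBI_Gene_ID") is not None:
        return ("NCBI_Gene_ID", gene["NCBI_Gene_ID"])
    return None


# Hypothetical records shaped like the schema; they are not rows from the dataset.
molecule_fact = {
    "entity": "aspirin",
    "fact": "placeholder fact text",
    "MolInfo": {"SMILES": "CC(=O)OC1=CC=CC=C1C(=O)O"},
    "GeneInfo": None,
}
gene_fact = {
    "entity": "TP53",
    "fact": "placeholder fact text",
    "MolInfo": None,
    "GeneInfo": {"NCBI_Gene_ID": 7157, "protein_refseq_id": None, "gene_refseq_id": None},
}

print(primary_identifier(molecule_fact))
print(primary_identifier(gene_fact))
```

If the real rows encode a missing struct differently (for example with empty strings), the two `or {}` fallbacks and the truthiness checks would need adjusting.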
We also include the publication venue of the paper from which each fact was extracted (journal name, ISSN, and eISSN) to allow coarse-grained filtering by rigor or focus.
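One way such venue-based filtering might look is sketched below: a predicate that keeps a record when its `ISSN` or `eISSN` appears in an allow-list. The names `normalize_issn` and `make_venue_filter` are hypothetical, the ISSN values are placeholders rather than real journals, and the hyphen-insensitive comparison is an assumption about how the identifiers may be formatted.

```python
from typing import Any, Callable, Iterable, Mapping


def normalize_issn(issn: str) -> str:
    """Drop hyphens and surrounding whitespace, and uppercase the check character."""
    return issn.replace("-", "").strip().upper()


def make_venue_filter(issns: Iterable[str]) -> Callable[[Mapping[str, Any]], bool]:
    """Build a predicate that is True when a record's ISSN or eISSN is allowed."""
    allowed = frozenset(normalize_issn(i) for i in issns)

    def keep(record: Mapping[str, Any]) -> bool:
        candidates = (record.get("ISSN"), record.get("eISSN"))
        # Skip missing or empty identifiers before normalizing.
        return any(c and normalize_issn(c) in allowed for c in candidates)

    return keep


# Placeholder ISSNs for illustration only.
keep = make_venue_filter(["0000-0001", "0000-000x"])

sample = [
    {"Journal": "Journal A", "ISSN": "0000-0001", "eISSN": None},
    {"Journal": "Journal B", "ISSN": None, "eISSN": "0000000X"},
    {"Journal": "Journal C", "ISSN": "9999-9999", "eISSN": ""},
]
print([r["Journal"] for r in sample if keep(r)])

# With the dataset loaded as `ds`, the same predicate could be passed to
# `ds.filter(keep)`.
```

Building the allow-list once and closing over it avoids recomputing the set for each of the roughly 36 million records.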
As we extract more facts from PubMed, we will upload expanded versions here.
The dataset can be loaded with the Hugging Face `datasets` library as follows:
```python
from datasets import load_dataset
# Login using e.g. `huggingface-cli login` to access this dataset
ds = load_dataset("medexanon/Medex", split="train")
```
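Because the download is about 3.5 GB (per `download_size` in the metadata above), it can be convenient to look at a few records before fetching everything. The following is a minimal sketch using the `streaming=True` option of `load_dataset`. The helper names `head` and `stream_head` are hypothetical, and how well streaming performs for this particular repository has not been verified here.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List


def head(records: Iterable[Dict[str, Any]], n: int) -> List[Dict[str, Any]]:
    """Materialize at most the first n records of any iterable."""
    return list(islice(records, n))


def stream_head(n: int = 5, repo_id: str = "medexanon/Medex") -> List[Dict[str, Any]]:
    """Lazily fetch the first n training records (needs network access and `datasets`)."""
    # Imported here so that `head` stays usable without the library installed.
    from datasets import load_dataset

    stream = load_dataset(repo_id, split="train", streaming=True)
    return head(stream, n)


# Example (requires network access and a prior login if the repository needs one):
# for row in stream_head(3):
#     print(row["PMID"], row["entity"])
```

Streaming returns an iterable dataset rather than an indexable one, so random access by row number is not available in this mode.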
Croissant information can be loaded as follows:
```python
import mlcroissant as mlc
croissant_dataset = mlc.Dataset("https://huggingface.co/api/datasets/medexanon/Medex/croissant")
print(croissant_dataset.metadata.record_sets)
```
Provider:
introvoyz041



