OpenMed/drugprot-parquet
Source: Hugging Face. Updated 2026-02-23; indexed 2026-04-05.
Download link:
https://hf-mirror.com/datasets/OpenMed/drugprot-parquet
Resource description:
---
language:
- en
license: cc-by-4.0
size_categories:
- 1K<n<10K
task_categories:
- text-classification
- token-classification
tags:
- biology
- medical
- biomedical
- NLP
- relation-extraction
- drug-protein-interactions
- BioCreative
- PubMed
- pharmacology
- NER
dataset_info:
  features:
  - name: pmid
    dtype: string
  - name: title
    dtype: string
  - name: abstract
    dtype: string
  - name: text
    dtype: string
  - name: entities
    list:
    - name: id
      dtype: string
    - name: type
      dtype: string
    - name: text
      dtype: string
    - name: start
      dtype: int64
    - name: end
      dtype: int64
  - name: relations
    list:
    - name: type
      dtype: string
    - name: arg1
      dtype: string
    - name: arg2
      dtype: string
  splits:
  - name: train
    num_examples: 3500
  - name: validation
    num_examples: 750
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train.parquet
  - split: validation
    path: data/validation.parquet
---
# DrugProt (Parquet)
A Parquet conversion of the **DrugProt** corpus from [BioCreative VII Track 1](https://biocreative.bioinformatics.udel.edu/tasks/biocreative-vii/track-1/) that loads directly with the Hugging Face `datasets` library.
DrugProt is a gold-standard corpus of **4,250 PubMed abstracts** annotated for **drug/chemical–protein interactions**, covering **13 fine-grained relation types** and **3 entity types**. It is designed for training and evaluating biomedical relation extraction systems.
## Quick Start
```python
from datasets import load_dataset
dataset = load_dataset("OpenMed/drugprot-parquet")
# Access splits
train = dataset["train"] # 3,500 abstracts
val = dataset["validation"] # 750 abstracts
# Inspect a sample
example = train[0]
print(example["title"])
print(f"Entities: {len(example['entities'])}")
print(f"Relations: {len(example['relations'])}")
```
## Dataset Description
Each example represents a PubMed abstract with expert-annotated entity spans and relation labels:
| Field | Type | Description |
|-------|------|-------------|
| `pmid` | `string` | PubMed article ID |
| `title` | `string` | Article title |
| `abstract` | `string` | Article abstract |
| `text` | `string` | Full text (title + abstract) |
| `entities` | `list[dict]` | Annotated entity spans |
| `relations` | `list[dict]` | Annotated drug–protein relations |
### Entity Schema
Each entity contains:
| Field | Type | Description |
|-------|------|-------------|
| `id` | `string` | Unique entity ID (e.g., `T1`, `T2`) |
| `type` | `string` | Entity type: `CHEMICAL`, `GENE-Y`, or `GENE-N` |
| `text` | `string` | Surface text of the entity mention |
| `start` | `int` | Start character offset in the record's `text` field |
| `end` | `int` | End character offset (exclusive) in the record's `text` field |
**Entity types:**
- **CHEMICAL**: Drugs, small molecules, metabolites, and other chemical compounds
- **GENE-Y**: Gene/protein mentions that can be normalized to a specific gene/protein
- **GENE-N**: Gene/protein mentions that cannot be normalized (e.g., protein families, complexes)
### Relation Schema
Each relation contains:
| Field | Type | Description |
|-------|------|-------------|
| `type` | `string` | One of 13 relation categories (see below) |
| `arg1` | `string` | Entity ID of the first argument |
| `arg2` | `string` | Entity ID of the second argument |
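The two schemas above are linked only by entity IDs and character offsets, so a quick consistency check can be useful before training. The following is a minimal sketch, not part of the original card: `validate_example` is a hypothetical helper, and the toy records are made up to match the documented fields rather than taken from the corpus.

```python
def validate_example(example):
    """Return a list of human-readable problems found in one record."""
    problems = []
    text = example["text"]
    known_ids = set()
    for ent in example["entities"]:
        known_ids.add(ent["id"])
        # The offsets should slice the mention out of the record's text field.
        if text[ent["start"]:ent["end"]] != ent["text"]:
            problems.append(f"{ent['id']}: offsets do not match surface text")
    for rel in example["relations"]:
        for arg in ("arg1", "arg2"):
            # Every relation argument should refer to an entity in the same record.
            if rel[arg] not in known_ids:
                problems.append(f"{rel['type']}: unknown entity id {rel[arg]}")
    return problems


# Toy record (hypothetical, not from the corpus) following the documented schema.
toy = {
    "text": "Aspirin inhibits COX-1.",
    "entities": [
        {"id": "T1", "type": "CHEMICAL", "text": "Aspirin", "start": 0, "end": 7},
        {"id": "T2", "type": "GENE-Y", "text": "COX-1", "start": 17, "end": 22},
    ],
    "relations": [{"type": "INHIBITOR", "arg1": "T1", "arg2": "T2"}],
}
# Same record, but with a relation pointing at an entity that does not exist.
broken = dict(toy, relations=[{"type": "INHIBITOR", "arg1": "T1", "arg2": "T9"}])

print(validate_example(toy))     # []
print(validate_example(broken))  # ['INHIBITOR: unknown entity id T9']
```

Run over a loaded split, the same function would flag any record whose offsets or argument IDs fail to line up.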
## Relation Types (13 Classes)
| Relation | Description | Train | Val |
|----------|-------------|------:|----:|
| `INHIBITOR` | Chemical inhibits the protein | 5,388 | 1,150 |
| `DIRECT-REGULATOR` | Chemical directly regulates protein (mechanism unspecified) | 2,247 | 458 |
| `SUBSTRATE` | Chemical is a substrate of the enzyme | 2,003 | 494 |
| `ACTIVATOR` | Chemical activates the protein | 1,428 | 246 |
| `INDIRECT-UPREGULATOR` | Chemical indirectly increases protein activity/expression | 1,378 | 302 |
| `INDIRECT-DOWNREGULATOR` | Chemical indirectly decreases protein activity/expression | 1,329 | 332 |
| `ANTAGONIST` | Chemical acts as antagonist of the receptor/protein | 972 | 218 |
| `PRODUCT-OF` | Chemical is a product of the enzyme | 920 | 158 |
| `PART-OF` | Chemical is part of the protein complex | 885 | 257 |
| `AGONIST` | Chemical acts as agonist of the receptor/protein | 658 | 131 |
| `AGONIST-ACTIVATOR` | Chemical is both agonist and activator | 29 | 10 |
| `SUBSTRATE_PRODUCT-OF` | Chemical is both substrate and product | 24 | 3 |
| `AGONIST-INHIBITOR` | Chemical is agonist but inhibits downstream effects | 13 | 2 |
| **Total** | | **17,274** | **3,761** |
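The per-class counts in this table can in principle be recomputed from the `relations` field. The sketch below assumes nothing beyond the documented schema. `count_relation_types` is a hypothetical helper, and the toy records are invented for illustration, so their counts are unrelated to the figures above.

```python
from collections import Counter


def count_relation_types(examples):
    """Count relation labels across an iterable of dataset records."""
    counts = Counter()
    for example in examples:
        counts.update(rel["type"] for rel in example["relations"])
    return counts


# Toy records (hypothetical, not from the corpus); only `relations` is needed here.
toy_examples = [
    {"relations": [{"type": "INHIBITOR", "arg1": "T1", "arg2": "T2"},
                   {"type": "SUBSTRATE", "arg1": "T1", "arg2": "T3"}]},
    {"relations": [{"type": "INHIBITOR", "arg1": "T4", "arg2": "T5"}]},
    {"relations": []},
]

toy_counts = count_relation_types(toy_examples)
for label, n in toy_counts.most_common():
    print(f"{label:<12} {n}")
```

Passing a loaded split (for example the `train` split from the Quick Start) in place of `toy_examples` should yield one count per relation class, which makes the strong class imbalance easy to inspect before choosing a loss or sampling strategy.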
## Dataset Statistics
| | Train | Validation | Total |
|--|------:|-----------:|------:|
| Abstracts | 3,500 | 750 | **4,250** |
| Abstracts with relations | 2,433 | — | — |
| Total entities | 89,529 | — | — |
| • CHEMICAL | 46,274 (51.7%) | — | — |
| • GENE-Y | 28,421 (31.7%) | — | — |
| • GENE-N | 14,834 (16.6%) | — | — |
| Total relations | 17,274 | 3,761 | **21,035** |
| Avg. entities / abstract | 25.6 | — | — |
| Avg. relations / abstract | 4.9 | — | — |
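The validation column is left blank for most rows, so readers who need those figures would have to derive them. A minimal sketch of how such statistics could be computed follows. `corpus_stats` is a hypothetical helper, and the toy records are invented, so the printed values say nothing about the real corpus.

```python
from collections import Counter


def corpus_stats(examples):
    """Recompute headline statistics for an iterable of dataset records."""
    n_docs = n_with_relations = n_relations = 0
    entity_types = Counter()
    for example in examples:
        n_docs += 1
        entity_types.update(ent["type"] for ent in example["entities"])
        n_relations += len(example["relations"])
        if example["relations"]:
            n_with_relations += 1
    n_entities = sum(entity_types.values())
    return {
        "abstracts": n_docs,
        "abstracts_with_relations": n_with_relations,
        "entities": n_entities,
        "relations": n_relations,
        # Share of each entity type among all entity mentions.
        "entity_share": {t: c / n_entities for t, c in entity_types.items()}
        if n_entities else {},
        "avg_entities": n_entities / n_docs if n_docs else 0.0,
        "avg_relations": n_relations / n_docs if n_docs else 0.0,
    }


# Toy records (hypothetical); only the fields read by corpus_stats are filled in.
toy_examples = [
    {"entities": [{"type": "CHEMICAL"}, {"type": "GENE-Y"}, {"type": "GENE-N"}],
     "relations": [{"type": "INHIBITOR", "arg1": "T1", "arg2": "T2"}]},
    {"entities": [{"type": "CHEMICAL"}], "relations": []},
]

stats = corpus_stats(toy_examples)
print(stats["avg_entities"], stats["avg_relations"])  # 2.0 0.5
```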
## Usage Examples
### Relation Extraction
```python
from datasets import load_dataset

ds = load_dataset("OpenMed/drugprot-parquet", split="train")

for example in ds:
    # Index entities by ID so relation arguments can be resolved
    entities = {e["id"]: e for e in example["entities"]}
    for rel in example["relations"]:
        arg1 = entities[rel["arg1"]]
        arg2 = entities[rel["arg2"]]
        print(f"{arg1['text']} --[{rel['type']}]--> {arg2['text']}")
```
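Relation extraction is often framed as classifying candidate entity pairs, which also requires negative examples for pairs with no annotated relation. The following is one possible sketch of that framing, not the method used by the dataset authors. The marker strings, the `NONE` label, and the helper names `mark_pair` and `candidate_pairs` are all assumptions introduced for illustration, and the toy record is made up.

```python
def mark_pair(text, chem, gene):
    """Wrap both mentions in marker tokens; return None if the spans overlap."""
    first, second = sorted([chem, gene], key=lambda e: e["start"])
    if first["end"] > second["start"]:
        return None
    tags = {chem["id"]: ("[CHEM]", "[/CHEM]"), gene["id"]: ("[GENE]", "[/GENE]")}
    # Insert markers from right to left so earlier offsets stay valid.
    for ent in (second, first):
        open_tag, close_tag = tags[ent["id"]]
        text = (text[:ent["start"]] + open_tag + text[ent["start"]:ent["end"]]
                + close_tag + text[ent["end"]:])
    return text


def candidate_pairs(example, negative_label="NONE"):
    """Yield one classification instance per (chemical, gene) pair in a record."""
    # Pairs are looked up regardless of argument order. If a pair carries
    # several relation types, this simple mapping keeps only the last one.
    gold = {frozenset((r["arg1"], r["arg2"])): r["type"] for r in example["relations"]}
    chems = [e for e in example["entities"] if e["type"] == "CHEMICAL"]
    genes = [e for e in example["entities"] if e["type"].startswith("GENE")]
    for chem in chems:
        for gene in genes:
            marked = mark_pair(example["text"], chem, gene)
            if marked is not None:
                key = frozenset((chem["id"], gene["id"]))
                yield {"text": marked, "label": gold.get(key, negative_label)}


# Toy record (hypothetical, not from the corpus).
toy = {
    "text": "Aspirin inhibits COX-1 and COX-2.",
    "entities": [
        {"id": "T1", "type": "CHEMICAL", "text": "Aspirin", "start": 0, "end": 7},
        {"id": "T2", "type": "GENE-Y", "text": "COX-1", "start": 17, "end": 22},
        {"id": "T3", "type": "GENE-Y", "text": "COX-2", "start": 27, "end": 32},
    ],
    "relations": [{"type": "INHIBITOR", "arg1": "T1", "arg2": "T2"}],
}

pairs = list(candidate_pairs(toy))
for pair in pairs:
    print(pair["label"], "|", pair["text"])
```

Enumerating every chemical and gene pair in an abstract can produce many negatives, so in practice the candidates are often restricted, for instance to pairs within the same sentence.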
### Named Entity Recognition (NER)
```python
from datasets import load_dataset

ds = load_dataset("OpenMed/drugprot-parquet", split="train")

for example in ds:
    text = example["text"]
    for ent in example["entities"]:
        # Offsets index into the `text` field; verify they recover the mention
        span = text[ent["start"]:ent["end"]]
        assert span == ent["text"], f"Offset mismatch: '{span}' != '{ent['text']}'"
        print(f"[{ent['type']}] {ent['text']} ({ent['start']}:{ent['end']})")
```
### Convert to Token Classification Format
```python
from datasets import load_dataset

ds = load_dataset("OpenMed/drugprot-parquet", split="train")

# Build character-level BIO tags from character offsets
example = ds[0]
text = example["text"]
char_labels = ["O"] * len(text)
for ent in sorted(example["entities"], key=lambda e: e["start"]):
    tag = ent["type"]
    char_labels[ent["start"]] = f"B-{tag}"
    for i in range(ent["start"] + 1, ent["end"]):
        char_labels[i] = f"I-{tag}"
```
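The snippet above yields one label per character, whereas token classification models expect one label per token. A minimal sketch of that projection is shown below, assuming plain whitespace tokenization. A real pipeline would more likely align labels with a subword tokenizer's offsets. The helper names and the toy record are hypothetical.

```python
import re


def char_bio_labels(text, entities):
    """Character-level BIO labels, built the same way as in the snippet above."""
    labels = ["O"] * len(text)
    for ent in sorted(entities, key=lambda e: e["start"]):
        labels[ent["start"]] = f"B-{ent['type']}"
        for i in range(ent["start"] + 1, ent["end"]):
            labels[i] = f"I-{ent['type']}"
    return labels


def token_bio_labels(text, entities):
    """Project character labels onto whitespace-delimited tokens."""
    char_labels = char_bio_labels(text, entities)
    tokens, tags = [], []
    for match in re.finditer(r"\S+", text):
        span = char_labels[match.start():match.end()]
        # Use the first non-O character label inside the token, if any.
        tag = next((label for label in span if label != "O"), "O")
        # An I- tag is kept only when the previous token has the same entity
        # type; otherwise the entity starts inside this token, so use B-.
        if tag.startswith("I-") and (not tags or tags[-1][2:] != tag[2:]):
            tag = "B-" + tag[2:]
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags


# Toy record (hypothetical, not from the corpus).
toy_text = "Acetylsalicylic acid inhibits COX-1."
toy_entities = [
    {"id": "T1", "type": "CHEMICAL", "text": "Acetylsalicylic acid", "start": 0, "end": 20},
    {"id": "T2", "type": "GENE-Y", "text": "COX-1", "start": 30, "end": 35},
]

tokens, tags = token_bio_labels(toy_text, toy_entities)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Note that whitespace tokens can include trailing punctuation (`COX-1.` here), and overlapping or nested entities would overwrite each other in the character labels, so this projection is only a starting point.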
## Source
This dataset is a Parquet conversion of the [DrugProt BioCreative VII corpus](https://biocreative.bioinformatics.udel.edu/tasks/biocreative-vii/track-1/). The original data was released under **CC BY 4.0** for the BioCreative VII shared task.
**Original paper:**
> M. Krallinger, O. Rabal, A. Lourenco, J. Oyarzabal, A. Valencia.
> *"Overview of the BioCreative VII Track 1 – DrugProt: Drug-Protein Relation Extraction."*
> Proceedings of the BioCreative VII Challenge Evaluation Workshop, 2021.
**BibTeX:**
```bibtex
@inproceedings{drugprot2021,
title={Overview of the BioCreative VII Track 1 -- DrugProt: Drug-Protein Relation Extraction},
author={Krallinger, Martin and Rabal, Obdulia and Lourenco, Analia and Oyarzabal, Julen and Valencia, Alfonso},
booktitle={Proceedings of the BioCreative VII Challenge Evaluation Workshop},
year={2021}
}
```
## License
CC BY 4.0 — following the original DrugProt corpus license.
## About OpenMed
[OpenMed](https://huggingface.co/OpenMed) provides clean, standardized biomedical datasets and RL training environments for medical AI research.
Provider:
OpenMed



