OpenMed/drugprot-parquet

Source: Hugging Face · Updated 2026-02-23 · Indexed 2026-04-05
Download link:
https://hf-mirror.com/datasets/OpenMed/drugprot-parquet
Description:
---
language:
- en
license: cc-by-4.0
size_categories:
- 1K<n<10K
task_categories:
- text-classification
- token-classification
tags:
- biology
- medical
- biomedical
- NLP
- relation-extraction
- drug-protein-interactions
- BioCreative
- PubMed
- pharmacology
- NER
dataset_info:
  features:
  - name: pmid
    dtype: string
  - name: title
    dtype: string
  - name: abstract
    dtype: string
  - name: text
    dtype: string
  - name: entities
    list:
    - name: id
      dtype: string
    - name: type
      dtype: string
    - name: text
      dtype: string
    - name: start
      dtype: int64
    - name: end
      dtype: int64
  - name: relations
    list:
    - name: type
      dtype: string
    - name: arg1
      dtype: string
    - name: arg2
      dtype: string
  splits:
  - name: train
    num_examples: 3500
  - name: validation
    num_examples: 750
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train.parquet
  - split: validation
    path: data/validation.parquet
---

# DrugProt (Parquet)

A clean, ready-to-use Parquet version of the **DrugProt** corpus from [BioCreative VII Track 1](https://biocreative.bioinformatics.udel.edu/tasks/biocreative-vii/track-1/), converted for direct use with the Hugging Face `datasets` library.

DrugProt is a gold-standard corpus of **4,250 PubMed abstracts** annotated for **drug/chemical–protein interactions**, covering **13 fine-grained relation types** and **3 entity types**. It is designed for training and evaluating biomedical relation extraction systems.

## Quick Start

```python
from datasets import load_dataset

dataset = load_dataset("OpenMed/drugprot-parquet")

# Access splits
train = dataset["train"]       # 3,500 abstracts
val = dataset["validation"]    # 750 abstracts

# Inspect a sample
example = train[0]
print(example["title"])
print(f"Entities: {len(example['entities'])}")
print(f"Relations: {len(example['relations'])}")
```

## Dataset Description

Each example represents a PubMed abstract with expert-annotated entity spans and relation labels:

| Field | Type | Description |
|-------|------|-------------|
| `pmid` | `string` | PubMed article ID |
| `title` | `string` | Article title |
| `abstract` | `string` | Article abstract |
| `text` | `string` | Full text (title + abstract) |
| `entities` | `list[dict]` | Annotated entity spans |
| `relations` | `list[dict]` | Annotated drug–protein relations |

### Entity Schema

Each entity contains:

| Field | Type | Description |
|-------|------|-------------|
| `id` | `string` | Unique entity ID (e.g., `T1`, `T2`) |
| `type` | `string` | Entity type: `CHEMICAL`, `GENE-Y`, or `GENE-N` |
| `text` | `string` | Surface text of the entity mention |
| `start` | `int` | Start character offset in the `text` field |
| `end` | `int` | End character offset (exclusive) in the `text` field |

**Entity types:**

- **CHEMICAL**: Drugs, small molecules, metabolites, and other chemical compounds
- **GENE-Y**: Gene/protein mentions that are normalized to a valid gene/protein
- **GENE-N**: Gene/protein mentions that are NOT normalized (e.g., protein families, complexes)

### Relation Schema

Each relation contains:

| Field | Type | Description |
|-------|------|-------------|
| `type` | `string` | One of 13 relation categories (see below) |
| `arg1` | `string` | Entity ID of the first argument |
| `arg2` | `string` | Entity ID of the second argument |

## Relation Types (13 Classes)

| Relation | Description | Train | Val |
|----------|-------------|------:|----:|
| `INHIBITOR` | Chemical inhibits the protein | 5,388 | 1,150 |
| `DIRECT-REGULATOR` | Chemical directly regulates protein (mechanism unspecified) | 2,247 | 458 |
| `SUBSTRATE` | Chemical is a substrate of the enzyme | 2,003 | 494 |
| `ACTIVATOR` | Chemical activates the protein | 1,428 | 246 |
| `INDIRECT-UPREGULATOR` | Chemical indirectly increases protein activity/expression | 1,378 | 302 |
| `INDIRECT-DOWNREGULATOR` | Chemical indirectly decreases protein activity/expression | 1,329 | 332 |
| `ANTAGONIST` | Chemical acts as antagonist of the receptor/protein | 972 | 218 |
| `PRODUCT-OF` | Chemical is a product of the enzyme | 920 | 158 |
| `PART-OF` | Chemical is part of the protein complex | 885 | 257 |
| `AGONIST` | Chemical acts as agonist of the receptor/protein | 658 | 131 |
| `AGONIST-ACTIVATOR` | Chemical is both agonist and activator | 29 | 10 |
| `SUBSTRATE_PRODUCT-OF` | Chemical is both substrate and product | 24 | 3 |
| `AGONIST-INHIBITOR` | Chemical is agonist but inhibits downstream effects | 13 | 2 |
| **Total** | | **17,274** | **3,761** |

## Dataset Statistics

Entity counts and per-abstract averages are reported for the training split only (n/a = not reported).

| | Train | Validation | Total |
|--|------:|-----------:|------:|
| Abstracts | 3,500 | 750 | **4,250** |
| Abstracts with relations | 2,433 | n/a | n/a |
| Total entities | 89,529 | n/a | n/a |
|  • CHEMICAL | 46,274 (51.7%) | n/a | n/a |
|  • GENE-Y | 28,421 (31.7%) | n/a | n/a |
|  • GENE-N | 14,834 (16.6%) | n/a | n/a |
| Total relations | 17,274 | 3,761 | **21,035** |
| Avg. entities / abstract | 25.6 | n/a | n/a |
| Avg. relations / abstract | 4.9 | n/a | n/a |

## Usage Examples

### Relation Extraction

```python
from datasets import load_dataset

ds = load_dataset("OpenMed/drugprot-parquet", split="train")

for example in ds:
    # Index entities by ID so relation arguments can be resolved
    entities = {e["id"]: e for e in example["entities"]}
    for rel in example["relations"]:
        arg1 = entities[rel["arg1"]]
        arg2 = entities[rel["arg2"]]
        print(f"{arg1['text']} --[{rel['type']}]--> {arg2['text']}")
```

### Named Entity Recognition (NER)

```python
from datasets import load_dataset

ds = load_dataset("OpenMed/drugprot-parquet", split="train")

for example in ds:
    text = example["text"]
    for ent in example["entities"]:
        # Offsets index into the `text` field; `end` is exclusive
        span = text[ent["start"]:ent["end"]]
        assert span == ent["text"], f"Offset mismatch: '{span}' != '{ent['text']}'"
        print(f"[{ent['type']}] {ent['text']} ({ent['start']}:{ent['end']})")
```

### Convert to Token Classification Format

```python
from datasets import load_dataset

ds = load_dataset("OpenMed/drugprot-parquet", split="train")

# Build character-level BIO tags from character offsets
example = ds[0]
text = example["text"]
char_labels = ["O"] * len(text)

# Entities are processed in start order; if spans overlap,
# later entities overwrite the labels of earlier ones
for ent in sorted(example["entities"], key=lambda e: e["start"]):
    tag = ent["type"]
    char_labels[ent["start"]] = f"B-{tag}"
    for i in range(ent["start"] + 1, ent["end"]):
        char_labels[i] = f"I-{tag}"
```

## Source

This dataset is a Parquet conversion of the [DrugProt BioCreative VII corpus](https://biocreative.bioinformatics.udel.edu/tasks/biocreative-vii/track-1/). The original data was released under **CC BY 4.0** for the BioCreative VII shared task.

**Original paper:**

> M. Krallinger, O. Rabal, A. Lourenco, J. Oyarzabal, A. Valencia.
> *"Overview of the BioCreative VII Track 1 – DrugProt: Drug-Protein Relation Extraction."*
> Proceedings of the BioCreative VII Challenge Evaluation Workshop, 2021.

**BibTeX:**

```bibtex
@inproceedings{drugprot2021,
  title={Overview of the BioCreative VII Track 1 -- DrugProt: Drug-Protein Relation Extraction},
  author={Krallinger, Martin and Rabal, Obdulia and Lourenco, Analia and Oyarzabal, Julen and Valencia, Alfonso},
  booktitle={Proceedings of the BioCreative VII Challenge Evaluation Workshop},
  year={2021}
}
```

## License

CC BY 4.0, following the original DrugProt corpus license.

## About OpenMed

[OpenMed](https://huggingface.co/OpenMed) provides clean, standardized biomedical datasets and RL training environments for medical AI research.
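
## Token-Level BIO Tags (Sketch)

The "Convert to Token Classification Format" example produces one label per character, which still has to be mapped onto tokens before a token classifier can be trained. The block below is a minimal sketch of that step, assuming plain whitespace tokenization and `start`/`end` offsets that behave like Python slice bounds. The helper `bio_tags` and the sample record are hypothetical illustrations, not part of the dataset or of the `datasets` library. Overlapping entities are handled naively: an entity is skipped if any of its tokens is already tagged.

```python
import re


def bio_tags(text, entities):
    """Map entity character spans onto whitespace tokens as BIO tags.

    Hypothetical helper: `entities` is a list of dicts with the
    `type`, `start`, and `end` keys described in the entity schema.
    """
    # Each token keeps its character span so it can be compared with entities.
    tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    tags = ["O"] * len(tokens)

    # Earlier spans first; at equal starts, the longer span is tried first.
    for ent in sorted(entities, key=lambda e: (e["start"], -e["end"])):
        covered = [
            i
            for i, (_, tok_start, tok_end) in enumerate(tokens)
            if tok_start < ent["end"] and tok_end > ent["start"]
        ]
        # Skip entities that touch no token or collide with a tagged token.
        if not covered or any(tags[i] != "O" for i in covered):
            continue
        tags[covered[0]] = f"B-{ent['type']}"
        for i in covered[1:]:
            tags[i] = f"I-{ent['type']}"

    return [(tok, tag) for (tok, _, _), tag in zip(tokens, tags)]


# Hand-written sample record (not taken from the corpus).
sample_text = "Aspirin inhibits cyclooxygenase 2 activity."
sample_entities = [
    {"id": "T1", "type": "CHEMICAL", "text": "Aspirin", "start": 0, "end": 7},
    {"id": "T2", "type": "GENE-Y", "text": "cyclooxygenase 2", "start": 17, "end": 33},
]

tagged = bio_tags(sample_text, sample_entities)
for token, tag in tagged:
    print(f"{token}\t{tag}")
```

With a subword tokenizer, the same overlap test could be applied to the character offsets the tokenizer reports for each token instead of the whitespace spans used here.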
Provider: OpenMed