
Tellurio/PubMed-MultiLabel-MeSH

Source: Hugging Face. Updated 2026-03-29; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/Tellurio/PubMed-MultiLabel-MeSH
Resource description:
---
license: cc0-1.0
task_categories:
- text-classification
task_ids:
- multi-label-classification
tags:
- extreme-multi-label
- pubmed
- mesh
- biomedical
- nlp
language:
- en
pretty_name: "PubMed MultiLabel Text Classification (MeSH)"
size_categories:
- 10K<n<100K
---

# PubMed MultiLabel Text Classification (MeSH)

A dataset of **50,000 PubMed biomedical articles**, each manually annotated by domain experts with **MeSH (Medical Subject Headings)** labels. With **21,918 unique labels** and a mean of ~12.7 labels per document, this is a densely-labeled extreme multi-label classification benchmark.

## Dataset Description

| Property | Value |
|---|---|
| Train examples | 40,000 |
| Test examples | 10,000 |
| Total unique MeSH labels | 21,918 |
| Mean labels per document | ~12.7 |
| Median labels per document | 12 |
| Max labels per document | 46 |

### Label distribution

| Docs per label | # Labels | % of total |
|---|---|---|
| 1 | 3,990 | 18.2% |
| 2–5 | 7,020 | 32.0% |
| 6–10 | 3,412 | 15.6% |
| 11–50 | 5,518 | 25.2% |
| 51–100 | 1,068 | 4.9% |
| 101+ | 910 | 4.2% |

### MeSH Root Categories

Each label belongs to one or more MeSH root categories. The dataset includes binary indicator columns for the 14 root categories:

| Code | Root Category |
|---|---|
| A | Anatomy |
| B | Organisms |
| C | Diseases |
| D | Chemicals and Drugs |
| E | Analytical, Diagnostic and Therapeutic Techniques, and Equipment |
| F | Psychiatry and Psychology |
| G | Phenomena and Processes |
| H | Disciplines and Occupations |
| I | Anthropology, Education, Sociology, and Social Phenomena |
| J | Technology, Industry, and Agriculture |
| L | Information Science |
| M | Named Groups |
| N | Health Care |
| Z | Geographicals |

## Fields

| Field | Type | Description |
|---|---|---|
| `pmid` | string | PubMed article ID |
| `title` | string | Article title |
| `abstract` | string | Article abstract text |
| `label_ids` | list[int] | MeSH label indices (into the 21,918-label vocabulary) |
| `label_names` | list[string] | Human-readable MeSH label names |
| `mesh_roots` | dict | Binary flags `{"A": 0/1, ..., "Z": 0/1}` for root categories |

## Additional files

- **`label_vocab.json`**: ordered list of all 21,918 MeSH label names (index = label ID)
- **`label_metadata.jsonl`**: full label metadata including MeSH tree IDs and root categories for hierarchical classification research

## Splits

An 80/20 random split with seed 42 (no predefined split exists in the original data).

## Usage

```python
from datasets import load_dataset

ds = load_dataset("Tellurio/PubMed-MultiLabel-MeSH")

example = ds["train"][0]
print(example["title"])
print(example["label_names"])  # e.g. ["Humans", "Female", "DNA Probes, HPV", ...]
print(example["label_ids"])    # e.g. [5, 2, 0, ...]
print(example["mesh_roots"])   # e.g. {"A": 0, "B": 1, "C": 1, ...}
```

### Loading label metadata for hierarchical / zero-shot approaches

Each of the 21,918 MeSH labels has associated tree IDs and root categories stored in `label_metadata.jsonl`.

```python
import json
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="Tellurio/PubMed-MultiLabel-MeSH",
    filename="label_metadata.jsonl",
    repo_type="dataset",
)

labels = []
with open(path) as f:
    for line in f:
        labels.append(json.loads(line))

# Example label entry
print(labels[0])
# {"id": 0, "label": "DNA Probes, HPV", "mesh_tree_ids": ["D13.444...", ...], "mesh_roots": ["Chemicals and Drugs [D]"]}
```

## Source

Originally from Kaggle: [PubMed MultiLabel Text Classification Dataset MeSH](https://www.kaggle.com/datasets/owaiskhan9654/pubmed-multilabel-text-classification) by Owais Ahmad.

## Citation

```bibtex
@misc{pubmed_multilabel_mesh,
  author = {Owais Ahmad},
  title = {PubMed MultiLabel Text Classification Dataset MeSH},
  year = {2022},
  publisher = {Kaggle},
  url = {https://www.kaggle.com/datasets/owaiskhan9654/pubmed-multilabel-text-classification}
}
```

## License

CC0: Public Domain
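## Example: working with `label_ids`

The `label_ids` field and the label distribution table in this card suggest two common preprocessing steps: turning each document's label indices into a multi-hot target vector, and counting how many labels fall into each docs-per-label bucket. The following is a minimal sketch of both, using only the standard library. The function names, the `toy_rows` data, and the ASCII bucket names are illustrative assumptions and are not part of the dataset or its tooling; only the vocabulary size (21,918) and the bucket boundaries come from the card.

```python
NUM_LABELS = 21_918  # vocabulary size stated in the dataset card

# Bucket names mirror the "Docs per label" column of the card's table.
BUCKETS = ["1", "2-5", "6-10", "11-50", "51-100", "101+"]


def to_multi_hot(label_ids, num_labels=NUM_LABELS):
    """Return a dense 0/1 list with a 1 at every index in label_ids."""
    vec = [0] * num_labels
    for idx in label_ids:
        vec[idx] = 1
    return vec


def bucket_name(doc_count):
    """Map a label's document count to its bucket name."""
    if doc_count == 1:
        return "1"
    if doc_count <= 5:
        return "2-5"
    if doc_count <= 10:
        return "6-10"
    if doc_count <= 50:
        return "11-50"
    if doc_count <= 100:
        return "51-100"
    return "101+"


def label_frequency_buckets(rows):
    """Count how many distinct labels fall into each docs-per-label bucket."""
    docs_per_label = {}
    for row in rows:
        # set() ensures a label is counted at most once per document.
        for idx in set(row["label_ids"]):
            docs_per_label[idx] = docs_per_label.get(idx, 0) + 1
    counts = {name: 0 for name in BUCKETS}
    for doc_count in docs_per_label.values():
        counts[bucket_name(doc_count)] += 1
    return counts


# Hypothetical toy rows shaped like dataset examples (only label_ids is used).
toy_rows = [
    {"label_ids": [0, 1, 2]},
    {"label_ids": [0, 1]},
    {"label_ids": [0, 3]},
]

print(to_multi_hot([0, 2], num_labels=4))
print(label_frequency_buckets(toy_rows))
```

With the real data, `rows` could be `ds["train"]` from the Usage section. A dense list of 21,918 entries per document is wasteful when the mean is ~12.7 labels, so a sparse representation (for example, keeping the index lists as they are) is likely the better choice at scale.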
Provider: Tellurio