nirschl-lab/hpa10m
Source: Hugging Face. Updated 2026-02-12; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/nirschl-lab/hpa10m
Description:
---
license: cc-by-sa-4.0
configs:
- config_name: default
  data_files:
  - split: train
    path: "hpa10m_train/*.tar"
  - split: validation
    path: "hpa10m_validation/*.tar"
dataset_info:
  features:
  - name: __key__
    dtype: string
  - name: __url__
    dtype: string
  - name: jpg
    dtype: image
  - name: json
    struct:
    - name: comments
      sequence: string
    - name: custom_metadata
      struct:
      - name: area_fraction
        dtype: float64
      - name: area_px
        dtype: int64
      - name: bboxes
        sequence:
          sequence: int64
      - name: caption_1
        dtype: string
      - name: caption_2
        dtype: string
      - name: cell_type
        dtype: string
      - name: ensembl_id
        dtype: string
      - name: file_size_kb
        dtype: float64
      - name: gene
        dtype: string
      - name: generic_caption
        dtype: string
      - name: image_md5
        dtype: string
      - name: patient_age
        dtype: float64
      - name: patient_id
        dtype: int64
      - name: patient_sex
        dtype: string
      - name: rle_mask
        dtype: string
      - name: snomed_code
        dtype: string
      - name: snomed_text
        dtype: string
      - name: staining_intensity
        dtype: string
      - name: staining_location
        dtype: string
      - name: staining_quantity
        dtype: string
      - name: tissue
        dtype: string
      - name: uberon_id
        dtype: string
      - name: uniprot_id
        dtype: string
      - name: url
        dtype: string
    - name: instances
      sequence:
        struct:
        - name: attributes
          sequence: string
        - name: classId
          dtype: int64
        - name: className
          dtype: string
        - name: createdAt
          dtype: string
        - name: creationType
          dtype: string
        - name: error
          dtype: string
        - name: exclude
          sequence: string
        - name: groupId
          dtype: int64
        - name: id
          dtype: string
        - name: locked
          dtype: bool
        - name: pointLabels
          struct:
          - name: _placeholder
            dtype: string
        - name: points
          sequence: int64
        - name: probability
          dtype: int64
        - name: type
          dtype: string
        - name: updatedAt
          dtype: string
    - name: metadata
      struct:
      - name: annotatorEmail
        dtype: string
      - name: format
        dtype: string
      - name: height
        dtype: int64
      - name: isPredicted
        dtype: bool
      - name: name
        dtype: string
      - name: pinned
        dtype: bool
      - name: projectId
        dtype: string
      - name: qaEmail
        dtype: string
      - name: status
        dtype: string
      - name: width
        dtype: int64
    - name: tags
      sequence: string
---
# HPA10M Dataset
A large-scale immunohistochemistry (IHC) image dataset derived from the Human Protein Atlas (HPA, https://www.proteinatlas.org/), containing approximately **10.5 million** pathology and tissue images with detailed annotations.
## Dataset Overview
| Statistic | Value |
|-----------|-------|
| **Total Images** | 10,495,672 |
| **Training Set** | 10,493,672 images (10,497 tar files) |
| **Validation Set** | 2,000 images (1 tar file) |
| **Image Types** | Pathology (7,970,595) / Tissue (2,525,077) |
| **Format** | JPEG images + JSON metadata |
## Directory Structure
```
hpa10m/
├── README.md # This file
├── example_images/ # Sample images for preview
├── hpa10m_train/ # Training data (WebDataset tar files)
│ ├── hpa10m_train_0000.tar # Training shards (10,497 files)
│ ├── hpa10m_train_0001.tar
│ ├── ...
├── hpa10m_validation/ # Validation data
│ └── hpa10m_validation.tar # All validation samples (2,000 images)
└── hpa10m_tar_summary/ # Metadata index files
└── all.feather # Complete index of all images
```
## Data Format
### Tar Archives (WebDataset Format)
Each tar file contains paired `.jpg` and `.json` files organized by:
- **Image category**: `pathology/` or `tissue/`
- **Gene prefix**: Two-letter gene name prefix (e.g., `AB/`, `CD/`)
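Because every image travels with a same-named `.json` file, a shard can be read with the Python standard library alone by grouping members on their path without the extension. The following is a minimal sketch, not the authors' loader. It assumes member paths look like `<category>/<gene prefix>/<file>`; the names `iter_pairs`, `split_key`, and `make_demo_shard`, and the file `pathology/TE/demo.jpg`, are hypothetical and exist only for illustration.

```python
import io
import json
import tarfile
from pathlib import PurePosixPath


def iter_pairs(tar):
    """Yield (key, jpg_bytes, metadata_dict) for each .jpg/.json pair in a shard."""
    pending = {}  # key -> parts seen so far, waiting for the partner file
    for member in tar:
        if not member.isfile():
            continue
        path = PurePosixPath(member.name)
        ext = path.suffix.lower()
        if ext not in (".jpg", ".json"):
            continue
        key = str(path.with_suffix(""))
        pending.setdefault(key, {})[ext] = tar.extractfile(member).read()
        if len(pending[key]) == 2:
            parts = pending.pop(key)
            yield key, parts[".jpg"], json.loads(parts[".json"])


def split_key(key):
    """Split 'pathology/TE/name' into (category, gene_prefix, name).

    Assumes the <category>/<gene prefix>/<file> layout described above.
    """
    parts = key.split("/")
    if len(parts) < 3:
        raise ValueError(f"unexpected member path: {key!r}")
    return parts[0], parts[1], "/".join(parts[2:])


def make_demo_shard():
    """Demo only: build a tiny in-memory tar with one made-up pair."""
    files = {
        "pathology/TE/demo.jpg": b"\xff\xd8not a real image",
        "pathology/TE/demo.json": json.dumps(
            {"custom_metadata": {"gene": "TEKT3"}}
        ).encode(),
    }
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tf:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))
    return buf.getvalue()
```

For a real shard, replace the in-memory demo with something like `tarfile.open("hpa10m_validation/hpa10m_validation.tar")` and pass the opened archive to `iter_pairs`.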
### JSON Metadata Structure
Each image has a corresponding JSON file with rich annotations:
```json
{
"metadata": {
"height": 3000,
"width": 3000,
"name": "image_filename.jpg",
"format": ".jpg"
},
"custom_metadata": {
"gene": "TEKT3",
"ensembl_id": "ENSG00000125409",
"uniprot_id": "Q9BXF9",
"tissue": "skin cancer",
"cell_type": "Tumor cells",
"patient_id": 3354,
"patient_age": 92,
"patient_sex": "male",
"snomed_code": "M-80703;T-01000",
"snomed_text": "Squamous cell carcinoma, NOS;Skin",
"staining_intensity": "negative",
"staining_location": "none",
"staining_quantity": "none",
"generic_caption": "Immunohistochemical staining of human skin cancer...",
"caption_1": "Detailed caption describing the image...",
"caption_2": "Alternative caption...",
"url": "http://images.proteinatlas.org/...",
"bboxes": [[x, y, w, h], ...],
"rle_mask": "encoded_segmentation_mask",
"area_px": 3883806,
"area_fraction": 0.431534
}
}
```
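For tabular analysis it is often convenient to flatten the nested record into one row. The sketch below picks scalar fields from `metadata` and `custom_metadata` as shown in the example above; `flatten_record` and the chosen field list are illustrative assumptions, and missing keys simply become `None`.

```python
import json

# Scalar fields copied from "custom_metadata"; extend as needed.
CUSTOM_FIELDS = (
    "gene", "ensembl_id", "uniprot_id", "tissue", "cell_type",
    "patient_id", "patient_age", "patient_sex",
    "staining_intensity", "staining_location", "staining_quantity",
)


def flatten_record(raw_json):
    """Turn one JSON metadata document into a flat dict (one table row)."""
    record = json.loads(raw_json)
    meta = record.get("metadata") or {}
    custom = record.get("custom_metadata") or {}
    row = {
        "name": meta.get("name"),
        "height": meta.get("height"),
        "width": meta.get("width"),
    }
    for field in CUSTOM_FIELDS:
        row[field] = custom.get(field)
    return row
```

A list of such rows can be passed directly to `pandas.DataFrame(rows)`.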
### Index Files (Feather Format)
The `hpa10m_tar_summary/all.feather` file contains an index of all images with columns:
| Column | Description |
|--------|-------------|
| `tar_filename` | Source tar archive name |
| `split` | Dataset split (train/validation) |
| `name` | Full path within tar archive |
| `type` | Image type (pathology/tissue) |
| `img_offset` | Byte offset of image in tar |
| `img_size` | Image file size in bytes |
| `json_offset` | Byte offset of JSON in tar |
| `json_size` | JSON file size in bytes |
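The offset and size columns make it possible to fetch a single sample without scanning the whole shard. The sketch below assumes that `img_offset` and `json_offset` point at the first byte of each file's data (not at its tar header); this convention is not stated in the README, so the code checks for the JPEG start-of-image marker before trusting it. `read_span`, `read_sample`, and `make_demo_shard` are hypothetical helper names, and the demo shard is synthetic.

```python
import io
import json
import tarfile

JPEG_SOI = b"\xff\xd8"  # every JPEG file starts with these two bytes


def read_span(fileobj, offset, size):
    """Read exactly `size` bytes starting at absolute position `offset`."""
    fileobj.seek(offset)
    data = fileobj.read(size)
    if len(data) != size:
        raise EOFError(f"wanted {size} bytes at offset {offset}, got {len(data)}")
    return data


def read_sample(fileobj, row):
    """Fetch one image/JSON pair using a dict keyed like the index columns."""
    img = read_span(fileobj, row["img_offset"], row["img_size"])
    if not img.startswith(JPEG_SOI):
        # Assumption violated: the offset may point at the tar header instead.
        raise ValueError("offset does not point at JPEG data")
    meta = json.loads(read_span(fileobj, row["json_offset"], row["json_size"]))
    return img, meta


def make_demo_shard():
    """Demo only: build a tiny in-memory tar and a matching index row."""
    files = {
        "tissue/TE/demo.jpg": JPEG_SOI + b"not a real image",
        "tissue/TE/demo.json": json.dumps(
            {"custom_metadata": {"gene": "TEKT3"}}
        ).encode(),
    }
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tf:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))
    buf.seek(0)
    row = {}
    with tarfile.open(fileobj=buf, mode="r") as tf:
        for member in tf:
            prefix = "img" if member.name.endswith(".jpg") else "json"
            row[f"{prefix}_offset"] = member.offset_data  # start of file data
            row[f"{prefix}_size"] = member.size
    buf.seek(0)
    return buf, row
```

With the real index, `row` would be one record of `all.feather` and `fileobj` the shard named in `tar_filename`, opened in binary mode.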
## Key Annotations
### Clinical Information
- `gene`: Gene name (e.g., "TEKT3")
- `ensembl_id`: Ensembl gene ID (e.g., "ENSG00000125409")
- `uniprot_id`: UniProt protein ID (e.g., "Q9BXF9")
- `tissue`: Tissue or cancer type (e.g., "skin cancer")
- `uberon_id`: UBERON ontology ID
- `cell_type`: Cell type (e.g., "Tumor cells")
- `patient_id`: Patient identifier
- `patient_age`: Patient age
- `patient_sex`: Patient sex ("male" / "female")
- `snomed_code`: SNOMED-CT code (e.g., "M-80703;T-01000")
- `snomed_text`: SNOMED-CT description (e.g., "Squamous cell carcinoma, NOS;Skin")
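In the example values, `snomed_code` and `snomed_text` each hold several entries separated by semicolons, in matching order. Assuming that convention holds across the dataset (the README shows only one example), the two fields can be zipped into code/description pairs. `pair_snomed` is a hypothetical helper name.

```python
def pair_snomed(code_field, text_field):
    """Pair each SNOMED code with its description.

    Assumes both fields are ';'-separated lists in the same order.
    """
    codes = [c.strip() for c in code_field.split(";")]
    texts = [t.strip() for t in text_field.split(";")]
    if len(codes) != len(texts):
        raise ValueError(
            f"{len(codes)} codes but {len(texts)} descriptions; "
            "the fields are not aligned"
        )
    return list(zip(codes, texts))


pairs = pair_snomed("M-80703;T-01000", "Squamous cell carcinoma, NOS;Skin")
# pairs == [("M-80703", "Squamous cell carcinoma, NOS"), ("T-01000", "Skin")]
```

Note that the split is on `;` only, so commas inside a description (as in "Squamous cell carcinoma, NOS") are preserved.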
### Staining Characteristics
- `staining_intensity`: "negative", "weak", "moderate", "strong"
- `staining_location`: "nuclear", "cytoplasmic/membranous", "cytoplasmic/membranous,nuclear", "none"
- `staining_quantity`: "none", "<25%", "25-75%", ">75%"
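The intensity and quantity labels have a natural order, so they can be mapped to integers for modeling or sorting. This is a sketch that assumes the values listed above are exhaustive and ordered from least to most; unknown labels map to `None` rather than raising. The names `encode_ordinal` and `split_locations` are illustrative.

```python
INTENSITY_ORDER = ["negative", "weak", "moderate", "strong"]
QUANTITY_ORDER = ["none", "<25%", "25-75%", ">75%"]


def encode_ordinal(value, order):
    """Return the rank of `value` within `order`, or None if it is not listed."""
    try:
        return order.index(value)
    except ValueError:
        return None


def split_locations(value):
    """Split a comma-separated staining_location into a set; 'none' gives an empty set."""
    if value is None or value == "none":
        return set()
    return {part.strip() for part in value.split(",")}
```

For example, `encode_ordinal("moderate", INTENSITY_ORDER)` gives `2`, and `split_locations("cytoplasmic/membranous,nuclear")` gives a set with the two locations.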
### Segmentation Data
- `bboxes`: Bounding boxes in `[[x, y, width, height], ...]` format
- `rle_mask`: Run-length-encoded (RLE) segmentation mask, stored as a string
- `area_px`: Segmented area in pixels
- `area_fraction`: Fraction of image covered by segmentation
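Two small conversions are commonly needed with these fields. The sketch below converts an `[x, y, width, height]` box to corner form and recomputes `area_fraction` from `area_px` and the image size. It assumes `(x, y)` is the top-left corner and that the fraction is taken over the full image area; the example record (3,883,806 px in a 3000 x 3000 image, fraction 0.431534) is consistent with the latter. The function names are illustrative.

```python
def xywh_to_xyxy(box):
    """Convert [x, y, width, height] to [x_min, y_min, x_max, y_max]."""
    x, y, w, h = box
    return [x, y, x + w, y + h]


def area_fraction(area_px, height, width):
    """Fraction of the image covered by the segmented area."""
    return area_px / (height * width)


# Check against the example record shown in the JSON section.
fraction = area_fraction(3883806, 3000, 3000)
assert round(fraction, 6) == 0.431534
```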
### Natural Language Captions
- `generic_caption`: Standardized description
- `caption_1`: Detailed scientific description
- `caption_2`: Alternative description
### Other Metadata
- `url`: Original image URL from Human Protein Atlas
- `image_md5`: MD5 hash of original image
- `file_size_kb`: Image file size in KB
## Usage
### Loading Index with Pandas
```python
import pandas as pd
# Load complete index
df = pd.read_feather("hpa10m_tar_summary/all.feather")
# Filter by split
train_df = df[df["split"] == "train"]
val_df = df[df["split"] == "validation"]
# Filter by image type
pathology_df = df[df["type"] == "pathology"]
tissue_df = df[df["type"] == "tissue"]
```
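The shards can also be streamed without downloading the full dataset. The sketch below uses the Hugging Face `datasets` library (a third-party dependency, imported lazily) with the repository ID and the `validation` split declared in the dataset card metadata. It assumes the repository is publicly readable and that each sample exposes the `__key__`, `jpg`, and `json` features listed in that metadata; `preview_validation` is a hypothetical helper name.

```python
from itertools import islice


def preview_validation(n=3, repo_id="nirschl-lab/hpa10m"):
    """Stream the first `n` validation samples and return a few fields from each."""
    from datasets import load_dataset  # third-party: pip install datasets

    ds = load_dataset(repo_id, split="validation", streaming=True)
    rows = []
    for sample in islice(ds, n):
        custom = sample["json"]["custom_metadata"]
        rows.append(
            (
                sample["__key__"],
                custom["gene"],
                custom["tissue"],
                custom["staining_intensity"],
            )
        )
    return rows
```

Calling `preview_validation()` requires network access; the decoded image for each sample is available under `sample["jpg"]`.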
## Data Source
This dataset is derived from the **Human Protein Atlas** (https://www.proteinatlas.org/), a comprehensive resource for protein expression in human tissues and cancers.
## License
The dataset card metadata lists the license as `cc-by-sa-4.0`. Please also refer to the Human Protein Atlas data usage terms at https://www.proteinatlas.org/about/licence for licensing information.
## 📧 Contact
For questions or suggestions, please contact: [jjnirschl@wisc.edu](mailto:jjnirschl@wisc.edu) or [zhi.huang@pennmedicine.upenn.edu](mailto:zhi.huang@pennmedicine.upenn.edu)
Provider: nirschl-lab



