nirschl-lab/hpa10m

Source: Hugging Face. Updated 2026-02-12; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/nirschl-lab/hpa10m
Description:
---
license: cc-by-sa-4.0
configs:
- config_name: default
  data_files:
  - split: train
    path: "hpa10m_train/*.tar"
  - split: validation
    path: "hpa10m_validation/*.tar"
dataset_info:
  features:
  - name: __key__
    dtype: string
  - name: __url__
    dtype: string
  - name: jpg
    dtype: image
  - name: json
    struct:
    - name: comments
      sequence: string
    - name: custom_metadata
      struct:
      - name: area_fraction
        dtype: float64
      - name: area_px
        dtype: int64
      - name: bboxes
        sequence:
          sequence: int64
      - name: caption_1
        dtype: string
      - name: caption_2
        dtype: string
      - name: cell_type
        dtype: string
      - name: ensembl_id
        dtype: string
      - name: file_size_kb
        dtype: float64
      - name: gene
        dtype: string
      - name: generic_caption
        dtype: string
      - name: image_md5
        dtype: string
      - name: patient_age
        dtype: float64
      - name: patient_id
        dtype: int64
      - name: patient_sex
        dtype: string
      - name: rle_mask
        dtype: string
      - name: snomed_code
        dtype: string
      - name: snomed_text
        dtype: string
      - name: staining_intensity
        dtype: string
      - name: staining_location
        dtype: string
      - name: staining_quantity
        dtype: string
      - name: tissue
        dtype: string
      - name: uberon_id
        dtype: string
      - name: uniprot_id
        dtype: string
      - name: url
        dtype: string
    - name: instances
      sequence:
        struct:
        - name: attributes
          sequence: string
        - name: classId
          dtype: int64
        - name: className
          dtype: string
        - name: createdAt
          dtype: string
        - name: creationType
          dtype: string
        - name: error
          dtype: string
        - name: exclude
          sequence: string
        - name: groupId
          dtype: int64
        - name: id
          dtype: string
        - name: locked
          dtype: bool
        - name: pointLabels
          struct:
          - name: _placeholder
            dtype: string
        - name: points
          sequence: int64
        - name: probability
          dtype: int64
        - name: type
          dtype: string
        - name: updatedAt
          dtype: string
    - name: metadata
      struct:
      - name: annotatorEmail
        dtype: string
      - name: format
        dtype: string
      - name: height
        dtype: int64
      - name: isPredicted
        dtype: bool
      - name: name
        dtype: string
      - name: pinned
        dtype: bool
      - name: projectId
        dtype: string
      - name: qaEmail
        dtype: string
      - name: status
        dtype: string
      - name: width
        dtype: int64
    - name: tags
      sequence: string
---

# HPA10M Dataset

A large-scale immunohistochemistry (IHC) image dataset derived from the Human Protein Atlas (HPA, https://www.proteinatlas.org/), containing approximately **10.5 million** pathology and tissue images with detailed annotations.

## Dataset Overview

| Statistic | Value |
|-----------|-------|
| **Total Images** | 10,495,672 |
| **Training Set** | 10,493,672 images (10,497 tar files) |
| **Validation Set** | 2,000 images (1 tar file) |
| **Image Types** | Pathology (7,970,595) / Tissue (2,525,077) |
| **Format** | JPEG images + JSON metadata |

## Directory Structure

```
hpa10m/
├── README.md                      # This file
├── example_images/                # Sample images for preview
├── hpa10m_train/                  # Training data (WebDataset tar files)
│   ├── hpa10m_train_0000.tar      # Training shards (10,497 files)
│   ├── hpa10m_train_0001.tar
│   ├── ...
├── hpa10m_validation/             # Validation data
│   └── hpa10m_validation.tar      # All validation samples (2,000 images)
└── hpa10m_tar_summary/            # Metadata index files
    └── all.feather                # Complete index of all images
```

## Data Format

### Tar Archives (WebDataset Format)

Each tar file contains paired `.jpg` and `.json` files organized by:

- **Image category**: `pathology/` or `tissue/`
- **Gene prefix**: Two-letter gene name prefix (e.g., `AB/`, `CD/`)

### JSON Metadata Structure

Each image has a corresponding JSON file with rich annotations:

```json
{
  "metadata": {
    "height": 3000,
    "width": 3000,
    "name": "image_filename.jpg",
    "format": ".jpg"
  },
  "custom_metadata": {
    "gene": "TEKT3",
    "ensembl_id": "ENSG00000125409",
    "uniprot_id": "Q9BXF9",
    "tissue": "skin cancer",
    "cell_type": "Tumor cells",
    "patient_id": 3354,
    "patient_age": 92,
    "patient_sex": "male",
    "snomed_code": "M-80703;T-01000",
    "snomed_text": "Squamous cell carcinoma, NOS;Skin",
    "staining_intensity": "negative",
    "staining_location": "none",
    "staining_quantity": "none",
    "generic_caption": "Immunohistochemical staining of human skin cancer...",
    "caption_1": "Detailed caption describing the image...",
    "caption_2": "Alternative caption...",
    "url": "http://images.proteinatlas.org/...",
    "bboxes": [[x, y, w, h], ...],
    "rle_mask": "encoded_segmentation_mask",
    "area_px": 3883806,
    "area_fraction": 0.431534
  }
}
```

### Index Files (Feather Format)

The `hpa10m_tar_summary/all.feather` file contains an index of all images with columns:

| Column | Description |
|--------|-------------|
| `tar_filename` | Source tar archive name |
| `split` | Dataset split (train/validation) |
| `name` | Full path within tar archive |
| `type` | Image type (pathology/tissue) |
| `img_offset` | Byte offset of image in tar |
| `img_size` | Image file size in bytes |
| `json_offset` | Byte offset of JSON in tar |
| `json_size` | JSON file size in bytes |

## Key Annotations

### Clinical Information

- `gene`: Gene name (e.g., "TEKT3")
- `ensembl_id`: Ensembl gene ID (e.g., "ENSG00000125409")
- `uniprot_id`: UniProt protein ID (e.g., "Q9BXF9")
- `tissue`: Tissue or cancer type (e.g., "skin cancer")
- `uberon_id`: UBERON ontology ID
- `cell_type`: Cell type (e.g., "Tumor cells")
- `patient_id`: Patient identifier
- `patient_age`: Patient age
- `patient_sex`: Patient sex ("male" / "female")
- `snomed_code`: SNOMED-CT code (e.g., "M-80703;T-01000")
- `snomed_text`: SNOMED-CT description (e.g., "Squamous cell carcinoma, NOS;Skin")

### Staining Characteristics

- `staining_intensity`: "negative", "weak", "moderate", "strong"
- `staining_location`: "nuclear", "cytoplasmic/membranous", "cytoplasmic/membranous,nuclear", "none"
- `staining_quantity`: "none", "<25%", "25-75%", ">75%"

### Segmentation Data

- `bboxes`: Bounding boxes in `[[x, y, width, height], ...]` format
- `rle_mask`: Segmentation mask
- `area_px`: Segmented area in pixels
- `area_fraction`: Fraction of image covered by segmentation

### Natural Language Captions

- `generic_caption`: Standardized description
- `caption_1`: Detailed scientific description
- `caption_2`: Alternative description

### Other Metadata

- `url`: Original image URL from Human Protein Atlas
- `image_md5`: MD5 hash of original image
- `file_size_kb`: Image file size in KB

## Usage

### Loading Index with Pandas

```python
import pandas as pd

# Load complete index
df = pd.read_feather("hpa10m_tar_summary/all.feather")

# Filter by split
train_df = df[df["split"] == "train"]
val_df = df[df["split"] == "validation"]

# Filter by image type
pathology_df = df[df["type"] == "pathology"]
tissue_df = df[df["type"] == "tissue"]
```

## Data Source

This dataset is derived from the **Human Protein Atlas** (https://www.proteinatlas.org/), a comprehensive resource for protein expression in human tissues and cancers.

## License

Please refer to the Human Protein Atlas data usage terms at https://www.proteinatlas.org/about/licence for licensing information.

## 📧 Contact

For questions or suggestions, please contact: [jjnirschl@wisc.edu](mailto:jjnirschl@wisc.edu) or [zhi.huang@pennmedicine.upenn.edu](mailto:zhi.huang@pennmedicine.upenn.edu)
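
## Example: Reading One Sample via the Index

The index columns `img_offset`, `img_size`, `json_offset`, and `json_size` suggest that a single image and its metadata can be fetched by seeking into a shard instead of iterating over it. The following is a minimal sketch of that idea, not the authors' loader. It assumes that each offset points at the start of the member's payload (not at its tar header) and that the shards are uncompressed tar files; neither is stated in the README. The function names `read_span`, `load_sample`, and `payload_offsets` are illustrative only.

```python
import json
import tarfile


def read_span(path, offset, size):
    """Return exactly `size` bytes starting at byte `offset` of the file at `path`."""
    with open(path, "rb") as fh:
        fh.seek(offset)
        data = fh.read(size)
    if len(data) != size:
        raise ValueError(f"expected {size} bytes at offset {offset}, got {len(data)}")
    return data


def load_sample(tar_path, img_offset, img_size, json_offset, json_size):
    """Fetch one (jpeg_bytes, metadata_dict) pair without scanning the archive.

    ASSUMPTION: each offset is the position of the member's payload
    (not its tar header) inside an uncompressed tar file.
    """
    jpeg_bytes = read_span(tar_path, img_offset, img_size)
    metadata = json.loads(read_span(tar_path, json_offset, json_size))
    return jpeg_bytes, metadata


def payload_offsets(tar_path):
    """Map member name -> (payload offset, size) using only the standard library.

    Handy for checking the assumption above against a downloaded shard.
    """
    with tarfile.open(tar_path, "r:") as tar:
        return {m.name: (m.offset_data, m.size) for m in tar if m.isfile()}
```

With the index loaded as `df`, the four offset and size values of a row can be passed to `load_sample` together with the local path of the shard named in `tar_filename`. If the values returned by `payload_offsets` for a downloaded shard do not match the index, the payload assumption does not hold and the index offsets refer to something else, such as the member headers.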
Provider:
nirschl-lab