transhumanist-already-exists/aida-asian-pbmc-cell-age-related-cell-sentence-balanced-120k

Name: transhumanist-already-exists/aida-asian-pbmc-cell-age-related-cell-sentence-balanced-120k
Creator: transhumanist-already-exists
Published: 2025-11-18 13:04:44
License: No description available

Source: Hugging Face. Updated: 2025-11-18. Indexed: 2025-12-20.

Download link:

https://hf-mirror.com/datasets/transhumanist-already-exists/aida-asian-pbmc-cell-age-related-cell-sentence-balanced-120k

Description:

# Cell2Sentence for Longevity - Balanced Dataset (120k)

This dataset is a carefully curated and balanced subset of single-cell transcriptomics data, designed for training foundation models to understand cellular aging and longevity patterns.

## Data Sources

### Primary Data Source

- **CZI CellxGene Collection**: [Tabula Sapiens - A multiple-organ, single-cell transcriptomic atlas of humans](https://cellxgene.cziscience.com/collections/ced320a1-29f3-47c1-a735-513c7084d508)
  - Original collection provides comprehensive single-cell RNA-seq data across multiple human organs and donors

### Gene Annotations

The dataset includes specialized gene sentence embeddings from two authoritative longevity databases:

1. **gene_sentence_opengenes**: Top age-related genes ranked by expression from [Open Genes Database](https://open-genes.com/)
   - Curated collection of genes associated with aging and longevity
   - Expression-based rankings for age-related biological processes
2. **gene_sentence_human_genage**: Human aging genes from [GenAge Database](https://genomics.senescence.info/genes/human.html)
   - The Human Ageing Genomic Resources (HAGR) collection
   - Genes potentially associated with human aging

## Dataset Statistics

- **Total samples**: 119,792
- **Total donors**: 625
- **Train samples**: 95,846 (500 donors, 80.0%)
- **Test samples**: 23,946 (125 donors, 20.0%)
- **Unique cell types**: 32
- **Samples per donor**: 192 (median), 191.7 (mean)

### Train/Test Split

- **Train**: 80% of donors (500 donors)
- **Test**: 20% of donors (125 donors)
- **No donor overlap**: Complete separation between train and test sets

## Data Preparation Pipeline

### 1. Diversity-Preserving Donor Split

We created a stratified 80/20 train/test split at the donor level to ensure proportional representation of demographic and clinical characteristics:

**Stratification factors**:

- Age (binned into groups)
- Disease status
- Sex
- Self-reported ethnicity
- Smoking status

This approach ensures that the test set is representative of the overall donor population distribution.

**Stratification results**:

- **Total unique strata**: 63
- Strata with only 1 donor: 11
- Strata with 2-5 donors: 8
- Strata with 6+ donors: 44

### 2. Balanced Sampling Strategy

To create a balanced dataset with equal donor representation:

**Target**: 120,000 samples total (192 samples per donor × 625 donors)

**Sampling algorithm**:

1. Each donor contributes exactly 192 samples (or all available if fewer)
2. Samples are distributed proportionally across cell types for each donor
3. Two-pass redistribution algorithm:
   - **First pass**: Allocate target samples to each cell type (192 ÷ number of cell types)
   - **Second pass**: If any cell type has insufficient samples, redistribute the shortfall to other cell types from the same donor

**Results**:

- 623 out of 625 donors achieve exactly 192 samples
- 2 donors have fewer samples due to limited data availability in source:
  - JP_RIK_H007: 71 samples (only 71 cells available in original data)
  - SG_HEL_H001: 159 samples (only 159 cells available in original data)

### 3. Technical Implementation

**Parallel processing**:

- All operations use 64-worker parallel processing for efficiency
- HuggingFace `datasets` library with `num_proc=64` for fast data loading
- Efficient shuffling using HF datasets `.shuffle()` method

**Reproducibility**:

- Random seed: 42 for train set
- Random seed: 1042 for test set
- All sampling operations are deterministic

## Demographic Distribution Analysis

The stratified split maintains excellent proportional representation across all demographic features:

### Age Distribution

| Age Bin | Train % | Test % | Difference |
|---------|---------|--------|------------|
| 18-29 | 19.60% | 20.00% | +0.40% |
| 30-39 | 29.00% | 30.40% | +1.40% |
| 40-49 | 27.20% | 28.00% | +0.80% |
| 50-59 | 14.60% | 12.80% | -1.80% |
| 60-69 | 8.60% | 8.00% | -0.60% |
| 70+ | 1.00% | 0.80% | -0.20% |

**Maximum difference:** ±1.80%

### Sex Distribution

| Sex | Train % | Test % | Difference |
|--------|---------|--------|------------|
| Female | 55.40% | 58.40% | +3.00% |
| Male | 44.60% | 41.60% | -3.00% |

**Maximum difference:** ±3.00%

### Ethnicity Distribution

| Ethnicity | Train % | Test % | Difference |
|---------------------|---------|--------|------------|
| Indian | 4.80% | 4.80% | +0.00% |
| Japanese | 24.00% | 23.20% | -0.80% |
| Korean | 26.20% | 27.20% | +1.00% |
| Singaporean Chinese | 13.60% | 13.60% | +0.00% |
| Singaporean Indian | 11.20% | 11.20% | +0.00% |
| Singaporean Malay | 9.60% | 10.40% | +0.80% |
| Thai | 9.80% | 8.00% | -1.80% |
| Unknown | 0.80% | 1.60% | +0.80% |

**Maximum difference:** ±1.80%

### Disease Distribution

| Disease | Train % | Test % | Difference |
|---------|---------|--------|------------|
| Normal | 100.00% | 100.00% | +0.00% |

All donors in the dataset have normal disease status.
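The Train %, Test %, and Difference columns in the tables above can be recomputed from donor-level labels. The following is a minimal sketch, not the authors' validation script: the function name `share_table` is hypothetical, and the toy label lists are donor counts implied by the Sex Distribution percentages (277 of 500 train donors and 73 of 125 test donors female).

```python
from collections import Counter


def share_table(train_labels, test_labels):
    """Return (category, train %, test %, test minus train) rows, sorted by category."""
    train_counts, test_counts = Counter(train_labels), Counter(test_labels)
    n_train, n_test = len(train_labels), len(test_labels)
    rows = []
    for category in sorted(set(train_counts) | set(test_counts)):
        train_pct = 100.0 * train_counts[category] / n_train
        test_pct = 100.0 * test_counts[category] / n_test
        rows.append(
            (category, round(train_pct, 2), round(test_pct, 2),
             round(test_pct - train_pct, 2))
        )
    return rows


# Toy donor-level labels (one entry per donor), an assumption derived from
# the percentages in the Sex Distribution table.
train_sex = ["Female"] * 277 + ["Male"] * 223
test_sex = ["Female"] * 73 + ["Male"] * 52

for row in share_table(train_sex, test_sex):
    print(row)
```

The same function applies to any categorical donor attribute (age bin, ethnicity, smoking status), provided each donor is counted once rather than once per cell.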
### Smoking Status Distribution

- **Train**: 71.6% non-smokers, 23.0% smokers, 5.4% unknown
- **Test**: 67.2% non-smokers, 29.6% smokers, 3.2% unknown
- **Maximum difference**: ±6.6% (smokers; acceptable variation for stratified sampling)

### Distribution Quality Assessment

The stratified split keeps train and test proportions close for most features:

- **Largest differences:** ±6.6% (smoking status) and ±3.0% (sex distribution)
- **Age and ethnicity:** Within ±2% for every category
- **Perfect matches (0% difference):**
  - Indian ethnicity
  - Singaporean Chinese ethnicity
  - Singaporean Indian ethnicity
  - Disease status

The differences observed are within acceptable ranges for stratified sampling and are largely due to:

1. Strata with very few donors (1-5 donors)
2. Rounding effects from the 80/20 split
3. Random assignment of single-donor strata

Deviations of this size (3.0% for sex, 6.6% for smoking status) are small enough for machine learning applications that evaluation on the test set should be representative of the overall donor population.

## Cell Type Distribution

The dataset includes 32 unique cell types spanning multiple organ systems:

- Immune cells (T cells, B cells, NK cells, monocytes, dendritic cells)
- Blood cells (erythrocytes, platelets)
- Tissue-specific cells (hepatocytes, epithelial cells, endothelial cells)
- And more...
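Because cell types differ in abundance within a donor, the per-donor quota of 192 cannot always be split evenly, which is what the two-pass redistribution under Balanced Sampling Strategy addresses. The sketch below illustrates that idea under stated assumptions and is not the authors' `create_balanced_dataset_hf.py`: the function name `allocate_quota`, the handling of the division remainder, and the one-cell-at-a-time redistribution order are all hypothetical choices.

```python
def allocate_quota(available, quota=192):
    """Split one donor's sample quota across cell types.

    available: dict mapping cell type -> number of cells the donor has.
    Returns a dict mapping cell type -> number of cells to sample.
    """
    # Donors with no more cells than the quota contribute everything they have.
    if sum(available.values()) <= quota:
        return dict(available)

    cell_types = sorted(available)
    base, extra = divmod(quota, len(cell_types))

    # First pass: equal target per cell type, capped by availability.
    # Giving the remainder to the first `extra` types is an assumption.
    allocation = {}
    for i, cell_type in enumerate(cell_types):
        target = base + (1 if i < extra else 0)
        allocation[cell_type] = min(target, available[cell_type])

    # Second pass: hand the shortfall, one cell at a time, to cell types
    # of the same donor that still have unsampled cells.
    shortfall = quota - sum(allocation.values())
    while shortfall > 0:
        spare = [ct for ct in cell_types if available[ct] > allocation[ct]]
        for cell_type in spare:
            if shortfall == 0:
                break
            allocation[cell_type] += 1
            shortfall -= 1
    return allocation


# Hypothetical donor with one abundant and two rare cell types.
print(allocate_quota({"B cell": 10, "NK cell": 40, "T cell": 500}))
```

In this toy case the equal target is 64 per cell type; the two rare types fall short by 78 cells in total, and the abundant type absorbs that shortfall so the donor still reaches 192.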
## Donor Diversity

The dataset represents diverse demographic and clinical characteristics:

- **Age range**: Multiple age groups from young adults to elderly (18-70+)
- **Geographic diversity**: Multiple countries and ethnicities (Indian, Japanese, Korean, Singaporean Chinese/Indian/Malay, Thai)
- **Health status**: All healthy donors (normal disease status)
- **Smoking status**: Both smokers and non-smokers are represented

## Data Format

Each sample contains:

- **Gene expression data**: Transcriptomic profiles
- **Cell metadata**: Cell type, donor ID, tissue information
- **Donor metadata**: Age, sex, disease, ethnicity, smoking status
- **Gene annotations**: Longevity-related gene sentence embeddings

## Use Cases

This dataset is designed for:

- Training foundation models for cellular aging
- Understanding cell-type-specific aging patterns
- Predicting longevity biomarkers
- Cross-donor generalization studies
- Age-related disease research

## Citation

If you use this dataset, please cite:

1. **Original data source**:
   - The Tabula Sapiens Consortium. (2022). The Tabula Sapiens: A multiple-organ, single-cell transcriptomic atlas of humans. Science.
   - CellxGene Collection: https://cellxgene.cziscience.com/collections/ced320a1-29f3-47c1-a735-513c7084d508
2. **Gene annotations**:
   - Open Genes Database: https://open-genes.com/
   - GenAge Database: Tacutu, R., et al. (2018). Human Ageing Genomic Resources: new and updated databases. Nucleic Acids Research.
## Files

- `train.parquet` (16GB): Training dataset with 95,846 samples
- `test.parquet` (3.9GB): Test dataset with 23,946 samples
- `stats.json`: Detailed dataset statistics

## Preprocessing Scripts

The complete data preparation pipeline is available in our repository, including:

- `create_diversity_split.py`: Stratified donor split generation
- `create_balanced_dataset_hf.py`: Balanced sampling with redistribution
- `check_smoking_distribution.py`: Demographic validation
- `check_donor_overlap.py`: Train/test separation verification

## License

Please refer to the original CZI CellxGene data license and terms of use for the Tabula Sapiens collection.

## Acknowledgments

- Chan Zuckerberg Initiative for the CellxGene Data Portal
- The Tabula Sapiens Consortium for the original dataset
- Open Genes and GenAge teams for longevity gene annotations

---

*Dataset generated: 2025-11-14*
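The train/test separation that `check_donor_overlap.py` is listed as verifying comes down to a set intersection over donor IDs. The sketch below is an illustration only, not that script: the function name `donor_overlap` and the donor IDs are hypothetical.

```python
def donor_overlap(train_donors, test_donors):
    """Return the sorted donor IDs present in both splits (empty if the split is clean)."""
    return sorted(set(train_donors) & set(test_donors))


# Hypothetical donor IDs, one entry per sample.
train_donors = ["donor_A", "donor_A", "donor_B"]
test_donors = ["donor_C", "donor_C"]

print(donor_overlap(train_donors, test_donors))
```

With the real files, the two ID lists would come from the donor ID field of `train.parquet` and `test.parquet`; the exact column name is not stated in this card, so it should be checked against the data before use.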

Provider:

transhumanist-already-exists