transhumanist-already-exists/aida-asian-pbmc-cell-age-related-cell-sentence-balanced-120k
Source: Hugging Face. Updated 2025-11-18; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/transhumanist-already-exists/aida-asian-pbmc-cell-age-related-cell-sentence-balanced-120k
Resource description:
# Cell2Sentence for Longevity - Balanced Dataset (120k)
This dataset is a donor-balanced subset of single-cell transcriptomics data (119,792 cells from 625 donors), designed for training foundation models to understand cellular aging and longevity patterns.
## Data Sources
### Primary Data Source
- **CZI CellxGene Collection**: [Tabula Sapiens - A multiple-organ, single-cell transcriptomic atlas of humans](https://cellxgene.cziscience.com/collections/ced320a1-29f3-47c1-a735-513c7084d508)
  - Original collection provides comprehensive single-cell RNA-seq data across multiple human organs and donors
### Gene Annotations
The dataset includes specialized gene sentence embeddings from two authoritative longevity databases:
1. **gene_sentence_opengenes**: Top age-related genes ranked by expression from [Open Genes Database](https://open-genes.com/)
   - Curated collection of genes associated with aging and longevity
   - Expression-based rankings for age-related biological processes
2. **gene_sentence_human_genage**: Human aging genes from [GenAge Database](https://genomics.senescence.info/genes/human.html)
   - The Human Ageing Genomic Resources (HAGR) collection
   - Genes potentially associated with human aging
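As a rough illustration of how an expression-ranked gene sentence could be built, the sketch below filters one cell's expression values to a gene set and orders the gene symbols from highest to lowest expression. The helper name `build_gene_sentence`, the `top_k` parameter, the tie-breaking rule, and the toy expression values are assumptions made for illustration; the dataset's actual preprocessing may differ.

```python
from typing import Dict, Iterable, Optional


def build_gene_sentence(
    expression: Dict[str, float],
    gene_set: Iterable[str],
    top_k: Optional[int] = None,
) -> str:
    """Return gene symbols from `gene_set`, ordered by descending expression.

    Hypothetical helper: genes with zero expression are dropped, and ties are
    broken alphabetically so the output is deterministic.
    """
    allowed = set(gene_set)
    expressed = [(g, v) for g, v in expression.items() if g in allowed and v > 0]
    # Sort by expression (high to low), then by gene symbol for stable ties.
    expressed.sort(key=lambda item: (-item[1], item[0]))
    if top_k is not None:
        expressed = expressed[:top_k]
    return " ".join(g for g, _ in expressed)


# Toy values, not taken from the dataset.
cell_expression = {"TP53": 2.0, "SIRT1": 5.5, "FOXO3": 0.0, "ACTB": 9.1, "IGF1R": 2.0}
aging_genes = ["TP53", "SIRT1", "FOXO3", "IGF1R"]
sentence = build_gene_sentence(cell_expression, aging_genes)
print(sentence)
```

Here `ACTB` is excluded because it is not in the gene set, and `FOXO3` is excluded because its toy expression value is zero.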
## Dataset Statistics
- **Total samples**: 119,792
- **Total donors**: 625
- **Train samples**: 95,846 (500 donors, 80.0%)
- **Test samples**: 23,946 (125 donors, 20.0%)
- **Unique cell types**: 32
- **Samples per donor**: 192 (median), 191.7 (mean)
### Train/Test Split
- **Train**: 80% of donors (500 donors)
- **Test**: 20% of donors (125 donors)
- **No donor overlap**: Complete separation between train and test sets
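Donor separation can be checked with a set intersection over donor IDs. This is a minimal sketch, not the project's `check_donor_overlap.py`; the column name `donor_id` and the placeholder IDs are assumptions.

```python
from typing import Iterable, Mapping, Set


def donor_overlap(
    train_rows: Iterable[Mapping],
    test_rows: Iterable[Mapping],
    donor_key: str = "donor_id",  # assumed column name
) -> Set[str]:
    """Return donor IDs present in both splits (empty set means no leakage)."""
    train_donors = {row[donor_key] for row in train_rows}
    test_donors = {row[donor_key] for row in test_rows}
    return train_donors & test_donors


# Toy rows; the donor IDs are placeholders.
train = [{"donor_id": "D001"}, {"donor_id": "D001"}, {"donor_id": "D002"}]
test = [{"donor_id": "D003"}]
overlap = donor_overlap(train, test)
print(f"Overlapping donors: {sorted(overlap)}")
```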
## Data Preparation Pipeline
### 1. Diversity-Preserving Donor Split
We created a stratified 80/20 train/test split at the donor level to ensure proportional representation of demographic and clinical characteristics:
**Stratification factors**:
- Age (binned into groups)
- Disease status
- Sex
- Self-reported ethnicity
- Smoking status
This approach ensures that the test set is representative of the overall donor population distribution.
**Stratification results**:
- **Total unique strata**: 63
  - Strata with only 1 donor: 11
  - Strata with 2-5 donors: 8
  - Strata with 6+ donors: 44
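A donor-level stratified split of this kind can be sketched with the standard library alone. The code below groups donors by a composite stratum key, shuffles each stratum with a seeded generator, and sends roughly 20% of each stratum to the test set. The field names, the rounding rule, and the random assignment of single-donor strata are assumptions for illustration, not a description of `create_diversity_split.py`.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

# Assumed metadata field names for the five stratification factors.
STRATA_KEYS = ("age_bin", "disease", "sex", "ethnicity", "smoking_status")


def stratified_donor_split(
    donors: Sequence[Dict[str, str]],
    test_fraction: float = 0.2,
    seed: int = 42,
) -> Tuple[List[str], List[str]]:
    """Split donor IDs into train/test while preserving stratum proportions."""
    rng = random.Random(seed)
    strata: Dict[tuple, List[str]] = defaultdict(list)
    for donor in donors:
        key = tuple(donor[k] for k in STRATA_KEYS)
        strata[key].append(donor["donor_id"])

    train_ids: List[str] = []
    test_ids: List[str] = []
    for key in sorted(strata):  # fixed iteration order keeps the split reproducible
        ids = sorted(strata[key])
        rng.shuffle(ids)
        if len(ids) == 1:
            # A single donor cannot be split proportionally; assign at random.
            n_test = 1 if rng.random() < test_fraction else 0
        else:
            n_test = round(len(ids) * test_fraction)
        test_ids.extend(ids[:n_test])
        train_ids.extend(ids[n_test:])
    return train_ids, test_ids


# Toy donors: two strata of ten donors each, differing only by sex.
donors = [
    {"donor_id": f"D{i:03d}", "age_bin": "30-39", "disease": "normal",
     "sex": "female" if i < 10 else "male", "ethnicity": "Korean",
     "smoking_status": "non-smoker"}
    for i in range(20)
]
train_ids, test_ids = stratified_donor_split(donors)
print(len(train_ids), len(test_ids))
```

With many small strata, per-stratum rounding is one reason the realized train/test proportions can drift slightly from an exact 80/20 match.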
### 2. Balanced Sampling Strategy
To create a balanced dataset with equal donor representation:
**Target**: 120,000 samples total (192 samples per donor × 625 donors)
**Sampling algorithm**:
1. Each donor contributes exactly 192 samples (or all available if fewer)
2. Samples are distributed proportionally across cell types for each donor
3. Two-pass redistribution algorithm:
   - **First pass**: Allocate target samples to each cell type (192 ÷ number of cell types)
   - **Second pass**: If any cell type has insufficient samples, redistribute the shortfall to other cell types from the same donor
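The allocation steps above can be sketched as follows. This is one possible reading, not the code in `create_balanced_dataset_hf.py`: the function name `allocate_samples` is hypothetical, the first pass gives each cell type an equal share capped by availability, and the second pass is written as a loop that repeats until the shortfall is absorbed or the donor runs out of cells.

```python
from typing import Dict


def allocate_samples(available: Dict[str, int], target: int = 192) -> Dict[str, int]:
    """Decide how many cells to draw per cell type for a single donor."""
    if not available:
        return {}
    cell_types = sorted(available)
    base, remainder = divmod(target, len(cell_types))

    # First pass: equal share per cell type, capped by what the donor has.
    quota: Dict[str, int] = {}
    for i, cell_type in enumerate(cell_types):
        share = base + (1 if i < remainder else 0)
        quota[cell_type] = min(share, available[cell_type])

    # Second pass: hand the shortfall to cell types that still have spare cells.
    shortfall = target - sum(quota.values())
    while shortfall > 0:
        spare = [ct for ct in cell_types if available[ct] > quota[ct]]
        if not spare:
            break  # donor has fewer than `target` cells in total
        per_type, extra = divmod(shortfall, len(spare))
        for i, cell_type in enumerate(spare):
            give = per_type + (1 if i < extra else 0)
            give = min(give, available[cell_type] - quota[cell_type])
            quota[cell_type] += give
            shortfall -= give
    return quota


# Toy availability counts, not taken from the dataset.
quota = allocate_samples({"B cell": 10, "NK cell": 300, "T cell": 500})
print(quota, sum(quota.values()))
```

A donor with fewer than 192 cells overall simply contributes everything: `allocate_samples({"B cell": 20, "T cell": 51})` allocates all 71 cells.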
**Results**:
- 623 out of 625 donors achieve exactly 192 samples
- 2 donors have fewer samples due to limited data availability in source:
  - JP_RIK_H007: 71 samples (only 71 cells available in original data)
  - SG_HEL_H001: 159 samples (only 159 cells available in original data)
### 3. Technical Implementation
**Parallel processing**:
- All operations use 64-worker parallel processing for efficiency
- HuggingFace `datasets` library with `num_proc=64` for fast data loading
- Efficient shuffling using HF datasets `.shuffle()` method
**Reproducibility**:
- Random seed: 42 for train set
- Random seed: 1042 for test set
- All sampling operations are deterministic
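To show what seeded, deterministic sampling means in practice, here is a small sketch using Python's standard `random` module rather than the HF `datasets` `.shuffle()` call mentioned above. The seeds come from this section; the helper name `sample_cell_indices` and the way a seed is applied per call are assumptions.

```python
import random
from typing import List

# Seeds stated in this section.
SPLIT_SEEDS = {"train": 42, "test": 1042}


def sample_cell_indices(n_available: int, n_samples: int, seed: int) -> List[int]:
    """Pick `n_samples` distinct indices reproducibly (all of them if fewer exist)."""
    rng = random.Random(seed)  # local generator, independent of global state
    k = min(n_samples, n_available)
    return sorted(rng.sample(range(n_available), k))


first = sample_cell_indices(1000, 192, SPLIT_SEEDS["train"])
second = sample_cell_indices(1000, 192, SPLIT_SEEDS["train"])
print(first == second)
```

Using a local `random.Random(seed)` instance instead of the module-level functions keeps results independent of any other code that touches the global random state.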
## Demographic Distribution Analysis
The stratified split keeps train and test proportions close for age, sex, and ethnicity (all within ±3.0%); smoking status shows the largest gap (±6.6%):
### Age Distribution
| Age Bin | Train % | Test % | Difference |
|---------|---------|--------|------------|
| 18-29 | 19.60% | 20.00% | +0.40% |
| 30-39 | 29.00% | 30.40% | +1.40% |
| 40-49 | 27.20% | 28.00% | +0.80% |
| 50-59 | 14.60% | 12.80% | -1.80% |
| 60-69 | 8.60% | 8.00% | -0.60% |
| 70+ | 1.00% | 0.80% | -0.20% |
**Maximum difference:** ±1.80%
### Sex Distribution
| Sex | Train % | Test % | Difference |
|--------|---------|--------|------------|
| Female | 55.40% | 58.40% | +3.00% |
| Male | 44.60% | 41.60% | -3.00% |
**Maximum difference:** ±3.00%
### Ethnicity Distribution
| Ethnicity | Train % | Test % | Difference |
|---------------------|---------|--------|------------|
| Indian | 4.80% | 4.80% | +0.00% |
| Japanese | 24.00% | 23.20% | -0.80% |
| Korean | 26.20% | 27.20% | +1.00% |
| Singaporean Chinese | 13.60% | 13.60% | +0.00% |
| Singaporean Indian | 11.20% | 11.20% | +0.00% |
| Singaporean Malay | 9.60% | 10.40% | +0.80% |
| Thai | 9.80% | 8.00% | -1.80% |
| Unknown | 0.80% | 1.60% | +0.80% |
**Maximum difference:** ±1.80%
### Disease Distribution
| Disease | Train % | Test % | Difference |
|---------|---------|--------|------------|
| Normal | 100.00% | 100.00% | +0.00% |
All donors in the dataset have normal disease status.
### Smoking Status Distribution
- **Train**: 71.6% non-smokers, 23.0% smokers, 5.4% unknown
- **Test**: 67.2% non-smokers, 29.6% smokers, 3.2% unknown
- **Maximum difference**: ±6.6% (smokers; the largest train/test gap of any feature)
### Distribution Quality Assessment
Train and test proportions stay close for most features:
- **Overall maximum difference:** ±6.6% (smoking status)
- **Largest difference among age, sex, and ethnicity:** ±3.0% (sex distribution)
- **Age and ethnicity:** Within ±2% difference
- **Perfect matches (0% difference):**
  - Indian ethnicity
  - Singaporean Chinese ethnicity
  - Singaporean Indian ethnicity
  - Disease status
The small differences observed are within acceptable ranges for stratified sampling and are largely due to:
1. Strata with very few donors (1-5 donors)
2. Rounding effects from the 80/20 split
3. Random assignment of single-donor strata
The largest gaps are 6.6% for smoking status and 3.0% for sex; every age and ethnicity category differs by less than 2%, so the test set broadly mirrors the overall donor population.
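The per-feature gaps can be recomputed directly from the tables in this section. In the sketch below, the percentages are copied from those tables, and the helper name `max_abs_difference` is hypothetical.

```python
from typing import Dict, Tuple

# (train %, test %) pairs copied from the distribution tables in this section.
DISTRIBUTIONS: Dict[str, Dict[str, Tuple[float, float]]] = {
    "age": {
        "18-29": (19.6, 20.0), "30-39": (29.0, 30.4), "40-49": (27.2, 28.0),
        "50-59": (14.6, 12.8), "60-69": (8.6, 8.0), "70+": (1.0, 0.8),
    },
    "sex": {"Female": (55.4, 58.4), "Male": (44.6, 41.6)},
    "ethnicity": {
        "Indian": (4.8, 4.8), "Japanese": (24.0, 23.2), "Korean": (26.2, 27.2),
        "Singaporean Chinese": (13.6, 13.6), "Singaporean Indian": (11.2, 11.2),
        "Singaporean Malay": (9.6, 10.4), "Thai": (9.8, 8.0), "Unknown": (0.8, 1.6),
    },
    "disease": {"Normal": (100.0, 100.0)},
    "smoking": {"non-smoker": (71.6, 67.2), "smoker": (23.0, 29.6), "unknown": (5.4, 3.2)},
}


def max_abs_difference(categories: Dict[str, Tuple[float, float]]) -> float:
    """Largest absolute train/test gap (percentage points) within one feature."""
    return round(max(abs(test - train) for train, test in categories.values()), 2)


summary = {feature: max_abs_difference(cats) for feature, cats in DISTRIBUTIONS.items()}
worst_feature = max(summary, key=summary.get)
for feature, gap in summary.items():
    print(f"{feature}: ±{gap:.1f}%")
print(f"Largest gap: {worst_feature}")
```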
## Cell Type Distribution
The dataset includes 32 unique cell types spanning multiple organ systems:
- Immune cells (T cells, B cells, NK cells, monocytes, dendritic cells)
- Blood cells (erythrocytes, platelets)
- Tissue-specific cells (hepatocytes, epithelial cells, endothelial cells)
- And more...
## Donor Diversity
The dataset represents diverse demographic and clinical characteristics:
- **Age range**: Multiple age groups from young adults to elderly (18-70+)
- **Geographic diversity**: Multiple countries and ethnicities (Indian, Japanese, Korean, Singaporean Chinese/Indian/Malay, Thai)
- **Health status**: All healthy donors (normal disease status)
- **Smoking status**: Both smokers and non-smokers are represented (roughly 70% non-smokers, 25% smokers, 5% unknown)
## Data Format
Each sample contains:
- **Gene expression data**: Transcriptomic profiles
- **Cell metadata**: Cell type, donor ID, tissue information
- **Donor metadata**: Age, sex, disease, ethnicity, smoking status
- **Gene annotations**: Longevity-related gene sentence embeddings
## Use Cases
This dataset is designed for:
- Training foundation models for cellular aging
- Understanding cell-type-specific aging patterns
- Predicting longevity biomarkers
- Cross-donor generalization studies
- Age-related disease research
## Citation
If you use this dataset, please cite:
1. **Original data source**:
   - The Tabula Sapiens Consortium. (2022). The Tabula Sapiens: A multiple-organ, single-cell transcriptomic atlas of humans. Science.
   - CellxGene Collection: https://cellxgene.cziscience.com/collections/ced320a1-29f3-47c1-a735-513c7084d508
2. **Gene annotations**:
   - Open Genes Database: https://open-genes.com/
   - GenAge Database: Tacutu, R., et al. (2018). Human Ageing Genomic Resources: new and updated databases. Nucleic Acids Research.
## Files
- `train.parquet` (16GB): Training dataset with 95,846 samples
- `test.parquet` (3.9GB): Test dataset with 23,946 samples
- `stats.json`: Detailed dataset statistics
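A minimal loading sketch with the Hugging Face `datasets` library is shown below. It assumes the Parquet files sit at the repository root under the names listed above, and it uses streaming so that the roughly 20 GB of data is not downloaded up front. The helper names `load_splits` and `first_record_keys` are hypothetical, and the record schema is not assumed here.

```python
REPO_ID = (
    "transhumanist-already-exists/"
    "aida-asian-pbmc-cell-age-related-cell-sentence-balanced-120k"
)
# Assumed file layout: both Parquet files at the repository root.
DATA_FILES = {"train": "train.parquet", "test": "test.parquet"}


def load_splits(streaming: bool = True):
    """Load both splits; with streaming=True, records are fetched lazily."""
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, data_files=DATA_FILES, streaming=streaming)


def first_record_keys(split: str = "train"):
    """Return the sorted field names of the first record in a split."""
    splits = load_splits(streaming=True)
    first = next(iter(splits[split]))
    return sorted(first.keys())
```

Calling `first_record_keys()` requires network access and is a quick way to inspect the available fields before committing to a full download.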
## Preprocessing Scripts
The complete data preparation pipeline is available in our repository, including:
- `create_diversity_split.py`: Stratified donor split generation
- `create_balanced_dataset_hf.py`: Balanced sampling with redistribution
- `check_smoking_distribution.py`: Demographic validation
- `check_donor_overlap.py`: Train/test separation verification
## License
Please refer to the original CZI CellxGene data license and terms of use for the Tabula Sapiens collection.
## Acknowledgments
- Chan Zuckerberg Initiative for the CellxGene Data Portal
- The Tabula Sapiens Consortium for the original dataset
- Open Genes and GenAge teams for longevity gene annotations
---
*Dataset generated: 2025-11-14*
Provider:
transhumanist-already-exists



