strand-ai/variantformer-1000g
Source: Hugging Face | Updated 2026-01-20 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/strand-ai/variantformer-1000g
Resource description:
---
license: cc-by-4.0
task_categories:
- tabular-classification
tags:
- genomics
- variants
- 1000-genomes
- gene-expression
- variantformer
size_categories:
- 100G<n<1T
---
# VariantFormer 1000 Genomes Dataset
Gene expression predictions from VariantFormer for 538 samples from the 1000 Genomes Project.
## Dataset Structure
```
├── manifest.csv              # Sample metadata (population, sex)
├── predictions/              # VariantFormer predictions
│   └── {sample_id}.parquet
└── vcf/                      # Per-sample VCF files
    ├── {sample_id}.vcf.gz
    └── {sample_id}.vcf.gz.tbi
```
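The layout above maps each sample ID to three repo-relative paths. As a minimal sketch, the hypothetical helper `sample_files` (not part of any library) builds those paths so they can be passed as the `filename` argument of `hf_hub_download`:

```python
from pathlib import PurePosixPath


def sample_files(sample_id: str) -> dict[str, str]:
    """Return the repo-relative paths for one sample, following the tree above."""
    predictions = PurePosixPath("predictions")
    vcf = PurePosixPath("vcf")
    return {
        "predictions": str(predictions / f"{sample_id}.parquet"),
        "vcf": str(vcf / f"{sample_id}.vcf.gz"),
        "vcf_index": str(vcf / f"{sample_id}.vcf.gz.tbi"),
    }


# HG00418 is the sample ID used in the Usage section.
paths = sample_files("HG00418")
print(paths["predictions"])
```

`PurePosixPath` is used because Hub paths always use forward slashes, regardless of the local operating system.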
## Files
- **manifest.csv**: Sample metadata with columns: `sample_id`, `population`, `superpopulation`, `sex`
- **Parquet files**: VariantFormer gene expression predictions (~446 MB per sample, ~240 GB total)
- **VCF files**: Variant calls per sample with tabix indexes (~380 GB total)
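The sizes above can be turned into a rough disk-space estimate before downloading a subset. This is only a sketch: `estimate_download_gb` is a hypothetical helper, it assumes 1 GB = 1000 MB, and it assumes VCF size is spread evenly across samples, which individual files will not match exactly.

```python
N_SAMPLES = 538      # samples in the dataset
PARQUET_MB = 446     # approximate size of one predictions file
VCF_TOTAL_GB = 380   # approximate total size of vcf/


def estimate_download_gb(n_samples: int, include_vcf: bool = False) -> float:
    """Rough download size in GB for n_samples samples."""
    total = n_samples * PARQUET_MB / 1000
    if include_vcf:
        # Assumption: every sample has the average VCF size.
        total += n_samples * VCF_TOTAL_GB / N_SAMPLES
    return total


print(f"Predictions only, all samples: {estimate_download_gb(N_SAMPLES):.0f} GB")
print(f"Predictions + VCF, all samples: {estimate_download_gb(N_SAMPLES, True):.0f} GB")
print(f"Predictions only, 10 samples: {estimate_download_gb(10):.2f} GB")
```

For all 538 samples this reproduces the ~240 GB and ~620 GB totals quoted in this card.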
## Usage
```python
import pandas as pd
from huggingface_hub import hf_hub_download, snapshot_download

# Download and load sample manifest
manifest_path = hf_hub_download(
    repo_id="strand-ai/variantformer-1000g",
    filename="manifest.csv",
    repo_type="dataset",
)
manifest = pd.read_csv(manifest_path)

# Download predictions for a single sample
pred_path = hf_hub_download(
    repo_id="strand-ai/variantformer-1000g",
    filename="predictions/HG00418.parquet",
    repo_type="dataset",
)
df = pd.read_parquet(pred_path)

# Download ALL data locally (~620 GB)
snapshot_download(
    repo_id="strand-ai/variantformer-1000g",
    repo_type="dataset",
    local_dir="./variantformer-1000g",
)
```
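Downloading everything is rarely necessary. A possible middle ground is to fetch only the samples from one population by passing file patterns to `snapshot_download` through its `allow_patterns` parameter. The sketch below is not the dataset authors' method: `patterns_for_population` and `download_population` are hypothetical helpers, and the rows in `TOY_MANIFEST` are made-up placeholders (the real population labels are whatever `manifest.csv` contains).

```python
import csv
import io

# Made-up rows using the manifest's documented columns; not real samples.
TOY_MANIFEST = (
    "sample_id,population,superpopulation,sex\n"
    "SAMPLE_A,POP1,SUPER1,female\n"
    "SAMPLE_B,POP2,SUPER1,male\n"
    "SAMPLE_C,POP1,SUPER1,male\n"
)


def patterns_for_population(manifest_csv: str, population: str,
                            include_vcf: bool = False) -> list[str]:
    """Build the list of repo paths for every sample in one population."""
    patterns = ["manifest.csv"]
    for row in csv.DictReader(io.StringIO(manifest_csv)):
        if row["population"] != population:
            continue
        sid = row["sample_id"]
        patterns.append(f"predictions/{sid}.parquet")
        if include_vcf:
            patterns.append(f"vcf/{sid}.vcf.gz")
            patterns.append(f"vcf/{sid}.vcf.gz.tbi")
    return patterns


def download_population(manifest_path: str, population: str, local_dir: str,
                        include_vcf: bool = False) -> str:
    """Download only the files for one population; returns the local folder."""
    # Imported here so the pattern helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    with open(manifest_path, newline="") as fh:
        patterns = patterns_for_population(fh.read(), population, include_vcf)
    return snapshot_download(
        repo_id="strand-ai/variantformer-1000g",
        repo_type="dataset",
        allow_patterns=patterns,
        local_dir=local_dir,
    )


# Dry run on the toy manifest: no network access needed.
print(patterns_for_population(TOY_MANIFEST, "POP1"))
```

With the real manifest, `download_population(manifest_path, "<population label>", "./variantformer-1000g")` would fetch roughly 446 MB per matching sample when VCFs are excluded.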
## Interactive Explorer
Explore the data interactively at [strandai.bio/1000g-variantformer](https://strandai.bio/1000g-variantformer)
## Citation
If you use this dataset, please cite:
```
@dataset{strand_variantformer_1000g,
title={VariantFormer 1000 Genomes Predictions},
author={Strand AI},
year={2026},
url={https://huggingface.co/datasets/strand-ai/variantformer-1000g}
}
```
## License
This dataset is released under CC-BY-4.0 for research use.
## Contact
Questions? Email us at [founders@strandai.bio](mailto:founders@strandai.bio)
Provider:
strand-ai



