LiLabUNC/Variant-Foundation-Embeddings
Source: Hugging Face. Updated 2025-12-12. Indexed 2026-01-03.
Download link:
https://hf-mirror.com/datasets/LiLabUNC/Variant-Foundation-Embeddings
Description:
Here we present the variant-level embeddings for large-scale genetic analysis as described in "Incorporating LLM Embeddings for Variation Across the Human Genome", based on curated annotations using high-quality functional data from FAVOR, ClinVar, and the GWAS Catalog. We currently provide embeddings generated with either OpenAI's text-embedding-3-large (3072-dimensional) or Qwen's Qwen3-Embedding-0.6B (1024-dimensional) model. Genetic variants are identified by their chromosome, position (hg38 build), reference allele based on UKB coding, and alternate allele based on UKB coding, together with their respective LLM embeddings. We currently release datasets at the following scales:
1. HapMap3 & MEGA (~1.5 million variants, OpenAI GPT-3.5)
2. UKB Imputed (~90 million variants, OpenAI GPT-3.5)
3. All FAVOR Variants (~9 billion variants, Qwen3-0.6B)
More datasets are to come shortly.
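The record layout described above (chromosome, hg38 position, reference allele, alternate allele, embedding) can be sketched in Python as follows. This is a minimal illustration, not the dataset's actual schema: the field names `chrom`, `pos`, `ref`, `alt`, and `embedding` are assumptions, the helper names are hypothetical, and the 4-dimensional vectors and positions are made-up toy values standing in for the real 3072- or 1024-dimensional embeddings.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Dict, Iterable, Sequence, Tuple

# Embedding sizes as stated in the description above.
EMBEDDING_DIMS = {
    "text-embedding-3-large": 3072,
    "Qwen3-Embedding-0.6B": 1024,
}

# (chromosome, position, reference allele, alternate allele)
VariantKey = Tuple[str, int, str, str]


@dataclass(frozen=True)
class VariantEmbedding:
    """One variant and its embedding. Field names are assumed, not official."""
    chrom: str                     # chromosome, e.g. "1"
    pos: int                       # position on the hg38 build
    ref: str                       # reference allele
    alt: str                       # alternate allele
    embedding: Tuple[float, ...]   # LLM embedding vector

    @property
    def key(self) -> VariantKey:
        return (self.chrom, self.pos, self.ref, self.alt)


def build_index(rows: Iterable[VariantEmbedding]) -> Dict[VariantKey, VariantEmbedding]:
    """Index rows by their (chrom, pos, ref, alt) identifier for direct lookup."""
    return {row.key: row for row in rows}


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy rows: positions, alleles, and vectors are invented for illustration only.
rows = [
    VariantEmbedding("1", 12345, "A", "G", (1.0, 0.0, 0.0, 0.0)),
    VariantEmbedding("1", 67890, "C", "T", (1.0, 1.0, 0.0, 0.0)),
]
index = build_index(rows)
similarity = cosine_similarity(
    index[("1", 12345, "A", "G")].embedding,
    index[("1", 67890, "C", "T")].embedding,
)
print(round(similarity, 4))
```

Keying on the full (chromosome, position, reference allele, alternate allele) tuple rather than position alone keeps multi-allelic sites distinct. Embeddings from the two models have different dimensions, so they should only be compared within the same model.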
Provider:
LiLabUNC



