A Streamlined Workflow to Facilitate Genome-Scale AI. In Data Science & Engineering Master of Advanced Study (DSE MAS) Capstone Projects
Source: DataCite Commons. Updated 2026-04-17; indexed 2026-05-06.
Download link:
https://library.ucsd.edu/dc/object/bb4755006h
Description:
Since the first draft of the human genome was published in the early 2000s, the drastic reduction
in the cost of sequencing and the increase in the speed at which genomes can be sequenced
have outpaced Moore’s Law and allowed scientists to collect a wealth of data about the
code of life. When genomic sequencing is extended to large populations of humans or other animals,
the genomic data of each individual, known as the genotype, can be paired with observable
or measurable traits of that individual, known as the phenotype, so that relationships
between genotypes and phenotypes can be analyzed. This type of work has powered numerous
Genome-Wide Association Studies (GWAS) that have associated diseases with
underlying genetic causes traceable to specific positions on the genome. In this study we
have collected genotypes for a heterogeneous population of about 13,000 rats using unphased
Whole-Genome Sequencing. The data for each rat is represented as a constellation of point
mutations, also known as Single Nucleotide Polymorphisms (SNPs), where each mutation is
characterized as a positional variation with respect to a rat reference genome, the rn6 reference
genome, following an assembly and alignment of nucleotide reads guided by the reference.
In conjunction with collecting rat genotypes by sequencing, we have also collected quantitative
behavioral phenotypes, such as locomotor activity in response to various external stimuli.
With this comprehensive collection of genotype data, represented by millions of SNPs per rat
and paired with behavioral phenotype data, this work seeks to assess the capabilities of Machine
Learning for predicting rat behavior on a whole-genome scale. However, given the complexity
of genomic data, such as the sheer dimensionality of SNPs and the dependent relationships that determine
phenotypes, as well as the scarcity of phenotype data relative to genotype
data, there are many ways to approach a Machine Learning solution for predicting phenotype from
genotype. These methods all rely on different data processing and reduction techniques, different
models, and different tuned hyperparameters. To handle the complexity of devising Machine Learning
models that predict phenotypes from millions of SNPs, we propose a framework that combines powerful
ML workflow tools, namely Hydra, Optuna, and MLflow, with a highly scalable genomics toolkit,
namely sgkit, to form a streamlined pipeline meant to facilitate genome-scale AI.
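
As a rough illustration of the SNP representation and of one possible data reduction step mentioned in the abstract, the sketch below encodes unphased biallelic genotype calls as alternate-allele counts relative to the reference and then drops SNPs with a low minor-allele frequency. It is not the project's pipeline: the function names, the toy calls, and the 0.2 threshold are hypothetical, and it uses only the Python standard library rather than sgkit.

```python
from typing import List, Tuple


def dosage(call: str) -> int:
    """Alternate-allele count for an unphased biallelic call, e.g. '0/1' -> 1.

    Assumes VCF-style calls with no missing alleles ('.').
    """
    return sum(int(allele) for allele in call.split("/"))


def encode(calls: List[List[str]]) -> List[List[int]]:
    """Turn a rats-by-SNPs table of calls into a dosage matrix (0, 1, or 2)."""
    return [[dosage(call) for call in row] for row in calls]


def minor_allele_frequency(matrix: List[List[int]], snp: int) -> float:
    """Frequency of the rarer allele at one SNP column across all rats."""
    alt_freq = sum(row[snp] for row in matrix) / (2 * len(matrix))
    return min(alt_freq, 1.0 - alt_freq)


def filter_by_maf(
    matrix: List[List[int]], threshold: float
) -> Tuple[List[int], List[List[int]]]:
    """Keep only SNP columns whose minor-allele frequency meets the threshold."""
    n_snps = len(matrix[0])
    keep = [j for j in range(n_snps)
            if minor_allele_frequency(matrix, j) >= threshold]
    return keep, [[row[j] for j in keep] for row in matrix]


# Toy data: 4 rats x 3 SNPs (hypothetical values, not from the study).
calls = [
    ["0/0", "0/1", "1/1"],
    ["0/0", "1/1", "1/1"],
    ["0/1", "0/0", "1/1"],
    ["0/0", "0/1", "1/1"],
]
dosages = encode(calls)
kept, reduced = filter_by_maf(dosages, threshold=0.2)
```

Because the calls are unphased, `0/1` and `1/0` carry the same information, which is why a single count per SNP is enough here. A threshold such as the one passed to `filter_by_maf` is the kind of setting that could be exposed as a tunable hyperparameter alongside model choices.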
Provider:
UC San Diego Library Digital Collections
Created:
2023-08-11



