A Streamlined Workflow to Facilitate Genome-Scale AI. In Data Science & Engineering Master of Advanced Study (DSE MAS) Capstone Projects

Name: A Streamlined Workflow to Facilitate Genome-Scale AI. In Data Science & Engineering Master of Advanced Study (DSE MAS) Capstone Projects
Creator: UC San Diego Library Digital Collections
Published: 2026-04-17 00:54:11
License: No description available

DataCite Commons | Updated 2026-04-17 | Indexed 2026-05-06

Download link:

https://library.ucsd.edu/dc/object/bb4755006h

Resource description:

Since the first draft of the human genome was published in the early 2000s, sequencing costs have fallen and sequencing speed has risen faster than Moore's Law would predict, allowing scientists to collect a wealth of data on the code of life. When genomic sequencing is extended to large populations of humans or other animals, the genomic data of each individual, known as the genotype, can be paired with that individual's observable or measurable traits, known as the phenotype, so that relationships between genotypes and phenotypes can be analyzed. This type of work has powered numerous Genome-Wide Association Studies (GWAS) that have associated diseases with underlying genetic causes that can be traced to specific positions on the genome. In this study we have collected genotypes for a heterogeneous population of about 13,000 rats using unphased Whole-Genome Sequencing. The data for each rat are represented as a constellation of point mutations, known as Single Nucleotide Polymorphisms (SNPs), where each mutation is characterized as a positional variation with respect to the rn6 rat reference genome, following a reference-guided assembly and alignment of nucleotide reads. Alongside the genotypes collected by sequencing, we have also collected quantitative behavioral phenotypes, such as locomotor activity in response to various external stimuli. With this comprehensive collection of genotype data, represented by millions of SNPs per rat and paired with behavioral phenotype data, this work seeks to assess the capabilities of Machine Learning for predicting rat behavior on a whole-genome scale.
However, genomic data are complex: the number of SNPs is very large, phenotypes arise from dependent relationships in the data, and far less phenotype data is available than genotype data. As a result, there are many different ways to build a Machine Learning solution for predicting phenotype from genotype. These methods differ in their data processing and reduction techniques, their models, and their tuned hyperparameters. To handle the complexity of devising Machine Learning models that predict phenotypes from millions of SNPs, we propose a framework that combines powerful ML workflow tools, namely Hydra, Optuna, and MLflow, with a highly scalable genomics toolkit, namely sgkit, into a streamlined pipeline meant to facilitate genome-scale AI.
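
As a rough illustration of the kind of loop such a pipeline automates, the sketch below uses only the Python standard library. It is not the authors' code and does not call sgkit, Hydra, Optuna, or MLflow. The simulated data, the function names (`dosage`, `maf_filter`, `fit_marginal`, `random_search`), the minor-allele-frequency filter, the per-SNP shrinkage model, and the random search are all assumptions chosen for illustration. The search loop stands in for a hyperparameter study and the `trials` list stands in for experiment tracking.

```python
import random


def dosage(call):
    """Unphased diploid call, e.g. (0, 1): count the non-reference alleles."""
    return sum(1 for allele in call if allele != 0)


def maf_filter(matrix, min_maf):
    """Return column indices whose minor allele frequency is >= min_maf."""
    n = len(matrix)
    keep = []
    for j in range(len(matrix[0])):
        alt = sum(row[j] for row in matrix) / (2 * n)
        if min(alt, 1 - alt) >= min_maf:
            keep.append(j)
    return keep


def fit_marginal(X, y, cols, shrink):
    """Toy model (an assumption): per-SNP marginal effects with shrinkage."""
    n = len(X)
    y_bar = sum(y) / n
    means, betas = {}, {}
    for j in cols:
        m = sum(row[j] for row in X) / n
        cov = sum((row[j] - m) * (yi - y_bar) for row, yi in zip(X, y))
        var = sum((row[j] - m) ** 2 for row in X)
        means[j], betas[j] = m, cov / (var + shrink)
    return y_bar, means, betas


def predict(model, row):
    y_bar, means, betas = model
    return y_bar + sum(betas[j] * (row[j] - means[j]) for j in betas)


def mse(model, X, y):
    return sum((predict(model, r) - yi) ** 2 for r, yi in zip(X, y)) / len(y)


def random_search(X_tr, y_tr, X_va, y_va, n_trials=20, seed=0):
    """Sample hyperparameters, score each on validation data, keep every trial."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        params = {
            "min_maf": rng.uniform(0.0, 0.3),     # data-reduction knob
            "shrink": 10 ** rng.uniform(-2, 2),   # model knob
        }
        cols = maf_filter(X_tr, params["min_maf"])
        model = fit_marginal(X_tr, y_tr, cols, params["shrink"])
        trials.append({"params": params, "n_snps": len(cols),
                       "loss": mse(model, X_va, y_va)})
    return min(trials, key=lambda t: t["loss"]), trials


def simulate(n_samples, n_snps, seed):
    """Hypothetical data: random genotype calls and a phenotype driven by two SNPs."""
    rng = random.Random(seed)
    freqs = [rng.uniform(0.05, 0.5) for _ in range(n_snps)]
    calls = [[(int(rng.random() < f), int(rng.random() < f)) for f in freqs]
             for _ in range(n_samples)]
    X = [[dosage(c) for c in sample] for sample in calls]
    y = [1.5 * row[0] - 1.0 * row[1] + rng.gauss(0, 0.5) for row in X]
    return X, y


X, y = simulate(120, 30, seed=1)
best, trials = random_search(X[:80], y[:80], X[80:], y[80:])
print(best["params"], best["n_snps"], round(best["loss"], 3))
```

At the scale described above (millions of SNPs for about 13,000 rats), nested Python lists would not be practical. The sketch only shows how data reduction, model fitting, and hyperparameter search fit together in one loop.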

Provided by:

UC San Diego Library Digital Collections

Created:

2023-08-11
