
A Streamlined Workflow to Facilitate Genome-Scale AI. In Data Science & Engineering Master of Advanced Study (DSE MAS) Capstone Projects

DataCite Commons | Updated 2026-04-17 | Indexed 2026-05-06
Download link:
https://library.ucsd.edu/dc/object/bb4755006h
Resource description:
Since the first draft of the human genome was published in the early 2000s, sequencing costs have fallen drastically and sequencing speed has risen at a pace that has defied Moore’s Law, allowing scientists to collect a wealth of data regarding the code of life. When genomic sequencing is extended to large populations of humans or other animals, the genomic data of each individual, known as the genotype, can be paired with observable or measurable traits of that individual, known as the phenotype, so that relationships between genotypes and phenotypes can be analyzed. This type of work has powered numerous Genome-Wide Association Studies (GWAS) that have associated diseases with underlying genetic causes that can be traced to specific positions on the genome. In this study, we have collected genotypes for a heterogeneous population of about 13,000 rats using unphased Whole-Genome Sequencing. The data for each rat are represented as a constellation of point mutations, also known as Single Nucleotide Polymorphisms (SNPs), where each mutation is characterized as a positional variation with respect to the rn6 rat reference genome, following a reference-guided assembly and alignment of nucleotide reads. Alongside the genotypes, we have collected quantitative behavioral phenotypes, such as locomotor activity in response to various external stimuli. With this comprehensive collection of genotype data, represented by millions of SNPs per rat and paired with behavioral phenotype data, this work seeks to assess the capabilities of Machine Learning for predicting rat behavior on a whole-genome scale.
However, given the complexity of genomic data, such as the sheer dimensionality of SNPs and the dependent relationships that determine phenotypes, as well as the challenges posed by the scarcity of phenotype data relative to genotype data, there are many ways to approach a Machine Learning solution for predicting phenotype from genotype. These approaches differ in their data processing and reduction techniques, their models, and their tuned hyperparameters. To handle the complexity of devising Machine Learning models that predict phenotypes from millions of SNPs, we propose a framework that combines powerful ML workflow tools, namely Hydra, Optuna, and MLflow, with a highly scalable genomics toolkit, namely sgkit, in a streamlined pipeline meant to facilitate genome-scale AI.
Provider:
UC San Diego Library Digital Collections
Created:
2023-08-11