(Internal) Feature Selection on Whole-genome SNP Data

Advisor: Yan Yan, Computer Science

Proposed biological co-advisor: Lewis Lukens, Plant Agriculture

With the development of next-generation sequencing, vast amounts of whole-genome data have become available. Genome-Wide Association Studies (GWAS) have served as the primary method over the past decade for identifying associations between genetic variants and traits or diseases (also known as phenotypes). The most commonly used genetic variants are Single Nucleotide Polymorphisms (SNPs), which are changes in single DNA base pairs. Of all SNPs, only a small subset is correlated with a given phenotype, which means that the association signal in the genotype data is sparse. In addition, there are far more SNPs than samples. Such data poses a challenge to current GWAS methods, as many of them fail to find the correlated SNPs in large, complex datasets.

Feature selection on whole-genome SNP data can potentially address this issue. By selecting a reduced number of SNPs (features) with substantially larger effects than the other SNPs, researchers can apply existing methods to the most promising SNPs. Penalized regression models, including the Least Absolute Shrinkage and Selection Operator (LASSO) and its variants, have proven well suited to sparse problems and often serve as a good way to select significant features.
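As a rough illustration of how an L1 penalty performs feature selection, the sketch below fits a plain LASSO by coordinate descent on simulated genotypes with more SNPs than samples. It is not the method of [1] or [2]. The data, the causal SNP indices, the effect sizes, and the penalty value `lam` are all assumptions chosen for the example, and every function name is hypothetical.

```python
import random

def standardize(values):
    """Centre a column and scale it to unit (population) variance."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / sd if sd > 0 else 0.0 for v in values]

def soft_threshold(z, lam):
    """Shrink z towards zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso_coordinate_descent(columns, y, lam, sweeps=50):
    """Minimise (1/2n)*||y - X b||^2 + lam*||b||_1.

    `columns` holds one standardized list per SNP; `y` must be centred.
    """
    n = len(y)
    beta = [0.0] * len(columns)
    residual = list(y)
    for _ in range(sweeps):
        for j, col in enumerate(columns):
            # Correlation of SNP j with the residual, with its own effect added back.
            rho = sum(c * r for c, r in zip(col, residual)) / n + beta[j]
            new = soft_threshold(rho, lam)
            if new != beta[j]:
                delta = new - beta[j]
                residual = [r - delta * c for r, c in zip(residual, col)]
                beta[j] = new
    return beta

# Simulated data (assumption): 100 samples, 200 SNPs coded 0/1/2, two causal SNPs.
random.seed(0)
n_samples, n_snps = 100, 200
causal_effects = {3: 2.0, 17: -2.0}  # hypothetical SNP index -> effect size
genotypes = [[random.choice([0, 1, 2]) for _ in range(n_samples)]
             for _ in range(n_snps)]
phenotype = [sum(eff * genotypes[j][i] for j, eff in causal_effects.items())
             + random.gauss(0, 0.5) for i in range(n_samples)]

columns = [standardize(g) for g in genotypes]
mean_y = sum(phenotype) / n_samples
y = [v - mean_y for v in phenotype]

beta = lasso_coordinate_descent(columns, y, lam=0.5)
selected = [j for j, b in enumerate(beta) if b != 0.0]
print("SNPs with non-zero coefficients:", selected)
```

Because soft-thresholding sets small coefficients exactly to zero, the SNPs with non-zero coefficients form the selected feature set. In practice the penalty would be tuned, for example by cross-validation, rather than fixed by hand.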

So far, we have studied several methods [1,2] on a model plant, Arabidopsis thaliana, and we plan to expand the study to further validate the findings and improve the methods. Tasks for the student working on the project may include:
• Collect additional SNP datasets from online databases and test the existing methods on them to validate the findings.
• Develop a tool/pipeline to automatically run multiple methods on the same dataset and compare their performance.
• Explore the literature to find other methods that could be used for feature selection, e.g. deep learning methods, and apply them to the SNP datasets to examine their performance.
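One possible shape for the comparison pipeline in the second task is sketched below: each method is a function that takes the same genotypes and phenotype and returns a set of selected SNP indices, and a small harness scores every method in the same way. The two selectors (top-k and thresholded marginal correlation) are simple stand-ins rather than the methods of [1,2], all names are hypothetical, and precision and recall can be computed here only because the data is simulated with known causal SNPs.

```python
import random

def abs_correlation(x, y):
    """Absolute Pearson correlation between one SNP and the phenotype."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / (sxx * syy) ** 0.5

def top_k_marginal(k):
    """Selector factory: keep the k SNPs with the largest marginal correlation."""
    def select(snps, y):
        scores = [abs_correlation(s, y) for s in snps]
        ranked = sorted(range(len(snps)), key=lambda j: scores[j], reverse=True)
        return set(ranked[:k])
    return select

def correlation_threshold(t):
    """Selector factory: keep every SNP whose marginal correlation is at least t."""
    def select(snps, y):
        return {j for j, s in enumerate(snps) if abs_correlation(s, y) >= t}
    return select

def compare(methods, snps, y, truth):
    """Run every selector on the same data and score it against known causal SNPs."""
    report = {}
    for name, method in methods.items():
        chosen = method(snps, y)
        hits = len(chosen & truth)
        report[name] = {
            "n_selected": len(chosen),
            "precision": hits / len(chosen) if chosen else 0.0,
            "recall": hits / len(truth) if truth else 0.0,
        }
    return report

# Simulated data (assumption): 100 samples, 200 SNPs coded 0/1/2, two causal SNPs.
random.seed(1)
n_samples, n_snps = 100, 200
effects = {3: 2.0, 17: -2.0}  # hypothetical SNP index -> effect size
snps = [[random.choice([0, 1, 2]) for _ in range(n_samples)]
        for _ in range(n_snps)]
phenotype = [sum(e * snps[j][i] for j, e in effects.items())
             + random.gauss(0, 0.5) for i in range(n_samples)]

methods = {
    "top_5_marginal": top_k_marginal(5),
    "corr_at_least_0.4": correlation_threshold(0.4),
}
report = compare(methods, snps, phenotype, truth=set(effects))
for name, row in report.items():
    print(name, row)
```

Keeping every method behind the same "data in, set of SNP indices out" interface means a new method, such as a LASSO variant or a deep learning model, could be added to `methods` without touching the scoring code. On real datasets, where the causal SNPs are unknown, the scoring step would need a different criterion, such as overlap between methods or agreement with previously reported associations.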

The project is intended to be completed in one semester but could also be extended to two semesters.

References
[1] Nikita Kohli, Jabed Tomal, Wenjun Lin, and Yan Yan (2023). PentaPen: Combining Penalized Models for Identification of Important SNPs on Whole-genome Arabidopsis Thaliana Data. MDPI Genes, minor revision.
[2] Nisha Puthiyedth, Nuoyi Zhang, Ziqing Wang, and Yan Yan (2021). Performance Comparison of LASSO Variants with Genome-Wide Association Studies (GWAS). IEEE International Conference on Bioinformatics and Biomedicine (BIBM) 2021, 1682-1684.