(External) Automated Construction of Viral Sequence Signature Features for Classification

Advisor: Dr. Oliver Lung, Canadian Food Inspection Agency

Suggested Co-Advisors: Zeny Feng, Stefan Kremer, Baozhong Meng, Dan Tulpan

 

The sensitivity and specificity of current bioinformatics pipelines (such as those using Kraken2 and Centrifuge) for assigning taxonomic names and ranks to viral reads are under 95% and 90%, respectively. Furthermore, only limited benchmarking of classification pipelines on viral metagenomics data has been completed. The goal of this project is to create a quality-controlled database of key mammalian viral pathogens. Students will perform a literature review to identify features that can be extracted from viral genomes and are useful for clustering and/or classification. Students will then test the usefulness of these candidate features by extracting them from each sample in the database and training simple predictive models, in order to develop a better understanding of which features can form useful components of a viral sequence signature. If the project is extended, students will design a deep learning model that uses these features to predict viral taxonomic information.
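
As a rough illustration of the extract-features-then-train-a-simple-model workflow described above, here is a minimal Python sketch. It is not the project's prescribed method: the choice of k-mer frequencies and GC content as features, the nearest-centroid classifier, all function names, and the toy sequences (synthetic strings, not real viral genomes) are assumptions made purely for illustration. The features actually used would come out of the literature review.

```python
from collections import Counter
from itertools import product
import math

ALPHABET = "ACGT"


def kmer_frequencies(sequence, k=2):
    """Relative frequency of every possible k-mer over A/C/G/T.

    k-mers containing other symbols (e.g. N) are ignored.
    """
    seq = sequence.upper()
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[km] for km in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[km] / total for km in kmers]


def gc_content(sequence):
    """Fraction of unambiguous bases that are G or C."""
    seq = sequence.upper()
    valid = sum(seq.count(base) for base in ALPHABET)
    if valid == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / valid


def feature_vector(sequence, k=2):
    """Concatenate the k-mer frequencies with GC content (4**k + 1 values)."""
    return kmer_frequencies(sequence, k) + [gc_content(sequence)]


def fit_centroids(vectors, labels):
    """Average the feature vectors of each class (nearest-centroid model)."""
    grouped = {}
    for vector, label in zip(vectors, labels):
        grouped.setdefault(label, []).append(vector)
    return {
        label: [sum(column) / len(members) for column in zip(*members)]
        for label, members in grouped.items()
    }


def predict(centroids, vector):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], vector))


# Hypothetical toy data: synthetic strings with made-up labels.
toy_training = [
    ("ATATATATTATAATAT", "AT_rich"),
    ("TTATAAATATTTAATA", "AT_rich"),
    ("GCGCGGCCGCGGCGCC", "GC_rich"),
    ("CCGGCGCGGGCCGCGC", "GC_rich"),
]
toy_vectors = [feature_vector(seq) for seq, _ in toy_training]
toy_labels = [label for _, label in toy_training]
toy_centroids = fit_centroids(toy_vectors, toy_labels)

print(predict(toy_centroids, feature_vector("GGCCGCGCGCGG")))
```

In practice the sequences would be read from the curated database, and an established machine learning library would likely replace the hand-written classifier. The sketch only shows how per-genome features can be turned into fixed-length vectors that a simple model can consume.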

This is a one-semester project with the potential to extend to two semesters.

Knowledge/Skills

Python scripting, familiarity with different types of machine learning models, feature selection/extraction

If students have any additional skills, we encourage them to point them out, as they would be an asset.