Genomic signatures of Amplicon Sequence Variants (ASVs) for bioindicator discovery

Mehrdad Hajibabaei (IB)

Our work explores the environmental and anthropogenic factors that influence the composition of benthic macroinvertebrate communities. To accomplish this, we analyze datasets in which each sample is described by potentially thousands of amplicon sequence variants (ASVs). From these ASVs we then attempt to identify potential bioindicators that are associated with conditions or clusters of interest. Developing efficient ways to accurately identify the taxonomy of ASVs is important, since this understanding can provide a clearer picture of the factors at play within an environment. It may be possible to use genomic signatures for this task. Transforming DNA sequences into genomic signatures is particularly useful because the signatures are computationally efficient to construct and their properties allow them to be used with a variety of machine learning algorithms. Genomic signatures have been used to study the properties of genomes, and a considerable amount of work has demonstrated that they can be an effective means of constructing trees that delineate groups of related organisms. However, before any comprehensive classification tools based on genomic signatures can be developed, the ability of these transformations to delineate ASVs must be validated. This project would investigate the properties of various ASV sequence transformations (chaos game representation, CGR; frequency chaos game representation, FCGR; recurrence quantification analysis, RQA) and test how well these transformations can reconstruct the taxonomy of ASVs. Specifically, a student would:

  • Develop an understanding of relevant nucleic acid transformations and their applications
  • Construct a Python package which would apply a transformation to a sequence
  • Investigate the ability of transformations to cluster ASVs according to their known taxonomic rank
  • Investigate any potential outliers in the clustering results
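
As an illustration of the kind of transformation the package might apply, the following is a minimal sketch of CGR and FCGR in Python. It is not the project's implementation: the function names (cgr_points, fcgr), the corner assignment for A, C, G and T, and the choice to skip ambiguous bases are all assumptions made for the example.

```python
import numpy as np

# Assumed corner layout (x, y); other conventions exist and only permute the grid.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Return the CGR coordinates of a sequence, one (x, y) row per base.

    Each point is the midpoint between the previous point and the corner
    of the current base, starting from the centre of the unit square.
    """
    x, y = 0.5, 0.5
    points = []
    for base in seq.upper():
        corner = CORNERS.get(base)
        if corner is None:
            # Simplification: ambiguous bases (e.g. N) are skipped, so the
            # bases on either side are treated as if they were adjacent.
            continue
        x = (x + corner[0]) / 2
        y = (y + corner[1]) / 2
        points.append((x, y))
    return np.array(points, dtype=float).reshape(-1, 2)


def fcgr(seq, k=3):
    """Return a 2^k x 2^k grid of k-mer frequencies derived from the CGR."""
    size = 2 ** k
    grid = np.zeros((size, size))
    points = cgr_points(seq)
    # The first k-1 points do not yet encode a complete k-mer, so drop them.
    for x, y in points[k - 1:]:
        col = min(int(x * size), size - 1)
        row = min(int(y * size), size - 1)
        grid[row, col] += 1
    total = grid.sum()
    return grid / total if total else grid
```

Flattening the grid (for example with fcgr(seq, k).ravel()) gives a fixed-length vector for each ASV, which could then be passed to a distance measure or clustering algorithm of the student's choosing.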