(External) Sharing, searching, and analysis of genomics datasets in the cloud

Advisor: Heather Ward, DNAstack

No specific proposed co-advisors. See Bioinformatics Faculty page for faculty listings.

About Us

DNAstack is a team of software engineers, researchers, and geneticists building a cloud software solution that will accelerate discoveries in genomics and benefit human health. We believe that our platform will power the next generation of genomics research. DNAstack is based in Toronto, Canada, and our platform is built on top of leading cloud solutions. We work closely with researchers, clinicians, and other users to help them meet their bioinformatics needs at scale.

DNAstack is committed to the development of open science and is part of global initiatives for the standardisation of bioinformatics tools. We strive to incorporate the latest open APIs to increase data accessibility for all users and to commoditise access to scalable bioinformatics.

More information on DNAstack can be found at https://dnastack.com

Project Description

This project offers students the opportunity to apply bioinformatics practices in an industry environment. Students will write a bioinformatics workflow and associated documentation that will act as an introduction and walkthrough to DNAstack’s product suite. The workflow will run a simple analysis on data that has been shared using DNAstack’s Publisher and queried using Explorer; the analysis itself will be run via Workbench. The workflow and its documentation will be made publicly available as a tutorial on the end-to-end use of DNAstack’s product suite.

The workflow will be written in the Workflow Description Language (WDL) and its tools will be containerised using Docker. All code will be version-controlled using Git. The student will set up their bioinformatics pipeline to run reproducibly in all three major cloud environments (Amazon Web Services, Microsoft Azure, and Google Cloud Platform) as well as in an HPC environment, and will optimise their pipeline for speed, accuracy, and compute cost. The student will validate the accuracy of their pipeline and write thorough documentation detailing the analysis steps, pipeline inputs and outputs, and the value added by their analysis. Finally, the student will publish their workflow and documentation on Dockstore, where it can be run by other researchers.

Students will be responsible for developing and testing the workflow they write and for ensuring its quality. Potential workflows include:

  • RNA-Seq analysis
  • Lineage assignment and construction of phylogenetic trees
  • Structural and copy number variant analysis

These pipelines should be based on accepted, published methods, but will offer students the chance to use the latest containerisation, bioinformatics, and cloud technologies. Additionally, students will be exposed to a high-throughput cloud environment capable of analysing terabytes of data concurrently. Students will gain invaluable hands-on cloud experience.

This is a one-semester, remote position.

Project Responsibilities

  • The student will familiarise themselves with the existing public datasets available on DNAstack’s Explorer networks (e.g. viral.ai, neuroscience.ai) and choose at least two to be combined and co-analysed. Alternatively, the student may choose one existing data source, identify a second public dataset that is not currently available in any of DNAstack’s networks, and add that dataset to a DNAstack network.
  • The student will design a simple analysis that can be run on the chosen datasets. This analysis should run quickly, since it is meant to act as a tutorial workflow for users to get up and running with DNAstack’s tooling, and should produce, at a minimum, visually interesting graphs or charts as output.
  • The student will package the steps and tools necessary to analyse their chosen datasets into Docker containers and implement their analysis as a workflow in the Workflow Description Language (WDL).
  • The student will thoroughly document the steps involved in a) connecting a new dataset and sharing data using Publisher; b) querying datasets in DNAstack’s networks using Explorer; and c) running their analysis pipeline using Workbench. Each step should be documented both as performed in a web browser and as the corresponding commands in DNAstack’s command-line interface (CLI).
  • The student will test all aspects of the workflows they create and will assist in the quality assurance of the outputs.

Knowledge and Skills

  • Experience with R, Python, Bash, Java, or related data analysis languages (required).
  • Experience with Unix and common bioinformatics tools for genomic analysis (required).
  • A strong background in the field of molecular genetics, particularly in the analysis of DNA and RNA NGS data (required).
  • Experience working in a cloud environment would be considered an asset.
  • Experience with Docker, Workflow Description Language, and version control would be considered an asset.
  • Experience with databases/SQL would be considered an asset.