(External) Sharing, searching, and analysis of genomics datasets in the cloud

Advisor: Heather Ward, DNAstack

No specific proposed co-advisors. See Bioinformatics Faculty page for faculty listings.

About Us

DNAstack is a team of software engineers, researchers, and geneticists building a cloud software solution that will accelerate discoveries in genomics and benefit human health. We believe that our platform will power the next generation of genomics research. DNAstack is situated in Toronto, Canada and is built on top of some of the leading cloud solutions. We work closely with researchers, clinicians and other users to help them realise their bioinformatics needs at scale.

DNAstack is committed to the development of open science and is part of global initiatives to standardise bioinformatics tools. We strive to incorporate the latest open APIs to increase data accessibility for all users and to commoditise access to scalable bioinformatics.

More information on DNAstack can be found at https://dnastack.com

Project Description

This project offers students the opportunity to apply bioinformatics practices in an industry environment. Students will write a bioinformatics workflow and associated documentation that will act as an introduction and walkthrough to DNAstack’s product suite. The aim of this workflow will be to run a simple analysis on data that has been shared using DNAstack’s Publisher, queried using Explorer, and run via Workbench. This workflow and its associated documentation will be made publicly available and will act as a tutorial and introduction to the end-to-end use of DNAstack’s product suite.

The workflow will be written in the Workflow Description Language (WDL) and its tools will be containerised using Docker. All code will be version controlled using git. The student will set up their bioinformatics pipeline to run reproducibly in all three major cloud environments (Amazon Web Services, Microsoft Azure, and Google Cloud Platform) as well as in an HPC environment, and will optimise their pipeline for speed, accuracy, and compute cost. The student will validate the accuracy of their pipeline and write thorough documentation detailing the analysis steps, pipeline inputs and outputs, and the value added by their analysis. Finally, the student will publish their workflow and documentation on Dockstore, where it can be run by other researchers.

Students will be responsible for developing, testing, and ensuring the quality of the workflow they write. Potential workflows a student could work on include:

RNA-Seq analysis
Lineage assignment and construction of phylogenetic trees
Structural and copy number variant analysis

These pipelines should be based on accepted, published methods but will offer students the chance to use the latest containerisation, bioinformatics, and cloud technologies. Additionally, students will be exposed to a high-throughput cloud environment capable of analysing terabytes of data concurrently. Students will gain invaluable hands-on cloud experience.

This is a one-semester, remote position.

Project Responsibilities

The student will familiarise themselves with the existing public datasets available on DNAstack’s Explorer networks (e.g. viral.ai, neuroscience.ai) and choose at least two to be combined and co-analysed; alternatively, the student may choose an existing data source and identify a second public dataset that is not currently available in any of DNAstack’s networks, and add this dataset to DNAstack’s network.
The student will determine a simple analysis that can be run on the given datasets. This analysis should run quickly (since it is meant to act as a tutorial workflow for users to get up and running with DNAstack’s tooling), and produce at minimum visually interesting graphs or charts as output.
The student will package the steps and tools necessary to analyse their chosen dataset into Docker containers and implement their analysis as a workflow in Workflow Description Language (WDL).
The student will thoroughly document the steps involved in a) connecting a new dataset and sharing data using Publisher; b) querying datasets in DNAstack’s networks using Explorer; and c) running their analysis pipeline using Workbench. These steps should be documented both for the web browser and, with the corresponding commands, for DNAstack’s command-line interface (CLI).
The student will test all aspects of created workflows and will assist in the quality assurance of the outputs.

Knowledge and Skills

Experience with R, Python, Bash, Java, or related data analysis languages (required).
Experience with Unix and common bioinformatics tools for genomic analysis (required).
A strong background in the field of molecular genetics, particularly in the analysis of DNA and RNA NGS data (required).
Experience working in a cloud environment would be considered an asset.
Experience with Docker, Workflow Description Language, and version control would be considered an asset.
Experience with databases/SQL would be considered an asset.

Graduate Program in Bioinformatics