Barcode Cracking

Posted on Tuesday, February 25th, 2020

Photograph of a woman's eye with an overlay image of a supermarket barcode.
U of G’s Barcode of Life Data System houses over four million DNA barcodes—fragments of DNA used to identify the Earth’s diverse species.

U of G computer scientists and biologists create software to help catalogue our world’s species.

It’s staggering. More than 800 plant and animal species have gone extinct in the past 500 years, and more than 30,000 species are threatened with extinction. Perhaps more troubling: we cannot comprehend the scale of the conservation problem because most endangered species have yet to be identified. In fact, we have no idea how many species exist on Earth. Estimates range from fewer than two million to one trillion, and scientists are discovering new species every day.

One method of identifying and cataloguing species is DNA “barcoding,” in which biologists analyze DNA samples taken from living things to sort them into species groups. However, when only a small number of samples is available, specimens may be misidentified. Furthermore, it is difficult to know how many samples are needed to accurately assign specimens to species groups.

University of Guelph PhD candidate Jarrett Phillips, co-supervised by computer science professor Daniel Gillis and integrative biology professor Robert Hanner, has developed a free, publicly available statistical software package that calculates how many samples are likely needed to accurately observe current levels of genetic variation within a species.

To assess within-species variation, scientists use “haplotype accumulation curves,” graphs that plot the number of unique DNA sequences (haplotypes) against the number of individuals sampled for a species of interest. When the line on the graph plateaus, most genetic variation for that species has likely been uncovered. With this information, biodiversity scientists can better judge whether they have enough DNA barcodes to confidently assign specimens to species.

However, haplotype accumulation curves behave differently depending on the species in question, and most studies do not account for this variation. That’s why the research team developed a Haplotype Accumulation Curve Simulator, “HACSim,” which uses an algorithm to integrate species’ haplotype frequency information into its calculations. HACSim repeatedly refines its sample size estimate through random sampling (meaning haplotypes are selected by chance until they have all been captured). The team tested their software on both hypothetical and real species datasets for fish and ticks.
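For readers who want to see the idea in action, here is a minimal Python sketch of a haplotype accumulation curve built by random sampling. It is not HACSim itself (HACSim is an R package, and its algorithm is described in the paper cited below); the function names, the example frequencies, and the 95 per cent threshold are all assumptions made for illustration.

```python
import random


def accumulation_curve(haplotype_freqs, max_samples, n_permutations=1000, seed=0):
    """Mean number of unique haplotypes seen after 1..max_samples random draws.

    haplotype_freqs: assumed relative frequency of each haplotype in the species.
    Illustrative only; this is not the HACSim algorithm.
    """
    rng = random.Random(seed)
    labels = list(range(len(haplotype_freqs)))
    totals = [0] * max_samples
    for _ in range(n_permutations):
        # Draw individuals at random, weighted by haplotype frequency.
        draws = rng.choices(labels, weights=haplotype_freqs, k=max_samples)
        seen = set()
        for i, haplotype in enumerate(draws):
            seen.add(haplotype)
            totals[i] += len(seen)  # unique haplotypes after i + 1 individuals
    return [total / n_permutations for total in totals]


def samples_needed(curve, n_haplotypes, proportion=0.95):
    """Smallest sample size whose mean unique count reaches the target share.

    Returns None if the curve never reaches the target within its length.
    """
    target = proportion * n_haplotypes
    for n, mean_unique in enumerate(curve, start=1):
        if mean_unique >= target:
            return n
    return None


if __name__ == "__main__":
    # Hypothetical species: one common haplotype and several rare ones.
    freqs = [0.60, 0.20, 0.10, 0.05, 0.05]
    curve = accumulation_curve(freqs, max_samples=200)
    print("Mean unique haplotypes after 10 samples:", round(curve[9], 2))
    print("Samples needed to observe 95% of haplotypes:",
          samples_needed(curve, n_haplotypes=len(freqs)))
```

The curve rises quickly while common haplotypes are being found and flattens as only rare ones remain, which is why a species with many rare haplotypes needs far more samples before the line plateaus.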

“HACSim enabled us to efficiently calculate how many more samples we needed to improve the accuracy of these datasets. That kind of information will help researchers strategically allocate sampling efforts and build more comprehensive barcode libraries. All of this translates to a better understanding of the extraordinary diversity of life on Earth, which is crucial for saving endangered species,” says Dan Gillis.

This work was supported by the College of Engineering and Physical Sciences (CEPS) Graduate Excellence Entrance Scholarship to Jarrett D. Phillips.

Phillips, JD, French, SH, Hanner, RH, Gillis, DJ. HACSim: an R package to estimate intraspecific sample sizes for genetic diversity assessment using haplotype accumulation curves. PeerJ Computer Science 2020 Jan 6. doi: 10.7717/peerj-cs.243

