Interdisciplinary research team enhances DNA barcoding statistical rigor in new research study.
DNA barcoding promised to identify the genetic “signatures” of global species in a time of extinction and ongoing environmental crisis. This system for species identification uses standardized genetic regions that act much like the barcodes read by supermarket scanners. More specifically, DNA barcoding involves sequencing short fragments of mitochondrial genes, which are then compared against large databases of sequences from previously identified species, known as reference libraries. The approach has succeeded in identifying genetic diversity among a wide variety of species, including plants, fungi, insects, fish and larger animals, such as giant pandas.
With reference libraries growing exponentially, there is a need to develop meaningful ways to analyze the vast amounts of available DNA barcoding information, which can then be applied in fields ranging from forensics to resource conservation.
A central concept in DNA barcoding is what is known as the barcode gap. The DNA barcode gap is the premise that the genetic variation found among species is greater than the variation seen within a species. However, the interpretation of DNA barcode gaps often lacks statistical rigor.
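As a rough illustration of the idea, the gap can be expressed as the smallest distance between sequences from different species minus the largest distance between sequences from the same species. The Python sketch below is not the method used in the study; it assumes pre-aligned sequences of equal length, uses a simple proportion-of-differing-sites distance, and runs on made-up toy sequences. The names p_distance, barcode_gap and toy_records are hypothetical.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Proportion of sites that differ between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def barcode_gap(records):
    """Return (max within-species distance, min among-species distance, gap).

    records is a list of (species_name, aligned_sequence) pairs.
    A positive gap means every among-species distance exceeds every
    within-species distance in this sample.
    """
    within, among = [], []
    for (sp1, seq1), (sp2, seq2) in combinations(records, 2):
        target = within if sp1 == sp2 else among
        target.append(p_distance(seq1, seq2))
    if not within or not among:
        raise ValueError("need at least one within- and one among-species pair")
    max_within = max(within)
    min_among = min(among)
    return max_within, min_among, min_among - max_within


# Hypothetical toy data, invented for illustration only.
toy_records = [
    ("species_A", "ACGTACGTAC"),
    ("species_A", "ACGTACGTAT"),
    ("species_B", "ACGAACCTAC"),
    ("species_B", "ACGAACCTAA"),
]

max_within, min_among, gap = barcode_gap(toy_records)
print(f"max within: {max_within:.2f}, min among: {min_among:.2f}, gap: {gap:.2f}")
```

With only two sequences per species, a gap computed this way says little about the species as a whole, which is the kind of sampling concern the researchers raise.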
Computational Statistics Reduces DNA Barcoding Errors
University of Guelph School of Computer Science professor Dr. Daniel Gillis, Department of Integrative Biology professor Dr. Robert Hanner and postdoctoral fellow Dr. Jarrett Phillips looked into DNA barcoding interpretation, focusing on the DNA barcode gap.
“We investigated statistical approaches to better characterize the DNA barcode gap through barcoding Canadian Pacific fishes, including Sockeye salmon,” says Phillips. “We selected these native Pacific fish species because many of them hold strong socioeconomic and conservation importance within Canada and globally, particularly as food commodities.”
The team explored using frequency histograms and dotplots to better visualize the DNA barcode gap. Frequency histograms and dotplots are just two types of graphs that can bring much-needed rigor to its interpretation.
Future Research to Look at Sample Size, Interpretation Procedures
To interpret the DNA barcode gap, larger sample sizes for each species are required to capture genetic variation, and more appropriate interpretation procedures need to be developed to draw conclusions when limited DNA sequence data is available for a species. Gillis and the team also noted that better descriptive statistics (minimum, maximum and average) are needed to summarize genetic sequence data. They also conclude that mixed statistical models, which are flexible models that can account for variability, could be used to tease out genetic differences observed among and within species.
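The descriptive statistics mentioned above (minimum, maximum and average) can be computed per species from pairwise distances. The following Python sketch is one possible way to do so, not the team's procedure. It assumes aligned sequences of equal length and a simple proportion-of-differences distance, and the function names and sequences are hypothetical. It also shows why sample size matters: a species represented by a single sequence has no within-species distances to summarize.

```python
from collections import defaultdict
from itertools import combinations
from statistics import mean


def p_distance(a: str, b: str) -> float:
    """Proportion of sites that differ between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def summarize_within_species(records):
    """Summarize within-species pairwise distances for each species.

    records is a list of (species_name, aligned_sequence) pairs.
    Statistics are None when a species has fewer than two sequences.
    """
    by_species = defaultdict(list)
    for species, seq in records:
        by_species[species].append(seq)

    summary = {}
    for species, seqs in by_species.items():
        dists = [p_distance(a, b) for a, b in combinations(seqs, 2)]
        summary[species] = {
            "n_sequences": len(seqs),
            "min": min(dists) if dists else None,
            "max": max(dists) if dists else None,
            "mean": mean(dists) if dists else None,
        }
    return summary


# Hypothetical toy data, invented for illustration only.
toy_records = [
    ("species_A", "ACGTACGTAC"),
    ("species_A", "ACGTACGTAT"),
    ("species_A", "ACGTACGAAT"),
    ("species_B", "ACGAACCTAC"),  # singleton: no within-species pairs
]

summary = summarize_within_species(toy_records)
for species, stats in summary.items():
    print(species, stats)
```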
“Species-level discrimination is challenging (or unreliable) without extensive reference libraries,” says Gillis. “The methods we outlined have the potential to open closed doors, giving biodiversity researchers and regulatory scientists an unprecedented view of patterns left in DNA sequences from key evolutionary mechanisms and processes responsible for shaping Earth's biodiversity over millions of years.”
Dr. Daniel Gillis is an Associate Professor and Statistician in the School of Computer Science.
Dr. Jarrett Phillips is a Postdoctoral Fellow under Dr. Robert Hanner in the Department of Integrative Biology.
This work was supported by the University of Guelph College of Engineering and Physical Sciences (CEPS) Graduate Excellence Entrance Scholarship.
Phillips J, Gillis D, Hanner R. Lack of Statistical Rigor in DNA Barcoding Likely Invalidates the Presence of a True Species’ Barcode Gap. Front. Ecol. Evol. 2022 Apr 14. doi: 10.3389/fevo.2022.859099.