PhD Seminar: Sajid Marhon
School of Computer Science PhD student Sajid Marhon will present a seminar on Thursday, February 26, at 1:00 pm in Reynolds Room 219.
Development of a New, DSP-based Gene Prediction Technique
Detecting protein-coding regions is an essential step in genome analysis because it opens the door to analyzing protein sequences. Several methods have been proposed for detecting protein-coding regions in DNA sequences. Some use probabilistic models such as hidden Markov models (HMMs), which depend on homology information (a learning model) to analyze DNA sequences; this restricts their application to species that have homologs. In addition, HMM performance is coupled to the homology between the DNA sequences to be labeled and those in the training datasets. Other methods use sequence similarity, annotating unknown sequences by comparing them with already annotated sequences. Both of these kinds of methods are classified as model-dependent. Digital signal processing (DSP)-based methods, by contrast, rely on spectrum analysis: they extract the period-3 spectrum of DNA sequences to detect protein-coding regions, and they are classified as model-independent. Their accuracy is still moderate, however, because some parameters obstruct their performance.
In this research, we propose an improved DSP-based method that overcomes both the limited prediction accuracy of current DSP-based methods and the application specificity of learning-based methods. The method introduces four improvements at different stages of the gene prediction process. In the sequence mapping stage, we propose an adaptive representation scheme that assigns different numerical values to the four nucleotides depending on how informative each nucleotide is in the period-3 spectrum. To extract the period-3 spectrum property, we propose using the nucleotide distribution variance tool, which requires less computational time than the discrete Fourier transform (DFT).
Our method also includes post-processing of the period-3 spectrum signal to enhance this property and attenuate background noise. The post-processing also detects period-3 spectrum peaks, which avoids using an experimental threshold value to detect protein-coding regions. The experimental results show that our technique outperforms other DSP-based techniques. In addition, a comparison with the HMMgene technique, which we re-implemented based on the literature, shows that our technique performs better on the novel gene detection problem. Another advantage of the proposed method is that it can discover regions that are not protein-coding but are likely to contain other functional patterns in similar DNA sequences. Our experimental analysis showed that HMMgene, by contrast, could not detect any coding region or biological feature in these regions. This work also highlights and explores the capabilities of techniques that perform better than homology-based techniques for de novo protein prediction. We believe that this area of research has been underemphasized and deserves additional attention.
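For readers unfamiliar with the period-3 property mentioned in the abstract, the following is a minimal sketch of the conventional DFT-based measure that DSP-based gene finders typically start from. It is not the speaker's method: it uses a plain binary indicator mapping (one 0/1 sequence per nucleotide) rather than the adaptive representation scheme, and the DFT rather than the nucleotide distribution variance tool. The function names `period3_power` and `sliding_period3`, and the default window and step sizes, are assumptions chosen for illustration only.

```python
import cmath

NUCLEOTIDES = "ACGT"


def period3_power(window: str) -> float:
    """Period-3 spectral power of one window of DNA.

    Each nucleotide gets a binary indicator sequence (1 where it occurs,
    0 elsewhere). The DFT of each indicator sequence is evaluated at
    k = N/3 only, and the squared magnitudes are summed over A, C, G, T.
    """
    n = len(window)
    if n == 0 or n % 3 != 0:
        raise ValueError("window length must be a positive multiple of 3")
    total = 0.0
    for base in NUCLEOTIDES:
        # At k = N/3 the DFT kernel exp(-2*pi*j*k*pos/N) reduces to
        # exp(-2*pi*j*pos/3), so only positions holding `base` contribute.
        coeff = sum(
            cmath.exp(-2j * cmath.pi * pos / 3)
            for pos, symbol in enumerate(window)
            if symbol == base
        )
        total += abs(coeff) ** 2
    return total


def sliding_period3(sequence: str, window: int = 351, step: int = 3):
    """Return (start, power) pairs for windows slid along the sequence.

    The window and step defaults are illustrative assumptions.
    """
    return [
        (start, period3_power(sequence[start:start + window]))
        for start in range(0, len(sequence) - window + 1, step)
    ]


if __name__ == "__main__":
    # Toy sequence: a codon-like repeat flanked by a non-periodic filler.
    toy = "A" * 90 + "ATG" * 30 + "A" * 90
    profile = sliding_period3(toy, window=90, step=3)
    best_start, best_power = max(profile, key=lambda item: item[1])
    print(best_start, round(best_power, 3))
```

In this toy input the power profile peaks where the window exactly covers the three-periodic stretch. On real sequences the profile is noisy, which is the motivation the abstract gives for post-processing and for peak detection in place of a fixed threshold.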