Efficient DNA Sequence Classification through Machine Learning Techniques

Abstract
In computational biology and biomedical data analysis, classifying DNA sequences is a significant challenge, and identifying and classifying the DNA sequences of various species is of utmost importance. Various Machine Learning (ML) techniques have recently been applied to this task with success. This study introduces a new approach for effectively distinguishing valid DNA sequences from unrelated sequences using different ML techniques. The valid datasets were systematically collected from the NCBI database, while the unrelated datasets were generated using random techniques. Various ML techniques were then applied to distinguish between these two categories. Gradient Boosting Machine (GBM) performed best, achieving 0.971 accuracy and a 0.975 F1 score. XGBoost also performed well, achieving 0.935 accuracy and a 0.93 F1 score. The proposed method also consistently achieved the best execution time when compared with other existing machine learning methods. The results were further verified using a Phylogenetic Tree constructed through Clustal Omega, a well-known traditional alignment-based method for DNA sequence comparison. The results of the two approaches were consistent, although Clustal Omega had a much higher execution time than the present method. Therefore, the proposed technique significantly enhances the efficiency of DNA sequence classification.
Keywords: Bioinformatics, Clustal Omega, DNA Sequence Comparison, Machine Learning, Phylogenetic Tree.

Author(s): Papri Ghosh, Subhram Das*, Debrupa Pal, Soumyabrata Saha, Suparna Dasgupta
Volume: 6 Issue: 3 Pages: 1180-1189
DOI: https://doi.org/10.47857/irjms.2025.v06i03.04915
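
The data preparation described in the abstract (randomly generated unrelated sequences, converted to numeric features for an ML classifier) can be illustrated with a minimal sketch. The abstract does not state how sequences are encoded, so the k-mer frequency encoding below is an assumption chosen for illustration, as are the function names random_dna and kmer_frequencies, the sequence length of 200, and k = 3. It is not the authors' implementation.

```python
import random
from itertools import product

ALPHABET = "ACGT"


def random_dna(length, rng):
    """Return a uniformly random DNA string (stand-in for an 'unrelated' sequence)."""
    return "".join(rng.choice(ALPHABET) for _ in range(length))


def kmer_frequencies(seq, k=3):
    """Map a sequence to a fixed-length vector of normalized k-mer counts.

    Assumed encoding: the source does not specify its feature representation.
    """
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing ambiguous bases such as N
            counts[window] += 1
    total = sum(counts.values())
    return [counts[m] / total if total else 0.0 for m in kmers]


# Hypothetical negative class: five random sequences of length 200.
rng = random.Random(0)
negative = [random_dna(200, rng) for _ in range(5)]
features = [kmer_frequencies(s, k=3) for s in negative]
labels = [0] * len(features)  # 0 = unrelated; valid NCBI sequences would be labeled 1
```

Feature vectors for valid sequences downloaded from NCBI would be built the same way and labeled 1. The combined matrix could then be passed to any standard classifier, for example a gradient boosting model, to reproduce the kind of two-class comparison the abstract reports.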