ResearchBib Share Your Research, Maximize Your Social Impacts
Sign for Notice Everyday Sign up >> Login

Pairwise Sequence Alignment using Bio-Database Compression by Improved Fine Tuned Enhanced Suffix Array

Journal: The International Arab Journal of Information Technology (Vol.12, No. 4)

Publication Date:

Authors : ; ; ;

Page : 352-359

Keywords : Sequence alignment; enhanced suffix array; compression; minimum perfect hash function; data mining.;

Source : Downloadexternal Find it from : Google Scholarexternal

Abstract

Sequence alignment is a bioinformatics application that determines the degree of similarity between nucleotide sequences which is assumed to have same ancestral relationships. This sequence alignment method reads query sequence from the user and makes an alignment against large and genomic sequence data sets and locate targets that are similar to an input query sequence. Existing accurate algorithm, such as smith-waterman and FASTA are computationally very expensive, which limits their use in practice. The existing search tools, such as BLAST and WUBLAST, employ heuristics to improve the speed of such searches. However, such heuristics can sometimes miss targets, in which many cases are undesirable. Considering the rapid growth of database sizes, this problem demands evergrowing computation resources and remains as a computational challenge. Most common sequence alignment algorithms like BLAST, WU-BLAST and Sequance Comparasion Tool (SCT) searches a given query sequence against set of database sequences. In this paper, Biological Data Base Compression Tool using Minimum Perfect Hash Function (BioDBMPHF) tool has been developed to find pair wise local sequence alignment by preprocessing the database. Preprocessing is done by means of finding Longest Common Substring (LCS) from the database of sequences that have the highest local similarity with a given query sequence and reduces the size of the database based on frequent common subsequence. In this BioDBMPHF tool fine-tuned enhanced suffix array is constructed and used to find LCS. Experimental results show that hash index algorithm reduces the time and space complexity to access LCS. Time complexity to find LCS of the hash index algorithm is O(2+γ) where ‘γ' is the time taken to access the pattern. Space complexity of fine-tuned enhanced suffix array is 5n bytes per character for reduced enhanced Longest Common Prefix (LCP) table and to store bucket table it requires 32 bytes. Data mining technique is used to cross validate the result. It is proved that the developed BioDBMPHF tool effectively compresses the database and obtains same results compared to that traditional algorithm in approximately half the time taken by them thereby reducing the time complexity

Last modified: 2019-11-14 22:52:50