Generative Topic Modeling in Taxonomic Structure of Genomic Data using LDA

Journal: International Journal of Computer Science and Mobile Computing - IJCSMC (Vol.3, No. 7)

Publication Date:

Authors : ; ;

Page : 832-840

Keywords : Data mining; Bioinformatics (genome or protein) databases; Language models; Metagenomics;


Abstract

Probabilistic topic models have been developed for applications in various domains, such as text mining and information retrieval. In this work, we focus on probabilistic topic models based on Latent Dirichlet Allocation (LDA); specifically, we propose a probabilistic topic model for data analysis and functional analysis using a homology-based approach and a composition-based approach. We aim to develop a new method that can analyze the genome-level composition of DNA sequences in order to characterize a set of common genomic features shared by the same species and identify their functional roles. To this end, we first show that a generative topic model can be used to model the taxon abundance information obtained by the homology-based approach and to study the microbial core. The model treats each sample as a ‘document’ that contains a mixture of functional groups, while each functional group (also known as a ‘latent topic’) is a weighted mixture of species. Estimating the generative topic model for taxon abundance data therefore uncovers the distribution over latent functions (latent topics) in each sample. Second, we apply a composition-based approach that breaks DNA sequences down into sub-reads called ‘N-mers’ and represents the sequences by their N-mer frequencies. We then introduce the LDA model to study the genome-level statistical patterns (a.k.a. latent topics) of the N-mer features. Each estimated latent topic represents a certain component of the whole genome.
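
The two representations described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the choice of overlapping windows, the default N, the Dirichlet prior values, and the species labels are all assumptions made for illustration. The first function turns a DNA sequence into an N-mer frequency vector (the composition-based view). The second draws one sample's taxon abundances from an LDA-style generative process, in which the sample is a mixture of latent topics and each topic is a weighted mixture of species (the homology-based view).

```python
import random
from collections import Counter
from itertools import product


def nmer_frequencies(sequence, n=3, alphabet="ACGT"):
    """Return relative frequencies of overlapping N-mers in a DNA sequence.

    Overlapping windows and n=3 are illustrative assumptions.
    """
    sequence = sequence.upper()
    # Fixed vocabulary of all len(alphabet)**n N-mers keeps vectors comparable.
    vocab = ["".join(p) for p in product(alphabet, repeat=n)]
    counts = Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))
    # Windows containing symbols outside the alphabet (e.g. 'N') are ignored.
    total = sum(counts[w] for w in vocab)
    return {w: (counts[w] / total if total else 0.0) for w in vocab}


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalising gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_sample(topic_species, alpha, n_reads, rng):
    """Generate one sample ('document') under an LDA-style process.

    topic_species: one dict per latent topic, mapping species -> probability.
    Returns the sample's topic mixture and its taxon abundance counts.
    """
    theta = sample_dirichlet(alpha, rng)  # topic mixture for this sample
    abundance = Counter()
    for _ in range(n_reads):
        # Pick a latent topic (functional group), then a species from it.
        z = rng.choices(range(len(theta)), weights=theta)[0]
        species = list(topic_species[z])
        weights = list(topic_species[z].values())
        abundance[rng.choices(species, weights=weights)[0]] += 1
    return theta, abundance


# Hypothetical topics and species labels, for illustration only.
topics = [
    {"species_a": 0.7, "species_b": 0.3},
    {"species_b": 0.2, "species_c": 0.8},
]
rng = random.Random(0)
theta, abundance = generate_sample(topics, alpha=[0.5, 0.5], n_reads=100, rng=rng)
freqs = nmer_frequencies("ACGTACGT", n=2)
```

Fitting the model runs this process in reverse: given observed abundance counts or N-mer frequencies for many samples, inference estimates the per-sample topic mixtures and the per-topic distributions.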

Last modified: 2014-07-30 23:40:41