Study and Analysis on Document Clustering Based on MapReduce in Hadoop using K-Mean Algorithm
Journal: International Journal of Science and Research (IJSR) (Vol. 4, No. 8)
Publication Date: 2015-08-05
Authors : Yashika Verma; Sumit Kumari;
Page : 176-180
Keywords : Hadoop; MapReduce; Document Clustering; Direct K-Means; Distributed K-Means; Large DataSet;
Abstract
Document clustering is an effective tool for managing information overload. Grouping similar documents together lets a human observer browse large document collections quickly and grasp the distinct topics and subtopics in them, and lets search engines query large collections efficiently, among many other applications. Hence, it has been widely studied as part of the broad literature on data clustering. MapReduce is a simplified programming model for distributed parallel computing. It originated at Google and is commonly used for data-intensive distributed parallel computing. In this paper, we describe how document clustering for large collections can be implemented efficiently with MapReduce. The Hadoop implementation provides a convenient and flexible framework for distributed computing on a cluster of commodity machines. The design and implementation of the direct K-Means and distributed K-Means algorithms on MapReduce are presented.
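The abstract does not give the details of the authors' design, so the following is only a generic sketch of how one K-Means iteration is commonly split into a map step and a reduce step. It runs in plain Python on a single machine rather than on Hadoop, and it assumes each document has already been turned into a numeric feature vector (for example, term weights). The function names `map_phase`, `reduce_phase`, and `kmeans` are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from math import sqrt


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def map_phase(points, centroids):
    """Map step: emit (index of nearest centroid, point) for every point."""
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: distance(p, centroids[i]))
        yield nearest, p


def reduce_phase(pairs, centroids):
    """Reduce step: group points by centroid index and average each group.

    A centroid that received no points keeps its previous position
    (an assumption made for this sketch).
    """
    groups = defaultdict(list)
    for key, p in pairs:
        groups[key].append(p)
    updated = list(centroids)
    for key, members in groups.items():
        dim = len(members[0])
        updated[key] = tuple(
            sum(m[d] for m in members) / len(members) for d in range(dim)
        )
    return updated


def kmeans(points, centroids, max_iter=20, tol=1e-9):
    """Repeat map and reduce until the centroids stop moving."""
    for _ in range(max_iter):
        updated = reduce_phase(map_phase(points, centroids), centroids)
        shift = max(distance(a, b) for a, b in zip(centroids, updated))
        centroids = updated
        if shift <= tol:
            break
    return centroids


# Toy "document vectors" with two obvious groups.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
result = kmeans(points, [(1.0, 1.0), (9.0, 1.0)])
print(result)
```

In an actual Hadoop job, the map step would run in parallel over splits of the input, the framework would group the emitted pairs by key, and each iteration would typically be a separate job that reads the previous centroids; this sketch leaves all of that out.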