
An Efficient Approach for Clustering High Dimensional Data

Journal: International Journal of Scientific and Technical Advancements (IJSTA) (Vol.2, No. 1)

Publication Date:

Authors :

Page : 37-43

Keywords : Clustering; BAT algorithm; k-medoids; HADOOP;

Source : Download | Find it from : Google Scholar

Abstract

Big data analytics allows a small number of users to burn through a large amount of money very quickly. The problem is exacerbated by the exploratory nature of big data analytics, where queries are iteratively refined and many erroneous queries are submitted (e.g., against a big streaming data cluster). In existing systems, clustering can begin only after downloading completes, so several hours of expensive compute time are often consumed before results appear. This project shows that progressive fetching and progressive clustering can be combined to support incremental query interactions for data analysts. High Dimensional (HD) clustering has been used successfully in many clustering problems; however, most applications deal with static data. This project considers how to apply HD clustering to incremental clustering problems. Clustering data by identifying a subset of representative examples is important for detecting patterns in data and for processing sensory signals. Such “exemplars” can be found by randomly choosing an initial subset of data points as exemplars and then iteratively refining it, but this works well only if the initial choice is close to a good solution. This thesis describes a method called “Big Data Clustering using k-Medoids BAT Algorithm” (KMBAT) that simultaneously considers all data points as potential exemplars, exchanging real-valued messages between data points until a high-quality set of exemplars and corresponding clusters gradually emerges. KMBAT takes as input a set of pairwise similarities between data points and finds clusters by maximizing the total similarity between data points and their exemplars. Similarity can be defined simply as negative squared Euclidean distance for compatibility with other algorithms, or it can incorporate richer domain-specific models (e.g., translation-invariant distances for comparing images).
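The negative-squared-Euclidean similarity mentioned above can be sketched as follows. This is a minimal illustration of the similarity input KMBAT would consume, not the paper's implementation; the function name and example data are assumptions.

```python
import numpy as np

def negative_sq_euclidean_similarity(X):
    """Pairwise similarity S with S[i, j] = -||x_i - x_j||^2.

    A sketch of the simple similarity measure described in the abstract;
    richer, domain-specific similarities could be substituted.
    """
    # Squared norms of each row, shaped for broadcasting.
    sq = np.sum(X ** 2, axis=1)
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 * a.b
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    # Guard against tiny negative values from floating-point error.
    return -np.maximum(d2, 0.0)

# Hypothetical toy data: two points at distance 5, so S[0, 1] = -25.
X = np.array([[0.0, 0.0], [3.0, 4.0]])
S = negative_sq_euclidean_similarity(X)
```

For a dense problem like this, the matrix `S` holds all n² similarities, which is exactly the quadratic memory cost the abstract notes for non-sparse inputs.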
KMBAT’s computational and memory requirements scale linearly with the number of input similarities; for non-sparse problems, where all possible similarities are computed, these requirements scale quadratically with the number of streamed data points. KMBAT is demonstrated on FACEBOOK social network user profile data, stored on a big data HDInsight server; in these experiments KMBAT finds better clustering solutions than other methods in less time.
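As context for the k-medoids component the method builds on, a plain alternating k-medoids loop can be sketched as below. This is a standard simplification, not KMBAT itself: the abstract indicates KMBAT couples k-medoids with a BAT metaheuristic, which would replace this greedy local search; the function name, parameters, and `init` option are assumptions for illustration.

```python
import numpy as np

def kmedoids(X, k, n_iter=100, seed=0, init=None):
    """Plain alternating k-medoids on squared Euclidean distances.

    A simplified stand-in: KMBAT, per the abstract, uses a BAT-style
    search over exemplars instead of this greedy swap step.
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    # Full pairwise squared-distance matrix (quadratic in n, as noted).
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    medoids = np.array(init) if init is not None else rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        # Assign each point to its nearest medoid.
        labels = np.argmin(d2[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) == 0:
                continue
            # The new medoid minimizes total distance to its cluster.
            costs = d2[np.ix_(members, members)].sum(axis=0)
            new_medoids[j] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = np.argmin(d2[:, medoids], axis=1)
    return medoids, labels
```

Because medoids are always actual data points, the method needs only the pairwise similarity (or distance) matrix, matching the pairwise-similarity input format described for KMBAT.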

Last modified: 2016-02-13 12:54:04