A FAST CLUSTERING BASED FEATURE ALGORITHM FOR HIGH DIMENSIONAL DATA

Journal: International Journal of Computer Science and Mobile Applications IJCSMA (Vol.2, No. 12)

Publication Date:

Authors : ; ;

Page : 23-29

Keywords : Data mining; Feature selection; FAST algorithm; relevant features; redundant features;

Abstract

Clustering tries to group a set of points into clusters such that points in the same cluster are more similar to each other than to points in different clusters, under a particular similarity metric. In the generative clustering model, a parametric form of data generation is assumed, and the goal in the maximum likelihood formulation is to find the parameters that maximize the probability (likelihood) of generating the data given the model. In the most general formulation, the number of clusters k is also treated as an unknown parameter. Such a formulation is called a “model selection” framework, since it has to choose the value of k under which the clustering model best fits the data. Semi-supervised learning is a class of machine learning techniques that use both labeled and unlabeled data for training, typically a small amount of labeled data with a large amount of unlabeled data. It falls between unsupervised learning (no labeled training data) and supervised learning (fully labeled training data). Feature selection involves identifying a subset of the most useful features that produces results compatible with those of the original, entire feature set. A feature selection algorithm may be evaluated in terms of both efficiency and effectiveness: efficiency concerns the time required to find a subset of features, while effectiveness relates to the quality of that subset. Traditional approaches to clustering data are based on metric similarities, i.e., measures that are nonnegative, symmetric, and satisfy the triangle inequality, and use graph-based algorithms. To replace this process, we select more recent approaches such as the Affinity Propagation (AP) algorithm, which can also take general similarities as input.
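
As an illustration of the three properties the abstract lists for metric similarities (nonnegativity, symmetry, and the triangle inequality), the following minimal Python sketch checks a matrix of pairwise dissimilarities against them. It is not taken from the paper: the function names `metric_violations` and `is_metric`, the tolerance `tol`, and the two toy matrices are assumptions made for this example only.

```python
from itertools import product


def metric_violations(d, tol=1e-9):
    """List violations of nonnegativity, symmetry, and the triangle
    inequality in a square matrix d of pairwise dissimilarities.

    Hypothetical helper for illustration; not part of the paper.
    """
    n = len(d)
    violations = []
    for i, j in product(range(n), repeat=2):
        # Nonnegativity: every entry must be >= 0.
        if d[i][j] < -tol:
            violations.append(f"negative: d[{i}][{j}]")
        # Symmetry: d[i][j] must equal d[j][i] (checked once per pair).
        if i < j and abs(d[i][j] - d[j][i]) > tol:
            violations.append(f"asymmetric: d[{i}][{j}] != d[{j}][{i}]")
    for i, j, k in product(range(n), repeat=3):
        # Triangle inequality: going through j can never be shorter.
        if d[i][k] > d[i][j] + d[j][k] + tol:
            violations.append(
                f"triangle: d[{i}][{k}] > d[{i}][{j}] + d[{j}][{k}]"
            )
    return violations


def is_metric(d, tol=1e-9):
    """True if d satisfies all three properties checked above."""
    return not metric_violations(d, tol)


# Toy data: absolute differences between the points 0, 1, 3 on a line.
euclid = [[0, 1, 3],
          [1, 0, 2],
          [3, 2, 0]]

# Squared differences of the same points: symmetric and nonnegative,
# but 9 > 1 + 4 breaks the triangle inequality.
squared = [[0, 1, 9],
           [1, 0, 4],
           [9, 4, 0]]

if __name__ == "__main__":
    print(is_metric(euclid))
    print(metric_violations(squared))
```

The squared-difference matrix is an example of the kind of non-metric input that a traditional metric-based method would not accept as-is. The check is cubic in the number of points, so it is suitable only for small matrices.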

Last modified: 2014-12-17 22:27:17