Performance Assessment of Various Text Document Features through K-Means Document Clustering Approach

Journal: International Journal of Advanced Trends in Computer Science and Engineering (IJATCSE) (Vol.8, No. 5)

Publication Date:

Authors : ; ;

Page : 1969-1977

Keywords : TFIDF; Word Frequency; Probability; pre-processing; Clustering; K-Means

Abstract

Text documents play an important role in the use of the World Wide Web, and many users need large numbers of them to gather information in their field of interest. Serving these users requires retrieving documents on the appropriate topic, so researchers in text document mining have produced many algorithms for indexing and retrieving text documents. The success of clustering depends on selecting an appropriate similarity metric. Document clustering is performed in two steps: feature extraction, followed by the clustering operation. Features are extracted from the text documents through preprocessing, tokenization, stop-word removal, stemming, and bag-of-words construction. Applying these operations in succession yields three document-representing features, namely WordCount, TF-IDF, and word probability, which are passed to the clustering algorithm. In the clustering phase, the three features are used together with several similarity measures to perform the clustering operation. The proposed method yields better results for the probability-based features than for the other two, TF-IDF and WordCount.
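The two-step pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the stop-word list, the `crude_stem` suffix stripper, the TF-IDF variant (term frequency times log(N/df)), the reading of "probability of words" as a word's relative frequency within its document, the Euclidean distance, and the choice of the first k rows as initial centroids are all assumptions made for the example.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; the paper's list is not given).
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "in", "on", "to"}


def crude_stem(token):
    """Hypothetical suffix stripper standing in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, tokenize, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


def extract_features(docs):
    """Return (vocab, word_count, tf_idf, probability); each matrix is a list of rows."""
    bags = [Counter(preprocess(d)) for d in docs]  # bag of words per document
    vocab = sorted(set().union(*bags))
    n_docs = len(docs)
    doc_freq = {w: sum(1 for b in bags if w in b) for w in vocab}
    word_count, tf_idf, probability = [], [], []
    for bag in bags:
        total = sum(bag.values()) or 1  # guard against empty documents
        counts = [bag[w] for w in vocab]
        word_count.append(counts)
        # Assumed reading of "probability": relative frequency within the document.
        probability.append([c / total for c in counts])
        # Assumed TF-IDF variant: (count / total) * log(N / df).
        tf_idf.append(
            [(c / total) * math.log(n_docs / doc_freq[w]) for c, w in zip(counts, vocab)]
        )
    return vocab, word_count, tf_idf, probability


def euclidean(a, b):
    """Euclidean distance, used here as the (assumed) dissimilarity measure."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_means(rows, k, iterations=20):
    """Plain Lloyd-style K-Means; the first k rows seed the centroids."""
    centroids = [list(r) for r in rows[:k]]
    labels = [0] * len(rows)
    for _ in range(iterations):
        # Assignment step: each row goes to its nearest centroid.
        labels = [min(range(k), key=lambda j: euclidean(r, centroids[j])) for r in rows]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


docs = [
    "Cats are chasing mice in the garden",
    "Stock markets and trading of shares",
    "The cat chased a mouse",
    "Shares traded on the stock market",
]
vocab, word_count, tf_idf, probability = extract_features(docs)
print(k_means(probability, k=2))
```

Any of the three feature matrices can be passed to `k_means` in place of `probability`, which is one way to compare how the features behave under the same clustering step. The toy corpus is far too small to say anything about which feature performs best.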

Last modified: 2019-11-11 17:45:25