A Similarity Measure for Documents Using Clustering Technique

Journal: International Journal of Computer Science and Mobile Computing - IJCSMC (Vol.7, No. 12)

Publication Date:

Authors : ; ; ; ;

Page : 239-248

Keywords : Clustering; Jaccard similarity; Cosine similarity; Euclidean measure; Correlation coefficient; K-means;

Abstract

Text clustering is an important application of data mining. It is concerned with grouping similar text documents together. Text document clustering plays a vital role in providing intuitive navigation and browsing mechanisms by organizing large amounts of information into a small number of meaningful clusters. A clustering method has to embed the documents in a suitable similarity space. In this paper we compare four popular similarity measures (cosine similarity, Jaccard similarity, Euclidean distance and the correlation coefficient) in conjunction with several types of vector space representation of documents (Boolean, term frequency and inverse document frequency). Clustering of documents is performed using generalized k-means, a partition-based clustering technique, on high-dimensional sparse data representing text documents. Performance is measured against a human-imposed classification into Topic and Place categories. We conducted a number of experiments and used an entropy measure to ensure the statistical significance of the results. Cosine similarity, Pearson correlation and Jaccard similarity emerge as the best measures for capturing human categorization behavior, while Euclidean measures perform poorly.
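
The abstract names the four measures without giving their formulas. As a minimal sketch, they could be computed on two document vectors as follows. The function names and the sample vectors are illustrative assumptions, not taken from the paper, and the Jaccard measure is written in its extended (Tanimoto) form for real-valued vectors, which is one common choice and may differ from the definition the authors used.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two term vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_similarity(a, b):
    """Extended (Tanimoto) Jaccard: dot / (|a|^2 + |b|^2 - dot).

    Assumption: this generalization is used here; for Boolean vectors it
    equals the usual set-based Jaccard coefficient.
    """
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return dot / denom if denom != 0 else 0.0


def euclidean_distance(a, b):
    """Straight-line distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson_correlation(a, b):
    """Pearson correlation coefficient (0.0 if either vector is constant)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


# Hypothetical Boolean vectors over a four-term vocabulary.
doc_a = [1, 1, 0, 1]
doc_b = [1, 0, 0, 1]

print("cosine   ", cosine_similarity(doc_a, doc_b))
print("jaccard  ", jaccard_similarity(doc_a, doc_b))
print("euclidean", euclidean_distance(doc_a, doc_b))
print("pearson  ", pearson_correlation(doc_a, doc_b))
```

Note that Euclidean distance is a dissimilarity, so it would need to be converted or minimized rather than maximized when plugged into a clustering routine; the same functions apply unchanged to term frequency or inverse document frequency weighted vectors.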

Last modified: 2018-12-30 18:17:10