Semantic Similarity based Web Document Classification Using Support Vector Machine

Journal: The International Arab Journal of Information Technology (Vol.14, No. 3)

Publication Date: 2017-05-01

Authors : Kavitha Chinniyan; Sudha Gangadharan; Kiruthika Sabanaikam

Page : 285-292

Keywords : Document classification; text mining; SVM; latent semantic indexing


Abstract

With the rapid growth of information on the World Wide Web (WWW), classification of web documents has become important for efficient information retrieval. The relevance of retrieved information can also be improved by considering the semantic relatedness between words, a basic research topic in natural language processing, intelligent retrieval, document clustering and classification, word sense disambiguation, and related fields. Semantic relationships derived through a web search engine from a huge web corpus can improve the classification of documents. This paper proposes an approach to web document classification that exploits information from both page counts and snippets. To identify the semantic relations between the query words, a lexical pattern extraction algorithm is applied to the snippets. A sequential pattern clustering algorithm is used to form clusters of different patterns. The page-count-based measures are combined with the clustered patterns to define the features extracted from the word pairs. These features are used to train a Support Vector Machine (SVM) to classify the web documents. Experimental results demonstrate improvements of 5% and 9% in the classifier's F1 measure on the Reuters-21578 and 20 Newsgroups datasets, respectively.
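The abstract does not say which page-count measures or which pattern format the authors use. As a minimal sketch, assuming the co-occurrence measures commonly computed from search-engine hit counts (Jaccard, Dice, overlap and pointwise mutual information) and a simple placeholder-based pattern extractor, feature extraction for one word pair might look like the following. All function names, the `max_gap` parameter and the index size `n` are hypothetical and chosen for illustration; in practice the page counts and snippets would come from a search engine.

```python
import math
import re


def web_jaccard(p, q, pq):
    """Jaccard coefficient from page counts: hits(P AND Q) / hits(P OR Q)."""
    denom = p + q - pq
    return pq / denom if denom > 0 else 0.0


def web_dice(p, q, pq):
    """Dice coefficient from page counts."""
    total = p + q
    return 2 * pq / total if total > 0 else 0.0


def web_overlap(p, q, pq):
    """Overlap coefficient: co-occurrence count over the rarer word's count."""
    smaller = min(p, q)
    return pq / smaller if smaller > 0 else 0.0


def web_pmi(p, q, pq, n):
    """Pointwise mutual information; n is an assumed total number of indexed pages."""
    if min(p, q, pq) <= 0:
        return 0.0
    return math.log2((pq / n) / ((p / n) * (q / n)))


def page_count_features(p, q, pq, n):
    """Collect the four page-count measures into one feature list."""
    return [web_jaccard(p, q, pq), web_dice(p, q, pq),
            web_overlap(p, q, pq), web_pmi(p, q, pq, n)]


def lexical_patterns(snippet, word_a, word_b, max_gap=4):
    """Replace the two query words with X and Y, then return each token
    sequence that runs from one placeholder to the next occurrence of the
    other, with at most max_gap tokens in between."""
    tokens = re.findall(r"\w+", snippet.lower())
    a, b = word_a.lower(), word_b.lower()
    marked = ["X" if t == a else "Y" if t == b else t for t in tokens]
    patterns = []
    for i, tok in enumerate(marked):
        if tok not in ("X", "Y"):
            continue
        for j in range(i + 1, min(i + max_gap + 2, len(marked))):
            if marked[j] in ("X", "Y"):
                # Keep the span only if it links the two different words.
                if marked[j] != tok:
                    patterns.append(" ".join(marked[i:j + 1]))
                break
    return patterns


# Hypothetical page counts for two words and their conjunction.
features = page_count_features(p=100, q=50, pq=25, n=10_000)
patterns = lexical_patterns("An ostrich is a large bird", "ostrich", "bird")
```

Feature vectors of this kind, extended with counts over the clustered patterns, could then be passed to any SVM implementation for training; the clustering step itself is not sketched here because the abstract gives no detail about it.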
