ResearchBib Share Your Research, Maximize Your Social Impacts
Sign for Notice Everyday Sign up >> Login

Semantic Similarity based Web Document Classification Using Support Vector Machine

Journal: The International Arab Journal of Information Technology (Vol.14, No. 3)

Publication Date:

Authors : ; ; ;

Page : 285-292

Keywords : Document classification; text mining; SVM; latent semantic indexing.;

Source : Downloadexternal Find it from : Google Scholarexternal

Abstract

With the rapid growth of information on the World Wide Web (WWW), classification of web documents has become important for efficient information retrieval. Relevancy of information retrieved can also be improved by considering semantic relatedness between words which is a basic research area in fields of natural language processing, intelligent retrieval, document clustering and classification, word sense disambiguation etc. The web search engine based semantic relationship from huge web corpus can improve classification of documents. This paper proposes an approach for web document classification that exploits information, including both page count and snippets. To identify the semantic relations between the query words, a lexical pattern extraction algorithm is applied on snippets. A sequential pattern clustering algorithm is used to form clusters of different patterns. The page count based measures are combined with the clustered patterns to define the features extracted from the word-pairs. These features are used to train the Support Vector Machine (SVM), in order to classify the web documents. Experimental results demonstrate 5% and 9% improvement in F1 measure for Reuters 21578 and 20 Newsgroup datasets in the classifier performance

Last modified: 2019-05-08 18:07:53