
A Methodology for Enhancing Template Extraction Accuracy of Heterogeneous Web Pages

Journal: INTERNATIONAL JOURNAL OF COMPUTERS & TECHNOLOGY (Vol.4, No. 2)

Publication Date:

Authors : ; ;

Page : 192-198

Keywords : Template Extraction; Clustering; MDL; Text-MAX; Text-Hash; Jaccard Coefficient; Dice Coefficient


Abstract

Today, websites contain a large number of pages generated by filling common templates with content. Because templates contain terms that are irrelevant to the page content, they degrade the accuracy of web applications. Template detection techniques have therefore received a lot of attention recently as a way to improve that accuracy. To extract templates from heterogeneous web pages, we use different algorithms to measure the similarity of the underlying structure of the documents, so that the pages are grouped into clusters and a template is extracted for each cluster. Earlier work used the Text-Hash and Text-MAX algorithms with the Jaccard coefficient, but that approach has high time and space costs. In this paper, we implement Text-Hash and Text-MAX with the Jaccard coefficient as well as the Dice coefficient. With the Dice coefficient, the time and space required are lower than with the Jaccard coefficient.
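
The two similarity measures named in the abstract can be illustrated with a minimal Python sketch. It assumes each web page is represented as a set of structural paths; the function names, the path strings, and the sample pages are hypothetical and are not taken from the paper, and the sketch does not reproduce the Text-Hash or Text-MAX algorithms themselves.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(a | b)


def dice(a: set, b: set) -> float:
    """Dice coefficient: 2 * |A ∩ B| / (|A| + |B|)."""
    if not a and not b:
        return 1.0  # same convention as above
    return 2 * len(a & b) / (len(a) + len(b))


# Hypothetical pages, each reduced to a set of structural paths (an assumption).
page_a = {"html/body/div/h1", "html/body/div/p", "html/body/ul/li", "html/body/footer"}
page_b = {"html/body/div/h1", "html/body/div/p", "html/body/table/tr", "html/body/footer"}

j = jaccard(page_a, page_b)  # 3 shared paths out of 5 distinct paths
d = dice(page_a, page_b)     # 2 * 3 shared paths over 4 + 4 total paths
print(f"Jaccard = {j:.2f}, Dice = {d:.2f}")
```

In this form the Dice coefficient needs only the intersection and the two set sizes, while the Jaccard coefficient also needs the union. Whether this accounts for the time and space difference reported in the abstract is not stated there. The two values are related by D = 2J / (1 + J), so they rank page pairs in the same order.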

Last modified: 2016-06-30 13:43:54