A Methodology for Enhancing Template Extraction Accuracy of Heterogeneous Web Pages
Journal: INTERNATIONAL JOURNAL OF COMPUTERS & TECHNOLOGY (Vol. 4, No. 2)
Publication Date: 2013-01-01
Authors: Vidya Kadam; Prakash R. Devale
Page: 192-198
Keywords: Template Extraction; Clustering; MDL; Text-MAX; Text-Hash; Jaccard Coefficient; Dice Coefficient
Abstract
Today, websites contain large numbers of pages generated by filling common templates with content. The irrelevant terms in these templates degrade the accuracy of web applications, so template detection techniques have recently received a lot of attention. To extract templates from heterogeneous web pages, we use different algorithms to measure the similarity of the underlying structure of the documents, so that the pages can be clustered and templates extracted from the clusters. Earlier work used the Text-Hash and Text-Max algorithms with the Jaccard coefficient, but that approach consumes considerable time and space. In this paper, we implement Text-Hash and Text-Max with the Jaccard coefficient as well as the Dice coefficient. The Dice coefficient requires less time and space than the Jaccard coefficient.
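To make the two similarity measures named in the abstract concrete, the following is a minimal sketch of the standard Jaccard and Dice coefficients applied to two web pages. It is not the authors' implementation. It assumes, for illustration only, that each page is represented as a set of structural path strings; the function names `jaccard` and `dice` and the sample pages `page_a` and `page_b` are hypothetical.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient: |A intersect B| / |A union B|."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(a | b)


def dice(a: set, b: set) -> float:
    """Dice coefficient: 2 * |A intersect B| / (|A| + |B|)."""
    if not a and not b:
        return 1.0  # same convention as above
    return 2 * len(a & b) / (len(a) + len(b))


# Hypothetical pages, each reduced to a set of structural paths (an assumption).
page_a = {"html/body/div", "html/body/div/h1", "html/body/div/p", "html/body/footer"}
page_b = {"html/body/div", "html/body/div/h1", "html/body/table", "html/body/footer"}

j = jaccard(page_a, page_b)  # 3 shared paths out of 5 distinct paths
d = dice(page_a, page_b)     # 2 * 3 shared paths over 4 + 4 paths
print(f"Jaccard = {j:.2f}, Dice = {d:.2f}")
```

Note that Dice needs only the intersection size and the two set sizes, whereas Jaccard as written also builds the union. The two measures are related by D = 2J / (1 + J), so they rank page pairs in the same order.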