Web Content Mining Based on Dom Intersection and Visual Features Concept
Journal: International Journal of Computational Engineering Research (IJCER) (Vol. 5, No. 10)
Publication Date: 2015-10-18
Authors: Shaikh Phiroj Chhaware; Dr. Mohammad Atique; Latesh. G. Malik
Pages: 13-20
Keywords: Document Object Model; Web Data Extraction; Visual Features; Template Detection; Webpage Intersection; Data Regions; Data Reco;
Abstract
Extracting structured data from deep Web pages is a challenging task because of the complex underlying structure of such pages, and because website developers generally follow different page design techniques. Data extracted from webpages is highly useful for building one's own database for a number of applications. A large number of techniques have been proposed to address this problem, but each has inherent limitations, since each imposes its own constraints on extracting data from such webpages. This paper presents two different approaches to structured data extraction. The first approach is a non-generic solution based on template detection, using the intersection of the Document Object Model (DOM) trees of various webpages from the same website. This approach gives better results in terms of efficiency and of accurately locating the main data on a particular webpage. The second approach is a partial tree alignment mechanism that uses important visual features such as the length, size, and position of the web tables available on the webpages. This approach is a generic solution, as it does not depend on one particular website or its webpage template. It precisely locates the multiple data regions, data records, and data items within a given web page. We have compared the results of our work with existing mechanisms and found our results to be much better for a number of webpages.
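The first approach, template detection by intersecting the DOM trees of pages from the same website, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no algorithmic detail, so the sketch assumes that a node is identified by its tag path plus its text, that the template is whatever appears identically in every page, and that the main data is what remains. The names `PathCollector`, `detect_template`, `extract_main_data`, and the two sample pages are hypothetical.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the path stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class PathCollector(HTMLParser):
    """Collects a (tag path, text) pair for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.nodes = []   # (path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag, tolerating sloppy markup.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:
            self.nodes.append(("/".join(self.stack), text))


def text_nodes(html):
    parser = PathCollector()
    parser.feed(html)
    parser.close()
    return parser.nodes


def detect_template(pages):
    """Assumption: the template is the set of nodes shared by all pages."""
    return set.intersection(*(set(text_nodes(p)) for p in pages))


def extract_main_data(page, template):
    """Keep only the nodes of this page that are not part of the template."""
    return [node for node in text_nodes(page) if node not in template]


# Two hypothetical pages from the same site: same header and footer, different content.
page_a = ("<html><body><div><h1>Shop</h1></div>"
          "<div><p>Red kettle</p><p>$20</p></div>"
          "<div><span>Contact us</span></div></body></html>")
page_b = ("<html><body><div><h1>Shop</h1></div>"
          "<div><p>Blue mug</p><p>$8</p></div>"
          "<div><span>Contact us</span></div></body></html>")

template = detect_template([page_a, page_b])
main_a = extract_main_data(page_a, template)
```

Here `template` holds the shared header and footer nodes, and `main_a` holds the product name and price of the first page. A sketch like this only works for pages built from one template, which matches the abstract's description of the first approach as non-generic.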
Last modified: 2015-11-18 20:15:54