Web Content Mining Based on Dom Intersection and Visual Features Concept

Journal: International Journal of Computational Engineering Research (IJCER) (Vol. 5, No. 10)

Publication Date: 2015-10-18

Authors : Shaikh Phiroj Chhaware; Dr. Mohammad Atique; Latesh G. Malik

Page : 13-20

Keywords : Document Object Model; Web Data Extraction; Visual Features; Template Detection; Webpage Intersection; Data Regions; Data Reco;


Abstract

Structured data extraction from deep Web pages is a challenging task because of the complex underlying structure of such pages and because website developers generally follow different web page design techniques. Extracting data from webpages is highly useful for building our own database for a number of applications. A large number of techniques have been proposed to address this problem, but each of them has inherent limitations and constraints when extracting data from such webpages. This paper presents two different approaches to structured data extraction. The first approach is a non-generic solution based on template detection using the intersection of the Document Object Model (DOM) trees of various webpages from the same website. This approach gives better results in terms of efficiency and accurately locates the main data on a particular webpage. The second approach is based on a partial tree alignment mechanism that uses important visual features such as the length, size, and position of the web tables available on the webpages. This approach is a generic solution, as it does not depend on one particular website or its webpage template. It locates multiple data regions, data records, and data items within a given web page. We compared our results with those of existing mechanisms and found ours to be much better for a number of webpages.
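
The first approach (template detection by intersecting the DOM trees of pages from the same website) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a page can be approximated as a set of (tag path, text) pairs, that nodes shared by every page form the template, and that whatever remains on a page is its main data. The names `PathTextCollector`, `page_nodes`, `template_nodes`, and `main_data`, as well as the two sample pages, are hypothetical.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class PathTextCollector(HTMLParser):
    """Collects a (tag path, text) pair for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.nodes = []   # (path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy markup.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append(("/".join(self.stack), text))


def page_nodes(html):
    """Return the (path, text) pairs of one page (a simplified DOM view)."""
    parser = PathTextCollector()
    parser.feed(html)
    parser.close()
    return parser.nodes


def template_nodes(pages):
    """Assumption: nodes present on every page belong to the site template."""
    return set.intersection(*(set(page_nodes(p)) for p in pages))


def main_data(page, template):
    """Nodes of a page that are not part of the shared template."""
    return [node for node in page_nodes(page) if node not in template]


# Two hypothetical pages from the same site, differing only in the product.
PAGE_A = ("<html><body><div>Home</div>"
          "<div><h1>Red Lamp</h1><p>$20</p></div>"
          "<div>Copyright</div></body></html>")
PAGE_B = ("<html><body><div>Home</div>"
          "<div><h1>Blue Chair</h1><p>$45</p></div>"
          "<div>Copyright</div></body></html>")

template = template_nodes([PAGE_A, PAGE_B])
print(main_data(PAGE_A, template))
```

Comparing text together with its tag path is a deliberate simplification: a real system would likely compare subtree structure as well, since template text can vary slightly between pages (dates, counters) while the surrounding markup stays fixed.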
