Discovering Informative Blocks from Web Pages for Efficient Information Extraction using DOM tree

Journal: International Journal of Computational and Electronic Aspects in Engineering (Vol.1, No. 2)

Publication Date: 2015-03-30

Authors : Rakesh M. Kohale; Shreyash G. Balbudhe;

Page : 23-25

Keywords : DOM tree; Site Style Tree; Tokens; Parsing; Informative blocks; Non-informative blocks;

Abstract

A web page generally contains its main data along with navigation panels, advertisements, and copyright and privacy notices. Apart from the main data, these elements carry no important information and can be called non-informative blocks. Because they are non-informative, they can distort the results of web data mining. To avoid this, it is important to separate the main data, i.e. the informative blocks, from the non-informative blocks of a web page. Within a website, non-informative blocks generally appear on many different pages with the same format and the same content. Informative blocks, in contrast, differ from page to page in both content and format. We therefore need a site-level structure that captures the shared format of the blocks and the data they contain. The DOM tree is a page-level structure, and many tools are available to construct the DOM tree of a web page, but it is not useful at site level. We therefore need to construct a Site Style Tree (SST) for a website. By analyzing this SST we can identify which parts of it are informative and which are non-informative. No tool is available to construct a style tree for a given website. This work aims at constructing a style tree for a given website and separating its informative and non-informative blocks.
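
The site-level idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' Site Style Tree construction, which the abstract does not detail. It is a simplified stand-in that assumes each top-level child of <body> is one block, identifies a block by its position and tag, and treats a block as non-informative when it appears on every page with identical text. The names BlockCollector, page_blocks and classify_blocks, as well as the sample pages, are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags that never have a closing tag; they are ignored for depth tracking.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class BlockCollector(HTMLParser):
    """Collects the text under each top-level child of <body> (assumed block unit)."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.depth = 0      # nesting depth below <body>
        self.blocks = []    # list of (tag, [text pieces])

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if tag == "body":
            self.in_body = True
        elif self.in_body:
            if self.depth == 0:
                self.blocks.append((tag, []))  # a new top-level block starts
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if tag == "body":
            self.in_body = False
        elif self.in_body and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.in_body and self.depth > 0 and text:
            self.blocks[-1][1].append(text)


def page_blocks(html):
    """Return [(position, tag, text)] for the top-level blocks of one page."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    return [(i, tag, " ".join(parts)) for i, (tag, parts) in enumerate(parser.blocks)]


def classify_blocks(pages):
    """Label each (position, tag) block using all pages of the site."""
    contents = defaultdict(list)  # block key -> text seen on each page
    for html in pages:
        for pos, tag, text in page_blocks(html):
            contents[(pos, tag)].append(text)
    labels = {}
    for key, texts in contents.items():
        # Assumption: same text on every page means shared, non-informative content.
        shared = len(pages) > 1 and len(texts) == len(pages) and len(set(texts)) == 1
        labels[key] = "non-informative" if shared else "informative"
    return labels


# Hypothetical two-page site: the menu and footer repeat, the article differs.
SAMPLE_PAGES = [
    "<html><body><div>Home | About</div><p>Article one text</p>"
    "<div>Copyright 2015</div></body></html>",
    "<html><body><div>Home | About</div><p>Article two text</p>"
    "<div>Copyright 2015</div></body></html>",
]

if __name__ == "__main__":
    for key, label in sorted(classify_blocks(SAMPLE_PAGES).items()):
        print(key, label)
```

A real style tree would merge the full DOM trees of all pages and compare presentation styles at every level, so this flat, position-based comparison should be read only as an illustration of why page-level structure alone is not enough.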
