Improving the Performance of Crawler Using Body Text Normalization

Journal: COMPUSOFT 'An International Journal of Advanced Computer Technology' (Vol.2, No. 7)

Publication Date:

Authors : ;

Page : 215-220

Keywords : Crawler with URL normalization; Crawler with whole page content MD5; Crawler with body text normalization


A search engine comprises components such as a crawler, a repository, indexing, querying, and ranking. The crawler's job is to crawl the web and download pages, which are then stored in the repository. The crawler should be smart enough to tell which pages it has already crawled and which it has not. Here we propose a mechanism that avoids downloading duplicate page content and also avoids unnecessary URL extraction time. To achieve this, we compute an MD5 digest of the body text of every page.
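
The idea of digesting body text can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not describe the normalization steps, so the ones used here (keeping only text inside `<body>`, skipping `<script>` and `<style>`, lowercasing, collapsing whitespace) are assumptions, and the names `BodyTextExtractor`, `body_text_digest`, and `is_duplicate` are hypothetical.

```python
import hashlib
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collects text found inside <body>, skipping <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.in_body and self.skip_depth == 0:
            self.chunks.append(data)


def body_text_digest(html):
    """Return the MD5 hex digest of the normalized body text of a page."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    # Assumed normalization: collapse all whitespace runs and lowercase.
    text = " ".join(" ".join(parser.chunks).split()).lower()
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def is_duplicate(html, seen_digests):
    """Report whether this body text was seen before; record it if not."""
    digest = body_text_digest(html)
    if digest in seen_digests:
        return True
    seen_digests.add(digest)
    return False
```

Under these assumptions, two pages whose markup or `<head>` differ but whose body text is the same produce the same digest, so the second one can be skipped before its URLs are extracted. A digest over the whole page content would treat them as different pages.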

Last modified: 2013-08-11 21:01:32