Improving the Performance of Crawler Using Body Text Normalization
Journal: COMPUSOFT 'An International Journal of Advanced Computer Technology' (Vol. 2, No. 7)
Publication Date: 2013-06-26
Authors: Farha Qureshi
Pages: 215-220
Keywords: Crawler with URL normalization; Crawler with whole page content MD5; Crawler with body text normalization
Abstract
A search engine comprises components such as a crawler, repository, indexing, querying, and ranking. The crawler's job is to crawl the web and download pages, which are then stored in the repository. The crawler should be smart enough to tell which pages it has already crawled and which it has not. Here we propose a mechanism that avoids downloading duplicate page content and also avoids spending time on unnecessary URL extraction. To achieve this, we compute an MD5 digest of the body text of every page.
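
The idea of fingerprinting body text can be illustrated with a minimal Python sketch. The abstract does not state the paper's exact normalization steps, so the ones used here (keeping only text inside <body>, skipping <script> and <style>, collapsing whitespace, lowercasing) are assumptions, and the names BodyTextExtractor, body_text_digest, is_duplicate, and seen_digests are hypothetical, chosen only for illustration.

```python
import hashlib
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collects text found inside <body>, skipping <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip_depth = 0  # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.in_body and self.skip_depth == 0:
            self.chunks.append(data)


def body_text_digest(html):
    """Return the MD5 hex digest of the normalized body text of a page."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    # Assumed normalization: collapse whitespace runs and lowercase.
    text = " ".join(" ".join(parser.chunks).split()).lower()
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Digests of pages already stored in the repository (in-memory for the sketch).
seen_digests = set()


def is_duplicate(html):
    """Return True if a page with the same normalized body text was seen."""
    digest = body_text_digest(html)
    if digest in seen_digests:
        return True
    seen_digests.add(digest)
    return False
```

In a crawler loop, a page for which is_duplicate returns True could be dropped before storage and before link extraction, which is where the saving in URL extraction time described above would come from. Because the digest ignores markup and the <head> section, two pages that differ only in title, tags, or whitespace map to the same fingerprint under these assumptions.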
Last modified: 2013-08-11 21:01:32