A survey on Design and Implementation of Clever Crawler Based On DUST Removal
Journal: International Journal for Scientific Research and Development | IJSRD (Vol.3, No. 10)
Publication Date: 2016-01-01
Authors : Kanchan S. Khedkar; P. L. Ramteke;
Page : 744-746
Keywords : DUSTER; web crawling; Dust Buster;
Nowadays, the World Wide Web has become a popular medium for searching for information, doing business, trading, and so on. A well-known problem faced by web crawlers is the existence of a large fraction of distinct URLs that correspond to pages with duplicate or near-duplicate content. In fact, an estimated 29% of web pages are duplicates. Such URLs, commonly called DUST, are an important problem for search engines. To deal with this problem, the first efforts focused on comparing document content, whereas more recent approaches aim to detect and remove duplicate documents without fetching their contents. To accomplish this, the proposed methods learn normalization rules that transform all duplicate URLs into the same canonical form. A challenging aspect of this strategy is deriving a set of rules that are both general and precise. DUSTER is a new approach to detecting and eliminating redundant content. When crawling the web, DUSTER takes advantage of a multi-sequence alignment strategy to learn rewriting rules that transform a URL into other URLs likely to have the same content. This alignment strategy is reported to yield a 54% larger reduction in the number of duplicate URLs.
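The idea of rewriting duplicate URLs into one canonical form before fetching can be illustrated with a minimal Python sketch. The rules below (dropping a `www.` prefix, a trailing `index.html`, and a few session-style parameters) are hand-written assumptions chosen for illustration; they are not rules learned by DUSTER, and the names `IGNORED_PARAMS`, `canonicalize`, and `dedupe` are hypothetical.

```python
import re
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical parameters assumed not to affect page content.
IGNORED_PARAMS = {"sid", "sessionid", "utm_source"}


def canonicalize(url: str) -> str:
    """Apply a few illustrative normalization rules to a URL."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]  # assumed rule: www.example.com == example.com
    # Assumed rule: /dir/index.html and /dir/ serve the same page.
    path = re.sub(r"/index\.html?$", "/", parts.path) or "/"
    # Drop ignored parameters and sort the rest so order does not matter.
    query = sorted(
        (k, v) for k, v in parse_qsl(parts.query)
        if k.lower() not in IGNORED_PARAMS
    )
    return urlunsplit((parts.scheme.lower(), host, path, urlencode(query), ""))


def dedupe(urls):
    """Keep the first URL seen for each canonical form, without fetching."""
    seen = {}
    for url in urls:
        seen.setdefault(canonicalize(url), url)
    return list(seen.values())


if __name__ == "__main__":
    frontier = [
        "http://WWW.Example.com/news/index.html?sid=42&b=2&a=1",
        "http://example.com/news/?a=1&b=2",
    ]
    print(dedupe(frontier))
```

A crawler could call `dedupe` on its frontier so that only one URL per canonical form is fetched. In a learned approach such as the one surveyed here, the rules would be derived from groups of known duplicate URLs rather than written by hand.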
Last modified: 2016-01-09 20:17:12