The Study of Detecting Replicate Documents Using MD5 Hash Function
Journal: International Journal of Advanced Computer Research (IJACR), Vol. 1, No. 2
Publication Date: 2011-12-24
Authors: Pushpendra Singh Tomar; Maneesh Shreevastava
Pages: 14-17
Keywords: Unique documents; detecting replicate; replication; search engine
Abstract
A great deal of the Web is replicate or near-replicate content. Documents may be served in different formats, such as HTML, PDF, and plain text, for different audiences. Documents may also be mirrored to avoid delays or to provide fault tolerance. Algorithms for detecting replicate documents are critical in applications where data is obtained from multiple sources. The removal of replicate documents is necessary not only to reduce runtime but also to improve search accuracy. Today, search engine crawlers retrieve billions of unique URLs, of which hundreds of millions are replicates of some form. Quickly identifying replicates therefore expedites indexing and searching. One vendor's analysis of 1.2 billion URLs found 400 million exact replicates using an MD5 hash. Reducing collection sizes by tens of percentage points yields great savings in indexing time and a reduction in the amount of hardware required to support the system. Last, and probably most significant, users benefit from the elimination of replicate results: by efficiently presenting only unique documents, user satisfaction is likely to increase.
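The exact-replicate detection described in the abstract amounts to computing an MD5 digest of each document and grouping documents whose digests match. The following is a minimal sketch of that idea, assuming documents are available as local files; the directory name, function names, and chunk size are illustrative assumptions, not details from the paper.

```python
# Minimal sketch: exact-replicate detection by grouping documents on MD5 digests.
# Paths and function names are hypothetical, chosen only for illustration.
import hashlib
from collections import defaultdict
from pathlib import Path


def md5_of_file(path: Path) -> str:
    """Return the hex MD5 digest of a file's contents, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_replicates(doc_dir: str) -> dict[str, list[Path]]:
    """Group documents by MD5 digest; any group with more than one
    member consists of exact replicates of the same content."""
    groups: dict[str, list[Path]] = defaultdict(list)
    for path in Path(doc_dir).rglob("*"):
        if path.is_file():
            groups[md5_of_file(path)].append(path)
    return {h: paths for h, paths in groups.items() if len(paths) > 1}


if __name__ == "__main__":
    # "./crawled_docs" is an assumed location for fetched documents.
    for digest, paths in find_exact_replicates("./crawled_docs").items():
        print(digest, [str(p) for p in paths])
```

Note that hashing catches only byte-identical copies; near-replicates (e.g., the same article re-rendered from HTML to PDF) would need similarity-based techniques beyond this sketch.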