The Study of Detecting Replicate Documents Using MD5 Hash Function
Journal: International Journal of Advanced Computer Research (IJACR), Vol. 1, No. 2
Publication Date: 2011-12-24
Authors: Pushpendra Singh Tomar; Maneesh Shreevastava
Pages: 14-17
Keywords: Unique documents; detecting replicate; replication; search engine
Abstract
A great deal of the Web is replicate or near-replicate content. Documents may be served in different formats, such as HTML, PDF, and plain text, for different audiences. Documents may also be mirrored to avoid delays or to provide fault tolerance. Algorithms for detecting replicate documents are critical in applications where data is obtained from multiple sources. The removal of replicate documents is necessary not only to reduce runtime but also to improve search accuracy. Today, search engine crawlers retrieve billions of unique URLs, of which hundreds of millions are replicates of some form. Quickly identifying replicates therefore expedites indexing and searching. One vendor's analysis of 1.2 billion URLs found 400 million exact replicates using an MD5 hash. Reducing collection sizes by tens of percentage points yields great savings in indexing time and a reduction in the amount of hardware required to support the system. Last, and probably most significant, users benefit from the elimination of replicate results: by efficiently presenting only unique documents, user satisfaction is likely to increase.
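The exact-replicate detection described in the abstract amounts to computing an MD5 digest of each document and grouping documents whose digests match. The following is a minimal sketch of that idea, assuming documents are available as local files; the directory name, function names, and chunk size are illustrative assumptions, not details from the paper.

```python
# Minimal sketch: exact-replicate detection by grouping documents on MD5 digests.
# Paths and function names are hypothetical, chosen only for illustration.
import hashlib
from collections import defaultdict
from pathlib import Path


def md5_of_file(path: Path) -> str:
    """Return the hex MD5 digest of a file's contents, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_replicates(doc_dir: str) -> dict[str, list[Path]]:
    """Group documents by MD5 digest; any group with more than one
    member consists of exact replicates of the same content."""
    groups: dict[str, list[Path]] = defaultdict(list)
    for path in Path(doc_dir).rglob("*"):
        if path.is_file():
            groups[md5_of_file(path)].append(path)
    return {h: paths for h, paths in groups.items() if len(paths) > 1}


if __name__ == "__main__":
    # "./crawled_docs" is an assumed location for fetched documents.
    for digest, paths in find_exact_replicates("./crawled_docs").items():
        print(digest, [str(p) for p in paths])
```

Note that hashing catches only byte-identical copies; near-replicates (e.g., the same article re-rendered from HTML to PDF) would need similarity-based techniques beyond this sketch.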