Detection of Duplicate and Near-Duplicate Content for Web Crawlers

Journal: Journal of Independent Studies and Research - Computing (Vol.13, No. 2)

Publication Date:

Authors : ;

Page : 30-38

Keywords : ;

Abstract

The internet contains an abundance of duplicated web documents. Two online documents may be nearly identical except for a very small portion, such as URLs and advertisements. Although such differences are unimportant for web search, the resulting duplicates degrade the quality of search results. Therefore, if a web crawler could check the duplication percentage of each newly crawled page against previously crawled pages, the quality of web search would increase significantly. The main objective of this research is to propose a method that checks the duplication ratio between the content of a page and that of pages already crawled. The solution runs as part of the web crawling algorithm, so the duplication ratio is calculated at crawl time. To achieve this, Charikar's SIMHASH fingerprinting technique is used. Building on it, a new method for detecting exact and near duplicates is devised that checks the duplication ratio of each newly crawled page. The experiment was carried out on multiple pages of two major B2B websites, namely Alibaba and TradeKey. More than 300 pages from two similar categories on each portal were selected. The selected pages were first evaluated with a third-party duplication detection tool to set the benchmark. The results obtained from the test were very promising and close to the benchmark, and the system running time was very short. However, the results show an average curve variation of 10% from the benchmark, which is acceptable in this case. Based on the results of the experiment, it can be said that Charikar's SIMHASH fingerprinting technique can be used effectively to detect duplication and near duplication.
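To make the fingerprinting idea concrete, the following is a minimal sketch of a SimHash-style comparison between two page texts. It is not the authors' implementation: the abstract does not specify the features, weights, hash function, or fingerprint size, so the word-level tokens, term-frequency weights, MD5-derived 64-bit hashes, and the `duplication_ratio` helper (one minus the normalized Hamming distance) are all assumptions made for illustration.

```python
import hashlib
import re
from collections import Counter

BITS = 64  # assumed fingerprint width; the abstract does not state one


def tokens(text):
    """Lowercased word tokens (an assumed feature choice)."""
    return re.findall(r"\w+", text.lower())


def simhash(text, bits=BITS):
    """SimHash-style fingerprint: weighted per-bit vote over token hashes."""
    weights = Counter(tokens(text))  # term frequency as the feature weight
    totals = [0] * bits
    for token, weight in weights.items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # A set bit votes +weight, a clear bit votes -weight.
            totals[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def duplication_ratio(text_a, text_b, bits=BITS):
    """Hypothetical similarity score in [0, 1]; 1.0 means identical fingerprints."""
    distance = hamming_distance(simhash(text_a, bits), simhash(text_b, bits))
    return 1.0 - distance / bits
```

In a crawler, the fingerprint of each newly fetched page would be compared against stored fingerprints of earlier pages, and a page would be flagged as a near duplicate when the ratio exceeds a chosen threshold. The threshold is a tuning parameter and is not given in the abstract.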

Last modified: 2018-07-17 00:53:25