STRATEGIES FOR AUTOMATIC DETERMINATION OF SIMILARITY THRESHOLD FOR GENRE-AWARE FOCUSED CRAWLING PROCESSES

Journal: IADIS INTERNATIONAL JOURNAL ON WWW/INTERNET (Vol.15, No. 1)

Publication Date:

Authors : ; ;

Page : 15-30

Keywords : Similarity threshold; web crawling; focused crawling

Abstract

The popularity and, especially, the rapid growth of the Web have led to the proposal and analysis of new techniques that help users locate the information they need effectively, within a satisfactory time and without much difficulty. Traditional crawlers cannot identify the sub-spaces of the Web that relate to a specific theme, whereas focused crawlers can do so effectively and efficiently. A focused crawling process usually requires a specific value, called the similarity threshold, to determine whether a crawled Web page is relevant to a topic of interest; this value differs for each topic. In this work, we propose three strategies for automatically determining this value for focused crawlers that follow a genre-aware approach. In our experimental evaluation, the best result was 100% precision and 98% F1 for a specific crawling process whose similarity threshold was determined automatically, a strong result compared with the baseline.
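
To make the role of the similarity threshold concrete, the following is a minimal sketch of a threshold-based relevance check and of the precision and F1 measures mentioned in the abstract. It is not the method proposed in the paper: the abstract does not describe the three strategies or the similarity measure, so the use of cosine similarity over raw term counts, the function names (`cosine_similarity`, `is_relevant`, `precision_recall_f1`), and the example terms are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_relevant(page_text: str, topic_terms: list[str], threshold: float) -> bool:
    """Assumed decision rule: a page is relevant if its similarity to the
    topic description reaches the threshold (hypothetical helper)."""
    page_vec = Counter(page_text.lower().split())
    topic_vec = Counter(t.lower() for t in topic_terms)
    return cosine_similarity(page_vec, topic_vec) >= threshold


def precision_recall_f1(predicted: list[bool], actual: list[bool]):
    """Standard precision, recall, and F1 over boolean relevance labels."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Illustrative topic terms only; the same page passes or fails
# depending on the threshold chosen.
topic = ["course", "syllabus"]
page = "course syllabus course schedule"
print(is_relevant(page, topic, threshold=0.5))
print(is_relevant(page, topic, threshold=0.9))
```

The sketch shows why the threshold matters: the same page is accepted under one value and rejected under another, so a threshold chosen poorly for a topic directly affects precision and F1. Choosing that value automatically is the problem the paper addresses.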

Last modified: 2019-12-13 21:43:08