STRATEGIES FOR AUTOMATIC DETERMINATION OF SIMILARITY THRESHOLD FOR GENRE-AWARE FOCUSED CRAWLING PROCESSES
Journal: IADIS INTERNATIONAL JOURNAL ON WWW/INTERNET (Vol.15, No. 1)
Publication Date: 2017-07-01
Authors : Gustavo Oliveira de Siqueira; Guilherme Tavares de Assis; Anderson Almeida Ferreira; Amanda Sávio Nascimento e Silva; Vítor Mangaravite; Flávio Luis Cardeal Pádua
Page : 15-30
Keywords : Similarity threshold; web crawling; focused crawling
Abstract
The great popularity and, especially, the rapid growth of the Web have led to the proposal and analysis of new techniques that help users locate the information they need effectively and within a satisfactory time. Traditional crawlers cannot identify the sub-spaces of the Web that are relevant to a specific theme, whereas focused crawlers can address this problem effectively and efficiently. A focused crawling process usually requires a specific value, called the similarity threshold, to determine whether a crawled Web page is relevant to a topic of interest; this value differs from topic to topic. In this work, we propose three strategies for determining this value automatically for focused crawlers that follow a genre-aware approach. In our experimental evaluation, the best result was 100% precision and 98% F1 for a specific crawling process whose similarity threshold was determined automatically, a strong result compared with the baseline.
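To make the role of the similarity threshold concrete, the following is a minimal sketch of a threshold-based relevance test and of the precision and F1 measures mentioned above. It is not the authors' method: the abstract does not describe the three proposed strategies, so the threshold here is simply passed in. The use of cosine similarity over bag-of-words counts and the names `cosine_similarity`, `is_relevant`, and `precision_recall_f1` are assumptions made for illustration only.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two bags of words (Counter objects)."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_relevant(page_text, topic_terms, threshold):
    """Hypothetical relevance test: a page is relevant when its similarity
    to the topic terms reaches the similarity threshold."""
    page = Counter(page_text.lower().split())
    topic = Counter(term.lower() for term in topic_terms)
    return cosine_similarity(page, topic) >= threshold


def precision_recall_f1(predicted, actual):
    """Precision, recall, and F1 for parallel lists of boolean labels."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

In this sketch, raising the threshold tends to trade recall for precision and lowering it does the opposite, which is why a suitable value depends on the topic. Since F1 = 2PR / (P + R), the reported 100% precision and 98% F1 correspond to a recall of roughly 96%.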