Analyzing the Impact of Preprocessing Techniques on Arabic Document Classification: Comparative Study

Journal: International Journal of Advanced Trends in Computer Science and Engineering (IJATCSE) (Vol.13, No. 6)

Page : 228-240

Keywords : Natural Language Processing; Machine Learning; Preprocessing; Document Classification; Naïve Bayes; KNN; SVM; Tokenization; Normalization; Stop Words; Stemming

Abstract

Text classification is an important field with applications in data mining, information retrieval, and machine learning. Document classification is now widely used in domains such as email spam filtering, article indexing, Web searching, and Web page categorization. Much research addresses document classification for English, but comparatively little addresses Arabic, even though a large community worldwide uses the language. This paper analyzes the effect on document classification of the following preprocessing configurations: (1) tokenization, normalization, and stop-word removal; (2) stemming with the Khoja stemmer; (3) stemming with a light stemmer; (4) Khoja stemming combined with tokenization, normalization, and stop-word removal; and (5) light stemming combined with tokenization, normalization, and stop-word removal. The study applies three classification algorithms, Naïve Bayes (NB), K-Nearest Neighbor (KNN), and Support Vector Machine (SVM), to an online Arabic corpus prepared by Diab Abuaiadh. The RapidMiner tool is used to build the three classification models. Classification of the documents before preprocessing is compared with classification after preprocessing to determine how much preprocessing affects the results. The results show variation between classification before and after preprocessing, and differences among the three algorithms in terms of Accuracy, Precision, Recall, and F1-Score, which are discussed in the paper.
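
To make the first preprocessing configuration concrete, the following Python sketch chains normalization, tokenization, and stop-word removal. It is an illustration only: the study itself uses RapidMiner, and the normalization rules, the token pattern, the five-word stop list, and the function names `normalize`, `tokenize`, and `preprocess` are all assumptions based on common Arabic NLP practice, not details taken from the paper. Stemming with the Khoja or light stemmer is not shown.

```python
import re

# Assumed normalization rules (common practice, not taken from the paper).
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")    # tashkeel marks and tatweel
ALEF_VARIANTS = re.compile(r"[\u0622\u0623\u0625]")  # alef with madda / hamza
TOKEN = re.compile(r"[\u0621-\u064A]+")              # runs of Arabic letters


def normalize(text):
    """Strip diacritics and tatweel, then unify common letter variants."""
    text = DIACRITICS.sub("", text)
    text = ALEF_VARIANTS.sub("\u0627", text)   # all alef forms -> bare alef
    text = text.replace("\u0649", "\u064A")    # alef maqsura -> ya
    text = text.replace("\u0629", "\u0647")    # ta marbuta -> ha
    return text


def tokenize(text):
    """Split text into tokens made of Arabic letters only."""
    return TOKEN.findall(text)


# Tiny illustrative stop list (hypothetical); a real list is far longer.
# Entries are normalized so they match tokens produced after normalize().
STOP_WORDS = {normalize(w) for w in ("في", "من", "على", "إلى", "عن")}


def preprocess(text):
    """Normalize, tokenize, and drop stop words."""
    tokens = tokenize(normalize(text))
    return [t for t in tokens if t not in STOP_WORDS]
```

Normalizing the stop list with the same function as the text keeps the two consistent; otherwise a stop word such as "إلى" would no longer match its normalized token. The resulting token lists could then be turned into feature vectors for a classifier such as NB, KNN, or SVM.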

Last modified: 2024-12-13 14:41:53