Analyzing the Impact of Preprocessing Techniques on Arabic Document Classification: Comparative Study
Journal: International Journal of Advanced Trends in Computer Science and Engineering (IJATCSE) (Vol. 13, No. 6)
Publication Date: 2024-12-15
Authors: Mahmoud Moshref Khalid Khalis Ibrahim Bassam Hammo Derar Eleyan
Pages: 228-240
Keywords: Natural Language Processing; Machine Learning; Preprocessing; Document Classification; Naïve Bayes; KNN; SVM; Tokenization; Normalization; Stop Words; Stemming
Abstract
Text classification is an important field with applications in data mining, information retrieval, and machine learning. Document classification is now widely used in domains such as email spam filtering, article indexing, Web searching, and Web page categorization. Much research addresses document classification for English, but comparatively little addresses Arabic, despite the large community worldwide that uses the language. This paper analyzes the effect of preprocessing on document classification under five configurations: tokenization, normalization, and stop-word removal; stemming with the Khoja stemmer; stemming with a light stemmer; Khoja stemming combined with tokenization, normalization, and stop-word removal; and light stemming combined with tokenization, normalization, and stop-word removal. The study uses three classification algorithms, Naïve Bayes (NB), K-Nearest Neighbor (KNN), and Support Vector Machine (SVM), applied to an online Arabic corpus prepared by Diab Abuaiadh. The RapidMiner tool is used to apply the three classification models. Classification results on the documents before preprocessing are compared with results after preprocessing to determine the extent to which preprocessing affects document classification. The results show variation between classification before and after preprocessing, and differences among the three algorithms in terms of Accuracy, Precision, Recall, and F1-Score; these are discussed in the paper.
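As an illustration of the first preprocessing configuration named in the abstract (tokenization, normalization, and stop-word removal), the following is a minimal Python sketch. It is not the authors' RapidMiner pipeline: the normalization rules (stripping diacritics and tatweel, unifying alef variants, mapping alef maqsura to yaa and taa marbuta to haa) are common conventions assumed here, and the function names and the small stop-word list are hypothetical. Stemming is omitted.

```python
import re

# Arabic diacritics (U+064B..U+0652) and tatweel (U+0640).
_DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")
# Anything that is not an Arabic letter (U+0621..U+064A) or whitespace.
_NON_ARABIC = re.compile(r"[^\u0621-\u064A\s]")

# Tiny illustrative stop-word list (hypothetical, not the list used in the paper).
STOP_WORDS = {"في", "من", "على", "إلى", "عن", "هذا", "هذه"}


def normalize(text):
    """Apply assumed orthographic normalization rules to Arabic text."""
    text = _DIACRITICS.sub("", text)   # drop diacritics and tatweel
    text = re.sub("[أإآ]", "ا", text)  # unify alef variants to bare alef
    text = text.replace("ى", "ي")      # alef maqsura -> yaa
    text = text.replace("ة", "ه")      # taa marbuta -> haa
    return text


def tokenize(text):
    """Replace non-Arabic characters with spaces, then split on whitespace."""
    return _NON_ARABIC.sub(" ", text).split()


def preprocess(text, stop_words=STOP_WORDS):
    """Normalize, tokenize, and remove stop words; returns a list of tokens."""
    # Normalize the stop words too, so they match the normalized tokens.
    normalized_stops = {normalize(word) for word in stop_words}
    return [t for t in tokenize(normalize(text)) if t not in normalized_stops]
```

For example, `preprocess("ذهب الولد إلى المدرسة.")` yields three tokens: the stop word "إلى" and the period are removed, and "المدرسة" is normalized to "المدرسه". The resulting token lists could then be vectorized and passed to any NB, KNN, or SVM implementation.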
Last modified: 2024-12-13 14:41:53