AN EFFICIENT FEATURE EXTRACTION WITH SUBSET SELECTION MODEL USING MACHINE LEARNING TECHNIQUES FOR TAMIL DOCUMENTS CLASSIFICATION

Journal: International Journal of Advanced Research in Engineering and Technology (IJARET) (Vol.11, No. 11)

Publication Date: 2020-11-30

Authors : N. Rajkumar; T. S. Subashini; K. Rajan; V. Ramalingam

Page : 66-81

Keywords : Tamil Document classification; Machine learning; TF-IDF; Chi-Square;


Abstract

The growth of the internet has led to a significant rise in the number of electronic documents in several regional languages. As Tamil text data in digital form, both online and offline, grows rapidly, managing and retrieving these documents becomes a tedious process. Automatic text classification assigns predefined class labels to unlabeled text documents. Many natural language processing (NLP) techniques are extremely dependent on the automatic classification of Tamil text documents. Recent developments in machine learning (ML) algorithms help to achieve effective Tamil document classification. This paper therefore introduces an automated Tamil document classification technique using ML models. The presented model comprises four stages: preprocessing, feature extraction, feature selection, and classification. It uses the term frequency-inverse document frequency (TF-IDF) approach for feature extraction and the Chi-square test to select an optimal set of features. Finally, three ML models, namely random forest (RF), decision tree (DT), and gradient boosting tree (GBT), are applied to determine the class labels of the Tamil documents. To assess the performance of the presented model, a set of experiments is carried out on a Tamil document dataset collected by the authors. The experimental results indicate that the presented model classifies more effectively than the compared methods. In these experiments, the GBT model reached the highest accuracy of 85.10%, with a precision of 87.01%, recall of 85.10%, and F1-score of 85.52%.
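
The feature extraction and feature selection stages described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state which TF-IDF variant or Chi-square formulation was used, so the code assumes the textbook forms (relative term frequency times log(N/df), and the 2x2 term-versus-class Chi-square statistic). The function names, the toy corpus, and the class labels are hypothetical, and English placeholder tokens stand in for preprocessed Tamil tokens. The classification stage (RF, DT, GBT) is left out.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (docs are token lists).

    Assumed variant: tf = count / doc length, idf = log(N / df).
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({t: (c / len(doc)) * math.log(n / df[t])
                        for t, c in counts.items()})
    return weights

def chi_square(docs, labels, term, cls):
    """2x2 Chi-square statistic: presence of `term` vs. membership in `cls`."""
    a = b = c = d = 0
    for doc, label in zip(docs, labels):
        present, in_cls = term in doc, label == cls
        if present and in_cls:
            a += 1          # term present, document in class
        elif present:
            b += 1          # term present, document in another class
        elif in_cls:
            c += 1          # term absent, document in class
        else:
            d += 1          # term absent, document in another class
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

def select_top_k(docs, labels, k):
    """Keep the k terms with the highest Chi-square score over all classes."""
    vocab = sorted({t for doc in docs for t in doc})
    classes = sorted(set(labels))
    scores = {t: max(chi_square(docs, labels, t, c) for c in classes)
              for t in vocab}
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(vocab, key=lambda t: (-scores[t], t))[:k]

# Hypothetical toy corpus; placeholder tokens, not the paper's dataset.
docs = [
    ["cricket", "match", "team"],
    ["cricket", "team", "win"],
    ["film", "actor", "release"],
    ["film", "music", "release"],
]
labels = ["sports", "sports", "cinema", "cinema"]

weights = tfidf(docs)
selected = select_top_k(docs, labels, 4)
# Restrict each TF-IDF vector to the selected terms before classification.
reduced = [{t: w for t, w in doc.items() if t in selected} for doc in weights]
```

In this toy corpus the terms that occur in every document of one class and in none of the other receive the highest Chi-square scores, so they are the ones retained. The reduced vectors would then be passed to a classifier such as a gradient boosting tree.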
