AN EFFICIENT FEATURE EXTRACTION WITH SUBSET SELECTION MODEL USING MACHINE LEARNING TECHNIQUES FOR TAMIL DOCUMENTS CLASSIFICATION
Journal: International Journal of Advanced Research in Engineering and Technology (IJARET) (Vol. 11, No. 11)
Publication Date: 2020-11-30
Authors : N. Rajkumar, T. S. Subashini, K. Rajan, V. Ramalingam
Page : 66-81
Keywords : Tamil Document classification; Machine learning; TF-IDF; Chi-Square;
Abstract
The growth of the internet has led to a significant rise in the number of electronic documents in several regional languages. As Tamil text data in digital form, both online and offline, continues to grow, managing and retrieving these documents has become a tedious process. Automatic text classification aims to assign predefined class labels to unclassified text documents. Many natural language processing (NLP) techniques are extremely dependent on the automatic classification of Tamil text documents. Recent developments in machine learning (ML) algorithms help to attain effective Tamil document classification. In this view, this paper introduces an automated Tamil document classification technique using ML models. The presented model involves four processes: preprocessing, feature extraction, feature selection, and classification. It uses the term frequency-inverse document frequency (TF-IDF) approach for feature extraction and the Chi-square test to select an optimal set of features. Finally, three ML models, namely random forest (RF), decision tree (DT), and gradient boosting tree (GBT), are applied to determine the class labels of the Tamil documents. To assess the performance of the presented model, a set of simulations was carried out on a Tamil document dataset collected by the authors. The experimental results indicate that the presented model classifies more effectively than the compared methods. In particular, the GBT model reached the maximum accuracy of 85.10%, with a precision of 87.01%, recall of 85.10%, and F1-score of 85.52%.
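The feature extraction and feature selection steps named in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the abstract does not state which TF-IDF variant or Chi-square formulation was used, so the sketch assumes length-normalised term frequency, idf = log(N/df), and the usual 2x2 contingency-table Chi-square statistic. The toy corpus, its English placeholder tokens, the class names, and the function names `tfidf`, `chi_square`, and `select_top_k` are all hypothetical. The classification step (RF, DT, or GBT) is left out; a classifier would be trained on the TF-IDF weights of the selected terms.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in counts.items()}
        )
    return weights


def chi_square(docs, labels, term, label):
    """Chi-square statistic of the 2x2 table (term present?) x (class == label?)."""
    a = b = c = d = 0
    for doc, y in zip(docs, labels):
        present = term in doc
        if present and y == label:
            a += 1  # term present, target class
        elif present:
            b += 1  # term present, other class
        elif y == label:
            c += 1  # term absent, target class
        else:
            d += 1  # term absent, other class
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def select_top_k(docs, labels, k):
    """Keep the k terms with the highest Chi-square score over any class."""
    vocab = sorted({t for doc in docs for t in doc})
    classes = sorted(set(labels))
    scores = {
        t: max(chi_square(docs, labels, t, c) for c in classes) for t in vocab
    }
    # Sort by descending score; ties are broken alphabetically.
    return sorted(vocab, key=lambda t: (-scores[t], t))[:k]


# Hypothetical toy corpus (placeholder tokens standing in for preprocessed Tamil text).
docs = [
    ["cricket", "match", "today"],
    ["cricket", "score", "today"],
    ["election", "vote", "today"],
    ["election", "result", "today"],
]
labels = ["sports", "sports", "politics", "politics"]

weights = tfidf(docs)
selected = select_top_k(docs, labels, k=2)
print(selected)
```

In this toy corpus, a term that appears in every document gets an idf of zero and a Chi-square score of zero, so it carries no weight and is never selected. Terms that occur in only one class score highest.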
Last modified: 2021-02-22 16:11:40