Review of feature selection methods for text classification
Journal: International Journal of Advanced Computer Research (IJACR), Vol. 10, No. 49
Publication Date: 2020-07-27
Authors: Muhammad Iqbal Malik Muneeb Abid Muhammad Noman Khalid; Amir Manzoor
Pages: 138-152
Keywords: Feature selection; Binary classification; Feature selection algorithms
Abstract
For the last three decades, the World Wide Web (WWW) has been one of the most widely used platforms, generating an immense amount of heterogeneous data every day. Presently, many organizations aim to process their domain data to make quick decisions and improve organizational performance. However, high dimensionality in datasets remains one of the biggest obstacles for researchers and domain engineers seeking to achieve the desired performance from their chosen machine learning (ML) algorithms. In ML, feature selection is a core technique for selecting the most relevant features of high-dimensional data, thereby improving the performance of the trained learning model. Moreover, feature selection eliminates irrelevant and redundant features and ultimately reduces computational time. Owing to its significance and applications, feature selection has become a well-researched area of ML. Nowadays, feature selection plays a vital role in most effective spam detection systems, pattern recognition systems, automated document organization and management, and information retrieval systems. Because accurate classification depends on selecting relevant features, this study starts with an overview of text classification, followed by a survey of the popular feature selection methods commonly used for text classification. The survey also sheds light on applications of feature selection methods. The focus of this study is three feature selection algorithms: Principal Component Analysis (PCA), Chi-Square (CS), and Information Gain (IG). This study is helpful for researchers looking for suitable criteria to choose a technique and to better understand classifier performance. Experiments were conducted on the web spam uk2007 dataset, from which subsets of ten, twenty, thirty, and forty features were selected as optimal subsets. Among the three feature selection algorithms, CS and IG achieved the highest F1-score (F-measure = 0.911) but at the same time suffered from longer model-building time.
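As a minimal illustrative sketch (not part of the paper), the snippet below shows how the three surveyed methods, CS, IG, and PCA, could be applied to a text-classification task with scikit-learn. The corpus, feature count k, vectorizer settings, and classifier are all assumptions for demonstration; the study itself uses the web spam uk2007 dataset and subsets of 10, 20, 30, and 40 features.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Illustrative binary text-classification corpus (a stand-in for web spam uk2007,
# which is not bundled with scikit-learn).
data = fetch_20newsgroups(subset="all", categories=["sci.space", "rec.autos"],
                          remove=("headers", "footers", "quotes"))
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42)

k = 40  # reduced feature-subset size, mirroring the largest subset in the study

selectors = {
    "Chi-Square (CS)": SelectKBest(chi2, k=k),
    "Information Gain (IG)": SelectKBest(mutual_info_classif, k=k),
    # PCA on sparse TF-IDF matrices is commonly approximated with TruncatedSVD (LSA).
    "PCA (via TruncatedSVD)": TruncatedSVD(n_components=k, random_state=42),
}

for name, selector in selectors.items():
    # Vectorize the text, reduce it to k features, then train a simple classifier
    # and report the F1-score, the evaluation metric used in the study.
    pipe = make_pipeline(TfidfVectorizer(max_features=2000), selector,
                         LogisticRegression(max_iter=1000))
    pipe.fit(X_train, y_train)
    print(f"{name}: F1 = {f1_score(y_test, pipe.predict(X_test)):.3f}")
```

Varying k over 10, 20, 30, and 40 would reproduce the style of comparison described in the abstract, trading classification quality against model-building time.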