An Analysis of Sindhi Annotated Corpus using Supervised Machine Learning Methods
Journal: Mehran University Research Journal of Engineering and Technology (Vol. 38, No. 1)
Publication Date: 2019-01-01
Authors : Mazhar Ali; Asim Imdad Wagan;
Page : 185-196
Keywords : Machine Learning; Sindhi Corpus; Universal part of speech; Random Forest; Support Vector Machines; Natural Language Processing;
Abstract
A linguistic corpus of the Sindhi language is important for computational linguistics, machine learning, identification and analysis of language features, semantic and sentiment analysis, information retrieval, and related tasks. Little computational linguistics work has been done on Sindhi text, whereas English, Arabic, Urdu, and some other languages are fully resourced computationally. The grammar and morphemes of these languages have been analyzed using different machine learning methods. Research and development work on computational linguistics for Sindhi is currently in progress. This study develops a Sindhi annotated corpus using the universal POS (Part of Speech) tag set and a Sindhi POS tag set for the analysis of language features and variation. Features are extracted using the TF-IDF (Term Frequency-Inverse Document Frequency) technique. A supervised machine learning model is developed to assess the annotated corpus and its grammatical annotation of the Sindhi language. The model is trained on 80% of the annotated corpus and tested on the remaining 20%. Ten-fold cross-validation is used to evaluate and validate the model. The results show that the model performs well and confirm that the Sindhi corpus is properly annotated. The study identifies a number of research gaps for further work on topic modeling, language variation, and sentiment and semantic analysis of the Sindhi language.
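The evaluation protocol named in the abstract (TF-IDF features, an 80/20 train/test split, and 10-fold cross-validation) can be illustrated with a minimal standard-library sketch. It is not the authors' implementation: the function names `tf_idf`, `train_test_split`, and `k_fold` are hypothetical, the TF-IDF variant (relative term frequency times the natural log of N/df) is one common choice assumed here, and the classifier itself (the keywords mention Random Forest and Support Vector Machines) is left out.

```python
import math
import random
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


def train_test_split(n_items, train_fraction=0.8, seed=0):
    """Shuffle item indices and split them into train and test lists."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    cut = round(n_items * train_fraction)
    return indices[:cut], indices[cut:]


def k_fold(indices, k=10):
    """Yield (train, validation) index lists, one pair per fold."""
    folds = [indices[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        rest = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield rest, held_out
```

In this sketch each annotated sentence would be one tokenized document. A term that occurs in every document receives weight 0, which is why TF-IDF emphasizes tokens that distinguish one sentence or tag context from another.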
Last modified: 2019-02-01 02:39:08