
Identification and extraction of multiword expressions from Hindi & Urdu language in natural language processing

Journal: International Journal of Advanced Technology and Engineering Exploration (IJATEE) (Vol.9, No. 91)


Page : 807-826

Keywords : Bigrams; Tags; Multiword expression (MWE); Conditional random field (CRF); Confusion matrix


Abstract

Text can be translated from one language to another using statistical machine translation, but gaps remain in the translations because language resources are scarce. Building a linguistic corpus requires the extraction of multiword expressions (MWEs). An MWE is a group of words that behaves like an idiomatic expression. Because the meaning of an MWE cannot be composed from the meanings of its individual words, identifying and extracting MWEs is a time-consuming task. An automated system was therefore developed to extract MWEs from Hindi and Urdu text. The process comprises tagging, pattern matching, an identification algorithm, and extraction of MWEs from the data. Each word is tagged with a single part-of-speech tag, and the tagged text serves as input to the pattern-matching algorithm. Pattern matching selects tag sequences of specific patterns as MWE candidates, and the algorithm for automatic MWE detection is built on top of that selection. A conditional random field model (CRF++) was used to extract the MWEs from the data automatically. The proposed system was evaluated automatically using a confusion matrix. The calculated overall accuracy is 96.82% for Hindi and 96.62% for Urdu.
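
As a rough illustration of two steps named in the abstract, the sketch below selects candidate bigrams by part-of-speech pattern and computes overall accuracy from a confusion matrix. It is a minimal sketch under stated assumptions, not the authors' implementation: the tag patterns, the function names, the transliterated example sentence, and the B/I/O labels are all hypothetical, and the CRF++ training step is not shown.

```python
from collections import Counter

# Hypothetical POS-tag patterns for candidate bigram MWEs.
# The abstract does not list the patterns actually used.
CANDIDATE_PATTERNS = {("NN", "NN"), ("NN", "VB"), ("JJ", "NN")}


def candidate_bigrams(tagged_sentence, patterns=CANDIDATE_PATTERNS):
    """Return adjacent word pairs whose POS tags match one of the patterns."""
    pairs = zip(tagged_sentence, tagged_sentence[1:])
    return [(w1, w2) for (w1, t1), (w2, t2) in pairs if (t1, t2) in patterns]


def confusion_matrix(gold, predicted):
    """Count (gold label, predicted label) pairs."""
    return Counter(zip(gold, predicted))


def accuracy(gold, predicted):
    """Overall accuracy: diagonal of the confusion matrix over all counts."""
    matrix = confusion_matrix(gold, predicted)
    correct = sum(n for (g, p), n in matrix.items() if g == p)
    total = sum(matrix.values())
    return correct / total if total else 0.0


# Made-up, transliterated example sentence with simplified tags.
sentence = [("raam", "NNP"), ("ne", "PSP"), ("kaam", "NN"),
            ("karnaa", "VB"), ("shuru", "NN"), ("kiyaa", "VB")]
candidates = candidate_bigrams(sentence)

# Made-up gold and predicted label sequences for four tokens.
score = accuracy(["B", "I", "O", "O"], ["B", "I", "O", "B"])
```

In a pipeline like the one described, the selected candidates would presumably become labeled training data for the sequence model, and the accuracy would be computed over the model's predicted labels for held-out text.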

Last modified: 2022-08-08 17:44:41