Identification and extraction of multiword expressions from Hindi & Urdu language in natural language processing

Journal: International Journal of Advanced Technology and Engineering Exploration (IJATEE) (Vol.9, No. 91)

Publication Date: 2022-06-30

Authors : Vaishali Gupta; Nisheeth Joshi

Page : 807-826

Keywords : Bigrams; Tags; Multiword expression (MWE); Conditional random field (CRF); Confusion matrix

Abstract

Statistical machine translation can translate text from one language to another, but the translations still have gaps because language resources are scarce. Building a linguistic corpus requires the extraction of multiword expressions (MWEs). An MWE is a group of words that together behave as an idiomatic unit. Because the meaning of an MWE is non-compositional, that is, it cannot be derived from the meanings of its individual words, identifying and extracting MWEs is a time-consuming task. To address this, an automated system has been developed for extracting MWEs from Hindi and Urdu language sources. The process comprises tagging, pattern matching, an identification algorithm, and the extraction of MWEs from the data. Each word is first tagged with a part-of-speech tag, and the tagged text serves as input to the pattern-matching algorithm. Pattern matching selects tag sequences of specific patterns as MWE candidates, and the algorithm for automatic MWE detection is built on top of these patterns. A conditional random field model (CRF++) was used to automatically extract the MWEs from the data. The proposed system was evaluated automatically using a confusion matrix. The calculated overall accuracy is 96.82% for Hindi and 96.62% for Urdu.
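The pattern-matching and evaluation steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the tag names (`NN`, `VM`, `JJ`), the pattern set `CANDIDATE_PATTERNS`, the transliterated example sentence, and the confusion-matrix counts are all assumptions chosen for illustration, and the CRF++ training step is omitted entirely.

```python
from typing import List, Tuple

# Hypothetical POS-tag bigram patterns treated as MWE candidates.
# The abstract does not list the actual patterns; these are assumptions.
CANDIDATE_PATTERNS = {("NN", "NN"), ("NN", "VM"), ("JJ", "NN")}


def candidate_bigrams(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return adjacent word pairs whose tag pair matches a candidate pattern."""
    matches = []
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        if (t1, t2) in CANDIDATE_PATTERNS:
            matches.append((w1, w2))
    return matches


def accuracy(confusion: List[List[int]]) -> float:
    """Overall accuracy: correct predictions (the diagonal) over all predictions."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total if total else 0.0


# Illustrative, transliterated Hindi sentence with assumed tags.
sentence = [("vah", "PRP"), ("kaam", "NN"), ("karna", "VM"),
            ("chahta", "VM"), ("hai", "VAUX")]
candidates = candidate_bigrams(sentence)

# Made-up 2x2 confusion matrix (rows: actual, columns: predicted).
example_accuracy = accuracy([[50, 2], [3, 45]])
```

Here `candidates` holds the single noun-verb pair `("kaam", "karna")`, and `example_accuracy` is 0.95. In a pipeline like the one described, such candidates would be passed on to a sequence labeller such as CRF++ rather than accepted directly.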
