Learning Word and Sub-word Vectors for Amharic (Less Resourced Language)
Journal: International Journal of Advanced Engineering Research and Science (Vol. 7, No. 8)
Publication Date: 2020-08-09
Authors: Abebawu Eshetu, Getenesh Teshome, Tewodros Abebe
Pages: 359-366
Keywords: Amharic; word vectors; FastText; word2vec
Abstract
The availability of pre-trained word embedding models (also known as word vectors) has empowered many tasks in natural language processing, leading to state-of-the-art performance. A key ingredient in the successful application of these distributed word representations is the existence of large curated corpora on which to train them and of pre-trained models to use in downstream tasks. In this paper, we describe how we trained such quality word representations for Amharic, one of the less-resourced Ethiopian languages. We used several offline and online data sources and created 100-, 200-, and 300-dimensional word2vec and FastText word vectors. We also introduce a new word-analogy dataset for evaluating Amharic word vectors. In addition, we created an Amharic SentencePiece model, which can be used to encode and decode words for subsequent NLP tasks. Using this SentencePiece model, we created Amharic sub-word word2vec embeddings with 25, 50, 100, 200, and 300 dimensions, trained over our large curated dataset. Finally, we evaluated our pre-trained word vectors on both an intrinsic word-analogy task and an extrinsic downstream NLP task. The results show promising performance on both intrinsic and extrinsic evaluations compared to previously released models.
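The sub-word modeling the abstract describes follows the FastText idea of representing a word by its character n-grams, which is especially useful for a morphologically rich script like Amharic's. A minimal sketch of FastText-style n-gram extraction (the standard `<`/`>` boundary markers and default n-gram lengths 3-6; the function name and Amharic example word are illustrative, not from the paper):

```python
def char_ngrams(word, minn=3, maxn=6):
    """FastText-style sub-word extraction (illustrative sketch).

    Pads the word with '<' and '>' boundary markers, then collects
    every character n-gram of length minn..maxn. The word's vector is
    the sum of the vectors of these n-grams, so rare or unseen Amharic
    word forms still receive a representation.
    """
    padded = f"<{word}>"
    grams = []
    for n in range(minn, maxn + 1):
        for i in range(len(padded) - n + 1):
            grams.append(padded[i:i + n])
    return grams

# Example with a 4-character Amharic word ("Amharic" in Amharic):
print(char_ngrams("አማርኛ"))
```

Because the padded form `<አማርኛ>` has 6 characters, this yields 4 + 3 + 2 + 1 = 10 n-grams, including the full word `<አማርኛ>` itself as the single 6-gram.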