Terms extraction from texts of scientific papers

Journal: Software & Systems (Vol.35, No. 4)

Publication Date:

Authors : ; ; ;

Page : 689-697

Keywords : terms dictionary; rubert; mbert; language model; machine learning; nlp; terminology extraction;

Abstract

Extracting terms from the texts of scientific articles is a relevant task because an ever-growing flow of scientific and technical documents requires automatic annotation and keyword extraction. This paper examines how the choice of language model affects the quality of scientific term extraction from Russian texts. Two models are compared: mBERT, which was pretrained on texts in many languages, and ruBERT, which was pretrained only on Russian data. The authors prepared two training sets of annotated texts, fine-tuned both models on them, and compared the resulting performance indicators. The results served as the basis for modernizing the terminology extraction algorithm used by the Terminator tool, developed at the A.P. Ershov Institute of Informatics Systems. They show that, for extracting terminology from Russian scientific articles, ruBERT is the most applicable model; it performed best in an ensemble with a dictionary and heuristics. The difference between the models' results on full match and partial match can be attributed to the difficulty of defining term boundaries in texts, a problem described in the paper. The results also indicate that the quality of the training set markup affects the quality of terminology extraction.
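
The distinction between full and partial match can be illustrated with a minimal Python sketch. It is not the evaluation code used in the paper: the abstract does not define the metrics, so the sketch assumes that a full match requires identical span boundaries and that a partial match requires only that a predicted span and a gold span overlap. The names `Span`, `overlaps`, and `precision_recall_f1` are hypothetical.

```python
from typing import List, Tuple

# A term occurrence as (start, end) token offsets, end exclusive.
# Assumption: terms are scored as spans; the paper may score them differently.
Span = Tuple[int, int]


def overlaps(a: Span, b: Span) -> bool:
    """Return True if the two spans share at least one token."""
    return a[0] < b[1] and b[0] < a[1]


def precision_recall_f1(
    predicted: List[Span], gold: List[Span], partial: bool = False
) -> Tuple[float, float, float]:
    """Score predicted term spans against gold spans.

    partial=False: a span counts only if its boundaries are identical.
    partial=True:  a span counts if it overlaps any span on the other side.
    """
    if partial:
        matched_pred = sum(any(overlaps(p, g) for g in gold) for p in predicted)
        matched_gold = sum(any(overlaps(g, p) for p in predicted) for g in gold)
    else:
        gold_set, pred_set = set(gold), set(predicted)
        matched_pred = sum(p in gold_set for p in predicted)
        matched_gold = sum(g in pred_set for g in gold)

    precision = matched_pred / len(predicted) if predicted else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Illustrative data only: the second prediction clips the left boundary
# of a gold term, and the third prediction is spurious.
gold_spans = [(0, 2), (5, 8), (10, 11)]
predicted_spans = [(0, 2), (6, 8), (12, 13)]

full_scores = precision_recall_f1(predicted_spans, gold_spans, partial=False)
partial_scores = precision_recall_f1(predicted_spans, gold_spans, partial=True)
```

In this toy example the prediction with a clipped boundary is rejected under full match but accepted under partial match, so the partial-match scores are higher. A gap of this kind between the two settings is what one would expect when a model finds terms but misplaces their boundaries.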

Last modified: 2023-08-03 19:04:51