Terms extraction from texts of scientific papers
Journal: Software & Systems (Vol. 35, No. 4)
Publication Date: 2022-12-16
Authors: Dementeva Ya.Yu.; Bruches E.P.; Batura T.V.
Pages: 689-697
Keywords: terms dictionary; ruBERT; mBERT; language model; machine learning; NLP; terminology extraction
Abstract
Extracting terms from the texts of scientific articles is relevant because automatic annotation and keyword extraction are needed for an ever-growing flow of scientific and technical documents. This paper explores how the choice of language model affects the quality of scientific term extraction from Russian texts. We compare two models: mBERT, pretrained on texts in different languages, and ruBERT, pretrained only on Russian data. We prepared two training sets of annotated texts, fine-tuned both models on them, and compared their performance. The results served as the basis for updating the terminology extraction algorithm used by the Terminator tool, developed at the A.P. Ershov Institute of Informatics Systems. They show that, for extracting terminology from Russian scientific articles, ruBERT can be considered the most applicable model; it performed best in an ensemble with a dictionary and heuristics. In addition, the models' results differ between full and partial match, which can be attributed to the problem of defining term boundaries in text, as described in the paper. The results also suggest that the quality of the training set annotation affects the quality of terminology extraction.
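The gap between full-match and partial-match results mentioned in the abstract can be illustrated with a small sketch. The abstract does not define how either score is computed, so the code below rests on assumptions: terms are represented as character-offset spans, a full match requires identical boundaries, and a partial match requires any overlap with a gold span. The names `Span`, `overlaps`, and `score` are hypothetical and are not taken from the paper or from the Terminator tool.

```python
from typing import List, Tuple

# Assumption: a term is a (start, end) character span, end exclusive.
Span = Tuple[int, int]


def overlaps(a: Span, b: Span) -> bool:
    """Return True if the two spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def score(pred: List[Span], gold: List[Span], partial: bool = False):
    """Compute (precision, recall, F1) for predicted term spans.

    Full match: a predicted span counts only if its boundaries equal a gold span.
    Partial match (assumed definition): any overlap with a gold span counts.
    """
    if partial:
        hit_pred = sum(any(overlaps(p, g) for g in gold) for p in pred)
        hit_gold = sum(any(overlaps(g, p) for p in pred) for g in gold)
    else:
        gold_set, pred_set = set(gold), set(pred)
        hit_pred = sum(p in gold_set for p in pred)
        hit_gold = sum(g in pred_set for g in gold)
    precision = hit_pred / len(pred) if pred else 0.0
    recall = hit_gold / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Illustrative spans only, not data from the paper.
gold = [(0, 14), (20, 35)]
pred = [(0, 14), (24, 35), (40, 45)]  # second span misses the left boundary

full_scores = score(pred, gold)                   # strict boundaries
partial_scores = score(pred, gold, partial=True)  # overlap is enough
```

In this toy case the second predicted span cuts off the beginning of a gold term, so it counts as an error under full match but as a hit under partial match. That is the kind of boundary disagreement that makes the two scores diverge.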