DISTRIBUTIVE DICTIONARY OF THE HISTORICAL CORPUS “MANUSCRIPT”: PROBLEM STATEMENT, MATERIAL, METHODS
Journal: Current Issues in Philology and Pedagogical Linguistics (Vol.-, No. 2)
Publication Date: 2022-06-25
Authors: Baranov V.A.
Pages: 94-106
Keywords: historical corpus of Slavonic manuscripts of the 10th–15th centuries; lexical distribution; online corpus manager
Abstract
This article describes the linguistic material and the methods used to create an electronic distributive dictionary based on the historical corpus “Manuscript” (http://manuscripts.ru/mns/mns_evp.vec.main), which contains marked-up, machine-readable transcriptions of extant Slavonic manuscripts and excerpts of the 10th–15th centuries. It discusses the conditions under which statistical methods can be applied to the distributive analysis of words in ancient Slavonic texts, formulates requirements for specialized tools, and demonstrates the forms of visualization used in the dictionary prototype. Examples are given of methods for automatically extracting words with a similar lexical environment from a large array of text data. The procedures and tools for preparing linguistic data are described, in particular the formation of subcorpora based on metadata and the methods implemented in the n-gram module for extracting the most frequent combinations of linguistic units from the corpus. The use of the k-skip-n-gram method for calculating word vectors and of the cosine distance between vectors is justified. The parameters of the dictionary query form are demonstrated: the form allows the user to specify the type of the analyzed linguistic unit (lemma or text precedent), its mask, and the cosine distance threshold. A sample for the lemma лѣто ‘summer’ is given as an example; it includes a list of words with the closest contextual compatibility, the cosine distances between the analyzed word and the words close to it in distribution, and a list of words that occur next to both the analyzed word and the words found. A graph of the sample is shown; it demonstrates not only semantic, thematic, and associative analogues of the word лѣто ‘summer’ but also some groups of associates.
The analysis of the material, methods, and results leads to conclusions about the need to use statistical measures when assessing the proximity of the components used to form the vectors, and about some other conditions for preprocessing the linguistic material.
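The abstract names two techniques, k-skip-n-grams for building word vectors and cosine distance for comparing them. The following is a minimal sketch of that combination, not the implementation used in the “Manuscript” corpus manager. It assumes n = 2 (skip-bigrams), raw co-occurrence counts as vector components, and a query that returns words within a cosine distance threshold together with the shared neighbouring words. All function names are hypothetical, and the token sequences are invented English placeholders, not corpus data.

```python
from collections import Counter, defaultdict
from math import sqrt


def skip_bigrams(tokens, k):
    """Yield ordered pairs of tokens with at most k tokens skipped between them."""
    for i, left in enumerate(tokens):
        # j ranges over the next k + 1 positions (0 to k skipped tokens).
        for j in range(i + 1, min(i + k + 2, len(tokens))):
            yield left, tokens[j]


def context_vectors(sentences, k=2):
    """Build sparse count vectors: word -> Counter of words co-occurring with it."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for left, right in skip_bigrams(tokens, k):
            vectors[left][right] += 1
            vectors[right][left] += 1
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = sqrt(sum(value * value for value in a.values()))
    norm_b = sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest(vectors, target, max_distance=0.5):
    """Return (word, cosine distance, shared context words), closest first."""
    target_vec = vectors.get(target, Counter())
    results = []
    for word, vec in vectors.items():
        if word == target:
            continue
        distance = 1.0 - cosine_similarity(target_vec, vec)
        if distance <= max_distance:
            shared = sorted(target_vec.keys() & vec.keys())
            results.append((word, distance, shared))
    return sorted(results, key=lambda item: item[1])


if __name__ == "__main__":
    # Invented placeholder sequences standing in for lemmatized text.
    toy_sentences = [
        ["in", "summer", "there", "was", "drought"],
        ["in", "winter", "there", "was", "frost"],
        ["in", "summer", "the", "prince", "went"],
    ]
    toy_vectors = context_vectors(toy_sentences, k=1)
    for word, distance, shared in nearest(toy_vectors, "summer", max_distance=0.6):
        print(word, round(distance, 3), shared)
```

Raw counts are the simplest choice of vector component; the abstract's conclusion about statistical measures suggests that a weighted association score would be substituted for them in practice, which this sketch does not attempt.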