Designing a tagset for annotating the Tuvan National Corpus

Journal: International Journal of Language Studies (Vol. 6, No. 4)

Publication Date: 2012-10-01

Authors : Aziyana BAYYR-OOL; Vitaly VOINOV;

Page : 1-24

Keywords : Corpus Annotation; Tagset Design; Tuvan; Turkic;

Abstract

This paper examines various aspects of designing a part-of-speech (POS) tagset for annotating a textual corpus in the Tuvan language of Siberia (Turkic family). The issues raised are relevant by extension to designing tagsets for other languages. The preliminary issues discussed are Tuvan linguistic structure, the rationale for preferring a POS tagset at the initial stages of corpus design, the metalanguage and orthography of the tagset, and the potential usefulness of existing tagsets for designing a new one. The paper then presents the specific linguistic attributes that are encoded in the Tuvan tagset, using the three-level model of Major Class > Subclass > Features. Difficulties involved in deciding whether a specific type of word is a major class or a subclass are illustrated with Tuvan language data. The actual structure of the individual tags to be used in the tagset is also discussed, examining several existing models that differ in terms of transparency and level of linguistic detail. Sample Tuvan words that have been tagged using the system laid out in the paper are provided to illustrate how this tagset design facilitates searching for decomposable morphosyntactic elements relevant to the grammatical structure of Tuvan (as well as that of other Turkic languages).
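
As a rough illustration of why a decomposable Major Class > Subclass > Features tag makes searching easier, the following minimal Python sketch may help. It is not the paper's tagset: the tag string layout ("MAJOR.subclass:feat1.feat2"), the labels (N, com, acc and so on), and the placeholder tokens are all hypothetical and chosen only for demonstration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tag:
    """One tag split into the three levels: major class, subclass, features."""
    major: str
    subclass: str
    features: frozenset


def parse_tag(text: str) -> Tag:
    """Parse a hypothetical 'MAJOR.subclass:feat1.feat2' tag string."""
    head, _, feats = text.partition(":")
    major, _, subclass = head.partition(".")
    # Drop empty pieces so that a tag without features yields an empty set.
    features = frozenset(f for f in feats.split(".") if f)
    return Tag(major, subclass, features)


def matches(tag: Tag, major=None, subclass=None, features=()) -> bool:
    """Return True if the tag satisfies every level that was specified."""
    if major is not None and tag.major != major:
        return False
    if subclass is not None and tag.subclass != subclass:
        return False
    return set(features) <= tag.features


# Placeholder tokens, not real Tuvan data; tag labels are assumptions.
corpus = [
    ("token1", parse_tag("N.com:pl.acc")),
    ("token2", parse_tag("N.prop:sg.dat")),
    ("token3", parse_tag("V.fin:pst.3")),
]

# Find every noun carrying the (hypothetical) accusative feature,
# regardless of subclass or of any other features on the tag.
accusative_nouns = [
    token for token, tag in corpus if matches(tag, major="N", features=["acc"])
]
```

Because each level is stored separately, a query can fix any one level and leave the others open, which is the kind of search a decomposable tag is meant to support.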
