Designing a tagset for annotating the Tuvan National Corpus
Journal: International Journal of Language Studies (Vol. 6, No. 4)
Publication Date: 2012-10-01
Authors: Aziyana BAYYR-OOL; Vitaly VOINOV
Pages: 1-24
Keywords: Corpus Annotation; Tagset Design; Tuvan; Turkic
Abstract
This paper examines various aspects of designing a part-of-speech (POS) tagset for annotating a textual corpus in the Tuvan language of Siberia (Turkic family). The issues raised are relevant by extension to designing tagsets in other languages. Preliminary issues discussed are Tuvan linguistic structure, the rationale for preferring a POS tagset at initial stages of corpus design, the metalanguage and orthography of the tagset, and the potential usefulness of existing tagsets for designing a new tagset. The paper then presents the specific linguistic attributes that are encoded in the Tuvan tagset, using the three-level model of Major Class > Subclass > Features. Difficulties involved in deciding whether a specific type of word is a major class or a subclass are illustrated with Tuvan language data. The actual structure of the individual tags to be used in the tagset is also discussed, examining several existing models that differ in terms of transparency and level of linguistic detail. Sample Tuvan words that have been tagged using the system laid out in the paper are provided to illustrate how this tagset design facilitates searching for decomposable morphosyntactic elements relevant to the grammatical structure of Tuvan (as well as that of other Turkic languages).
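As a rough illustration of why a decomposable Major Class > Subclass > Features tag makes searching easier, the sketch below models such a tag in Python. Everything concrete in it is an assumption made for illustration: the surface format `MAJOR.SUBCLASS:FEAT+FEAT`, the labels (`N`, `COM`, `ACC`, and so on), and the placeholder tokens `w1` to `w4` are hypothetical and are not the tagset or the data presented in the paper.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Tag:
    """A three-level tag: Major Class > Subclass > Features."""
    major: str
    subclass: Optional[str] = None
    features: FrozenSet[str] = frozenset()

    @classmethod
    def parse(cls, text: str) -> "Tag":
        # Hypothetical surface format: MAJOR[.SUBCLASS][:FEAT+FEAT...]
        head, _, feats = text.partition(":")
        major, _, sub = head.partition(".")
        feature_set = frozenset(f for f in feats.split("+") if f)
        return cls(major, sub or None, feature_set)

    def matches(
        self,
        major: Optional[str] = None,
        subclass: Optional[str] = None,
        features: Iterable[str] = (),
    ) -> bool:
        # Each level is checked independently, so a query may name
        # any combination of class, subclass, and features.
        if major is not None and self.major != major:
            return False
        if subclass is not None and self.subclass != subclass:
            return False
        return set(features) <= self.features


def search(
    corpus: Iterable[Tuple[str, Tag]],
    major: Optional[str] = None,
    subclass: Optional[str] = None,
    features: Iterable[str] = (),
) -> List[str]:
    """Return the tokens whose tags satisfy every given constraint."""
    wanted = list(features)
    return [tok for tok, tag in corpus if tag.matches(major, subclass, wanted)]


# Placeholder tokens (not real Tuvan words) with hypothetical tags.
corpus = [
    ("w1", Tag.parse("N.COM:PL+ACC")),
    ("w2", Tag.parse("V.FIN:PST+3SG")),
    ("w3", Tag.parse("N.PROP:ACC")),
    ("w4", Tag.parse("ADJ")),
]

accusatives = search(corpus, features=["ACC"])
common_nouns = search(corpus, major="N", subclass="COM")
print(accusatives)
print(common_nouns)
```

Because each level is stored separately, a query such as "every token carrying a given case feature, whatever its word class" needs no pattern matching over an opaque tag string. A more compact, less transparent tag format would trade that convenience for brevity.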