Derivational / morphemic structure

The Main corpus of the RNC can be searched by the derivational / morphemic structure of a word.

In the Word at a glance service, the morphemic structure of each word is visualized: prefixes, roots, suffixes and endings are highlighted using the geometric symbols conventionally used in Russian school teaching. This analysis is available in the Main and Educational corpora.

The structures are given only for lexemes in their dictionary form (за-щищ-а-ть-ся), not for inflected word forms such as защищающимися.

Dictionary-based and automatic annotation

The word structure annotation in the Main corpus is based on the morphemic dictionary specially developed for the corpus, which provides analyses for 75,000 lemmas (310,000 non-unique morphemes) as of May 2023. The markup of morphemes in the Educational corpus is based on Alexander Tikhonov's Morpheme-Orthographic Dictionary (2002), which covers about 100,000 lexemes. In the Educational corpus, only content words and common nouns are annotated, whereas in the Main corpus morphemic structures are also provided for function words and proper names. For each word, the annotation gives the list of morphemes, the type of each morpheme (prefix, root, interfix, suffix, ending or postfix) and its linear position in the word.
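
As a rough illustration of the record described above (morpheme, type, linear position), the sketch below models one analysis in Python. The class and field names are invented for this example, positions are assumed to be 1-based, and the type labels given to individual morphemes (in particular -ть) are illustrative guesses rather than values taken from the corpus dictionary.

```python
from dataclasses import dataclass

# Morpheme types listed in the text.
MORPHEME_TYPES = ("prefix", "root", "interfix", "suffix", "ending", "postfix")


@dataclass(frozen=True)
class Morpheme:
    text: str      # the morpheme as written
    kind: str      # one of MORPHEME_TYPES
    position: int  # linear position in the word, assumed 1-based


def build_analysis(parts):
    """Turn (text, kind) pairs into Morpheme records numbered from 1."""
    analysis = []
    for position, (text, kind) in enumerate(parts, start=1):
        if kind not in MORPHEME_TYPES:
            raise ValueError(f"unknown morpheme type: {kind!r}")
        analysis.append(Morpheme(text, kind, position))
    return analysis


# за-щищ-а-ть-ся: the segmentation is quoted in the text,
# the type labels are assumptions made for this example.
zashchishchatsya = build_analysis([
    ("за", "prefix"),
    ("щищ", "root"),
    ("а", "suffix"),
    ("ть", "suffix"),
    ("ся", "postfix"),
])

print("-".join(m.text for m in zashchishchatsya))
print([(m.text, m.position) for m in zashchishchatsya if m.kind == "root"])
```

Storing the position explicitly, rather than deriving it from list order, keeps each record meaningful on its own when morphemes are indexed individually.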

In the Word at a glance service within the Main corpus, automatic annotation is added for lemmas that are absent from the morphemic dictionary, including some fairly frequent lexical items. For example, the word эстетика is not included in the morphemic dictionary, so its structure (эстет-ик-а) is predicted by the algorithm. Automatic morphemic parsing is performed by an algorithm based on the RuRoBERTa-large model, fine-tuned for the task of morphemic segmentation. The model was developed using an architecture created by RNC staff members. The proportion of fully correct analyses exceeds 93.5%. Automatic analyses are marked with a special tag, "generated by NeuroRNC". In the current version of the Educational corpus, the morphemic structure is provided only for words included in Tikhonov's dictionary. However, there are plans to annotate the morphemic structure of all content words using a neural network algorithm.
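
The dictionary-first, model-second logic described above can be sketched as follows. This is a minimal illustration, not the corpus implementation: the function names are hypothetical, the dictionary is a one-entry toy, and the "model" is a placeholder that returns a canned answer instead of calling the fine-tuned RuRoBERTa-large model.

```python
def analyse(lemma, dictionary, predict):
    """Return (segmentation, source) for a lemma.

    The dictionary analysis is preferred; the predictor is consulted only
    for lemmas the dictionary lacks, and its output is tagged as generated.
    """
    if lemma in dictionary:
        return dictionary[lemma], "dictionary"
    return predict(lemma), "generated by NeuroRNC"


# Toy stand-in for the morphemic dictionary (one entry quoted in the text).
toy_dictionary = {"защищаться": "за-щищ-а-ть-ся"}


def toy_predict(lemma):
    """Placeholder for the segmentation model: returns a canned answer."""
    canned = {"эстетика": "эстет-ик-а"}
    return canned[lemma]


print(analyse("защищаться", toy_dictionary, toy_predict))
print(analyse("эстетика", toy_dictionary, toy_predict))
```

Returning the source alongside the segmentation is what lets an interface show the "generated by NeuroRNC" tag only on predicted analyses.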

Errors of automatic annotation are always possible. Please report errors using the "Rate" button. Note that the morphemic structuring of words in the Main corpus may differ from what you are accustomed to (see "Principles of annotation").

The word-by-word morphemic annotation searchable within the Main corpus does not yet use the neural network algorithm. It also uses an earlier version of the morphemic dictionary.

Principles of annotation

The morphemic dictionary of the Main corpus owes much to the approach of the Morphemic dictionary of Russian by Ariadna Kuznetsova and Tatiana Efremova [А. И. Кузнецова и Т. Ф. Ефремова. Словарь морфем русского языка. М., 1986]. Their method splits words into morphemes extensively, though not maximally, and relies on correspondences with other words of similar structure. The morphemic analysis in the corpus does not always coincide with the one accepted in Russian school education. For example, the word улыбаться is analyzed with the root -лыб-, as its structure parallels that of other verbs with у- (cf. у-смех-а-ть-ся); some words are analyzed etymologically (на-сек-ом-ое, вос-точ-н-ый). Borrowings are split into morphemes (e.g. ре-волюц-и-я, квит-анци-я) if they have semantic parallels to other borrowings with a comparable structure (cf. э-волюц-и-я, рас-квит-а-ть-ся). Morphemic structures are also given for function words, proper names and words derived from them.

In the Educational corpus, a more traditional, school-oriented approach is taken. The underlying dictionary here is Alexander Tikhonov's Morpheme-Orthographic Dictionary (2002). Words are split into larger chunks (улыб-а-ть-ся, насеком-ое, восточ-н-ый), especially borrowings (революци-я, квитанци-я). Morphemic structures are given only for content words, excluding proper names.
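
To make the difference in granularity concrete, the short sketch below counts morphemes in the segmentations quoted in this section for the two corpora. Only the segmentations themselves come from the text; the variable and function names are invented for the example.

```python
# Segmentations quoted in the text: lemma -> (Main corpus, Educational corpus).
PAIRS = {
    "насекомое": ("на-сек-ом-ое", "насеком-ое"),
    "восточный": ("вос-точ-н-ый", "восточ-н-ый"),
    "революция": ("ре-волюц-и-я", "революци-я"),
    "квитанция": ("квит-анци-я", "квитанци-я"),
}


def morpheme_count(segmentation):
    """Number of hyphen-separated morphemes in a segmentation string."""
    return len(segmentation.split("-"))


counts = {}
for lemma, (main, educational) in PAIRS.items():
    # Both segmentations must spell the same lemma once hyphens are removed.
    assert main.replace("-", "") == educational.replace("-", "") == lemma
    counts[lemma] = (morpheme_count(main), morpheme_count(educational))
    print(lemma, counts[lemma])
```

In every pair the Main corpus analysis has more morphemes than the Educational one, which is the "larger chunks" contrast described above.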

The morphemic search is supported in the Main and Educational corpora. It is specified in the Morphemic structure section within the Lemmas and tags search. By default this feature is not active but it may be added via the Add condition button.

The user may specify one or more parameters: the morpheme itself, the type of the morpheme and its linear position. For example, a search for the root бав in the third position yields words like вдобавок or позабавить.
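
The way the three parameters combine can be sketched with a toy lexicon. This is an illustration only, not the corpus search engine: the function name is hypothetical, the segmentations and type labels in the lexicon are assumptions made for the example, and positions are counted from 1.

```python
# Toy lexicon: lemma -> list of (morpheme, type). Assumed data, for illustration.
LEXICON = {
    "вдобавок": [("в", "prefix"), ("до", "prefix"), ("бав", "root"),
                 ("ок", "suffix")],
    "позабавить": [("по", "prefix"), ("за", "prefix"), ("бав", "root"),
                   ("и", "suffix"), ("ть", "suffix")],
    "добавить": [("до", "prefix"), ("бав", "root"), ("и", "suffix"),
                 ("ть", "suffix")],
}


def search(lexicon, text=None, kind=None, position=None):
    """Return lemmas containing one morpheme that meets every given condition.

    Conditions left as None are ignored; the conditions that are set must
    all hold for the same morpheme.
    """
    hits = []
    for lemma, morphemes in lexicon.items():
        for index, (m_text, m_kind) in enumerate(morphemes, start=1):
            if text is not None and m_text != text:
                continue
            if kind is not None and m_kind != kind:
                continue
            if position is not None and index != position:
                continue
            hits.append(lemma)
            break  # one matching morpheme is enough for this lemma
    return hits


print(search(LEXICON, text="бав", kind="root", position=3))
print(search(LEXICON, text="бав"))
```

With the position fixed at 3, добавить drops out because its root is the second morpheme; without the position condition all three lemmas match.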

The Including alternants parameter extends the search to all allomorphs of a given morpheme. Without this parameter, the root -ук- is found only in the word наука, whereas with alternants the search also returns учить, ученый, etc.
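
The effect of the alternants option can be sketched as query expansion over groups of allomorphs. The grouping table, the lexicon and all names below are hypothetical; in particular, treating уч as the only alternant of ук, and as the root form found in учить and ученый, is an assumption made for the example.

```python
# Hypothetical allomorph groups: each set holds variants of one root.
ALTERNANT_GROUPS = [{"ук", "уч"}]

# Toy lexicon: lemma -> root form as it appears in that lemma (assumed data).
ROOTS = {
    "наука": "ук",
    "учить": "уч",
    "ученый": "уч",
    "вдобавок": "бав",
}


def expand(root, include_alternants):
    """Return the set of root forms to look for."""
    if include_alternants:
        for group in ALTERNANT_GROUPS:
            if root in group:
                return set(group)
    return {root}


def search_root(roots, root, include_alternants=False):
    """Return the set of lemmas whose root is among the wanted forms."""
    wanted = expand(root, include_alternants)
    return {lemma for lemma, form in roots.items() if form in wanted}


print(search_root(ROOTS, "ук"))
print(search_root(ROOTS, "ук", include_alternants=True))
```

Expanding the query rather than normalizing the lexicon keeps the exact-form search available as the default, which matches the behaviour described above.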

Developers

The initial framework of the morphemic dictionary for the Main corpus was developed by E. A. Grishina, I. B. Itkin, O. N. Lyashevskaya, and M. G. Tagabileva; the dictionary was subsequently refined by O. N. Lyashevskaya, E. V. Kashkin, and D. V. Sichinava. The neural network algorithm for analyzing out-of-vocabulary words was developed by D. A. Morozov and T. A. Garipov in collaboration with A. V. Glazkova.

We are thankful to Marina Litvinova for her expertise in developing morpheme analysis for the Educational corpus.

References

Е. Гришина, И. Иткин, О. Ляшевская, М. Тагабилева. О задачах и методах словообразовательной разметки в корпусе текстов // Полярный вестник (Тромсё), 2009, № 12, с. 5–25

Morozov D., Garipov T., Lyashevskaya O., Savchuk S., Iomdin B., & Glazkova A. (2024). Automatic Morpheme Segmentation for Russian: Can an Algorithm Replace Experts? Journal of Language and Education, 10(4), 71-84. https://doi.org/10.17323/jle.2024.22237

Dmitry Morozov, Lizaveta Astapenka, Anna Glazkova, Timur Garipov, and Olga Lyashevskaya. 2025. BERT-like Models for Slavic Morpheme Segmentation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6795–6815, Vienna, Austria. Association for Computational Linguistics. https://doi.org/10.18653/v1/2025.acl-long.337

Updated on 02.03.2026