Derivational / morphemic structure
The Main corpus of the RNC is available for searching by derivational / morphemic structure of a word.
In the Word at a glance service, the morphemic structure of each word is visualized: prefixes, roots, suffixes and endings are highlighted with the graphic symbols used in Russian school teaching. This analysis is available in the Main and Educational corpora.
Structures are given only for lexemes in their dictionary form (за-щищ-а-ть-ся), not for inflected word forms such as защищающимися.
Dictionary-based and automatic annotation
The word structure annotation in the Main corpus is based on a morphemic dictionary specially developed for the corpus, which provides analyses for 75,000 lemmas (310,000 non-unique morphemes) as of May 2023. The markup of morphemes in the Educational corpus is based on Alexander Tikhonov's Morpheme-Orthographic Dictionary (2002), which covers about 100,000 lexemes. In the Educational corpus, only content words and common nouns are annotated, whereas in the Main corpus morphemic structures are also provided for function words and proper names. For each word, the annotation gives a list of morphemes, their type (prefix, root, interfix, suffix, ending or postfix) and their linear position in the word.
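The per-word record described above (a list of morphemes, each with a type and a linear position) can be pictured with a small sketch. This is only an illustration, not the corpus's actual storage format: the names Morpheme and parse_segmented are hypothetical, positions are assumed to be counted from 1, and the type labels given to за-щищ-а-ть-ся are illustrative assumptions rather than the dictionary's own tags.

```python
from dataclasses import dataclass

# The six morpheme types named in the text.
MORPHEME_TYPES = {"prefix", "root", "interfix", "suffix", "ending", "postfix"}


@dataclass(frozen=True)
class Morpheme:
    """One morpheme of a lemma (hypothetical record layout)."""
    text: str
    kind: str      # one of MORPHEME_TYPES
    position: int  # linear position in the word, assumed to start at 1


def parse_segmented(segmented: str, kinds: list[str]) -> list[Morpheme]:
    """Build a morpheme list from a hyphen-segmented lemma and its type labels."""
    parts = segmented.split("-")
    if len(parts) != len(kinds):
        raise ValueError("one type label is required per morpheme")
    for kind in kinds:
        if kind not in MORPHEME_TYPES:
            raise ValueError(f"unknown morpheme type: {kind}")
    return [
        Morpheme(text, kind, position)
        for position, (text, kind) in enumerate(zip(parts, kinds), start=1)
    ]


# Type labels below are assumptions made for the sake of the example.
entry = parse_segmented(
    "за-щищ-а-ть-ся",
    ["prefix", "root", "suffix", "ending", "postfix"],
)
for morpheme in entry:
    print(morpheme.position, morpheme.kind, morpheme.text)
```

Keeping the position explicit in each record, rather than relying on list order alone, mirrors the fact that position is a separate search parameter.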
In the Word at a glance service within the Main corpus, automatic annotation is added for lemmas that are absent from the morphemic dictionary, including some fairly frequent lexical items. For example, the word гарантировать is not included in the morphemic dictionary, so its structure (гарант-ирова-ть) is predicted by the algorithm. Automatic morphemic parsing is generated by an algorithm based on an ensemble of convolutional neural networks. The architecture is based on the model proposed by A. Sorokin and A. Kravtsova. Such analyses are tagged with a special attribute, "generated by NeuroRNC". In the current version of the Educational corpus, the morphemic structure is determined only for words included in Tikhonov's dictionary. However, there are plans to annotate the morphemic structure of all content words using a neural network algorithm.
Errors in automatic annotation are always possible. Please report errors using the "Rate" button. Note that the morphemic structuring of words in the Main corpus may differ from what you are accustomed to (see "Principles of annotation").
The word-by-word morphemic annotation searchable within the Main corpus does not yet use the neural network mechanism. It also uses an earlier version of the morphemic dictionary.
Principles of annotation
The morphemic dictionary of the Main corpus owes much to the approach of the Morphemic dictionary of Russian by Ariadna Kuznetsova and Tatiana Efremova [А. И. Кузнецова и Т. Ф. Ефремова. Словарь морфем русского языка. М., 1986]. Their method favors fine-grained, though not maximal, splitting into morphemes and relies on correspondences with other words of similar structure. The morphemic analysis in the corpus does not always coincide with the one accepted in Russian school education. For example, the word улыбаться is assigned the root -лыб-, as its structure is parallel to that of other verbs with у- (cf. у-смех-а-ть-ся). Some words are analyzed etymologically (на-сек-ом-ое, вос-точ-н-ый). Borrowings are split into morphemes (e.g. ре-волюц-и-я, квит-анци-я) if they have semantic parallels to other borrowings with a comparable structure (cf. э-волюц-и-я, рас-квит-а-ть-ся). Morphemic structures are also given for function words, proper names and words derived from them.
In the Educational corpus, a more traditional, school-oriented approach is taken. The underlying dictionary here is Alexander Tikhonov's Morpheme-Orthographic Dictionary (2002). Words are split into larger chunks (улыб-а-ть-ся, насеком-ое, восточ-н-ый), especially borrowings (революци-я, квитанци-я). Morphemic structures are given only for content words, excluding proper names.
Search
The morphemic search is supported only in the Main corpus. It is specified in the Morphemic structure section within the Lemmas and tags search. By default this feature is not active but it may be added via the Add condition button.
The user may specify one or more parameters: the morpheme itself, the type of the morpheme and its linear position. For example, the root бав in the third position yields words like вдобавок or позабавить.
The Including alternants parameter extends the search to all allomorphs of a given morpheme. Without this parameter, the root -ук- is found only in the word наука, whereas with alternants the search also yields учить, ученый, etc.
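How the three search parameters and the Including alternants option combine can be sketched as follows. This is a toy model under stated assumptions, not the corpus's search engine: LEXICON, ALTERNANTS, allomorphs and search are hypothetical names, the four segmentations and their type labels are assumed for illustration, and positions are counted from 1.

```python
from typing import Optional

# Hypothetical mini-dictionary: lemma -> list of (morpheme, type).
# The segmentations and type labels are illustrative assumptions.
LEXICON = {
    "вдобавок": [("в", "prefix"), ("до", "prefix"), ("бав", "root"), ("ок", "suffix")],
    "позабавить": [("по", "prefix"), ("за", "prefix"), ("бав", "root"),
                   ("и", "suffix"), ("ть", "ending")],
    "наука": [("на", "prefix"), ("ук", "root"), ("а", "ending")],
    "учить": [("уч", "root"), ("и", "suffix"), ("ть", "ending")],
}

# Hypothetical groups of allomorphs (alternants of the same morpheme).
ALTERNANTS = [{"ук", "уч"}]


def allomorphs(morpheme: str) -> set[str]:
    """Return the alternant group containing the morpheme, or just the morpheme."""
    for group in ALTERNANTS:
        if morpheme in group:
            return set(group)
    return {morpheme}


def search(morpheme: Optional[str] = None,
           kind: Optional[str] = None,
           position: Optional[int] = None,
           including_alternants: bool = False) -> list[str]:
    """Return lemmas that have a morpheme satisfying every condition given."""
    targets = None
    if morpheme is not None:
        targets = allomorphs(morpheme) if including_alternants else {morpheme}
    hits = []
    for lemma, morphemes in LEXICON.items():
        for pos, (text, morpheme_kind) in enumerate(morphemes, start=1):
            if targets is not None and text not in targets:
                continue
            if kind is not None and morpheme_kind != kind:
                continue
            if position is not None and pos != position:
                continue
            hits.append(lemma)
            break  # one matching morpheme is enough for this lemma
    return hits


print(search("бав", kind="root", position=3))
print(search("ук", kind="root"))
print(search("ук", kind="root", including_alternants=True))
```

All conditions must hold for the same morpheme within a word, which is why the checks sit inside a single loop over the morphemes of each lemma.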
Developers
The initial framework of the morphemic dictionary for the Main corpus was developed by Elena Grishina, Ilya Itkin, Olga Lyashevskaya and Maria Tagabileva. Later the dictionary was updated by Olga Lyashevskaya, Egor Kashkin, and Dmitri Sitchinava.
The neural network analysis algorithm was developed by Dmitry Morozov and Timur Garipov, following the architecture proposed by Andrey Sorokin.
We are thankful to Marina Litvinova for her expertise in developing morpheme analysis for the Educational corpus.
References
Е. Гришина, И. Иткин, О. Ляшевская, М. Тагабилева. О задачах и методах словообразовательной разметки в корпусе текстов // Полярный вестник (Тромсё), 2009, № 12, с. 5–25
Sorokin, A., Kravtsova, A. Deep Convolutional Networks for Supervised Morpheme Segmentation of Russian Language. In: Ustalov, D., Filchenkov, A., Pivovarova, L., Žižka, J. (eds) Artificial Intelligence and Natural Language. AINL 2018. Communications in Computer and Information Science, vol 930. Springer, Cham. https://doi.org/10.1007/978-3-030-01204-5_1
T. Garipov, D. Morozov and A. Glazkova, "Generalization Ability of CNN-Based Morpheme Segmentation," 2023 Ivannikov Ispras Open Conference (ISPRAS), Moscow, Russian Federation, 2023, pp. 58-62, doi: 10.1109/ISPRAS60948.2023.10508171