Dictionary-based and automatic annotation
The word structure annotation in the Main corpus is based on a morphemic dictionary developed specially for the corpus, which provides analyses for 75,000 lemmas (310,000 non-unique morphemes) as of May 2023. The markup of morphemes in the Educational corpus is based on Alexander Tikhonov's Morpheme-Orthographic Dictionary (2002), which covers about 100,000 lexemes. In the Educational corpus, only content words and common nouns are annotated, whereas in the Main corpus morphemic structures are also provided for function words and proper names. For each word, the annotation gives a list of morphemes, the type of each (prefix, root, interfix, suffix, ending or postfix) and its linear position in the word.
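As an illustration only, a morphemic analysis of this kind (a list of morphemes, each with a type and a linear position) could be represented as in the sketch below. The class and function names are hypothetical and do not reflect the corpus's internal format. The segmentation гарант-ирова-ть is the one given on this page, but the type assigned to each morpheme here is an assumption made for the sake of the example.

```python
from dataclasses import dataclass
from enum import Enum


class MorphemeType(Enum):
    # The six morpheme types listed in the annotation description.
    PREFIX = "prefix"
    ROOT = "root"
    INTERFIX = "interfix"
    SUFFIX = "suffix"
    ENDING = "ending"
    POSTFIX = "postfix"


@dataclass(frozen=True)
class Morpheme:
    text: str
    type: MorphemeType
    position: int  # 1-based linear position in the word (assumed convention)


def segmented_form(morphemes):
    """Join morphemes with hyphens in order of their linear position."""
    ordered = sorted(morphemes, key=lambda m: m.position)
    return "-".join(m.text for m in ordered)


# Illustrative analysis; the morpheme types below are assumptions.
analysis = [
    Morpheme("гарант", MorphemeType.ROOT, 1),
    Morpheme("ирова", MorphemeType.SUFFIX, 2),
    Morpheme("ть", MorphemeType.SUFFIX, 3),
]

print(segmented_form(analysis))
```

Storing the position explicitly, rather than relying on list order, keeps the analysis unambiguous even if the morphemes are stored or retrieved out of order.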
In the Word at a glance service within the Main corpus, automatic annotation is added for lemmas that are absent from the morphemic dictionary, including some fairly frequent lexical items. For example, the word гарантировать is not included in the morphemic dictionary, so its structure (гарант-ирова-ть) is predicted by the algorithm. Automatic morphemic parsing is performed by an algorithm proposed by Alexey Sorokin, with a convolutional neural network at its core. Such analyses are tagged with the special attribute "generated by NeuroRNC". In the current version of the Educational corpus, the morphemic structure is determined only for words included in Tikhonov's dictionary. However, there are plans to mark the morphemic structure of all content words using a neural network algorithm.
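The dictionary-first, model-second logic described for the Word at a glance service can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the service's actual code: the function names, the toy dictionary entry and the stub predictor are all hypothetical, and the stub merely stands in for the convolutional neural network parser.

```python
from typing import Callable, Dict, NamedTuple


class Analysis(NamedTuple):
    lemma: str
    segmentation: str
    source: str


# Attribute text quoted from the page; the "dictionary" label is an assumption.
GENERATED_TAG = "generated by NeuroRNC"
DICTIONARY_TAG = "dictionary"


def analyse(lemma: str,
            dictionary: Dict[str, str],
            predict: Callable[[str], str]) -> Analysis:
    """Return the dictionary analysis if present, else a predicted one."""
    segmentation = dictionary.get(lemma)
    if segmentation is not None:
        return Analysis(lemma, segmentation, DICTIONARY_TAG)
    # Lemma is absent from the dictionary: fall back to the predictor
    # and mark the result as automatically generated.
    return Analysis(lemma, predict(lemma), GENERATED_TAG)


# Toy data for illustration only; not taken from the real dictionary.
toy_dictionary = {"водный": "вод-н-ый"}


def stub_predict(lemma: str) -> str:
    """Hypothetical stand-in for the neural parser (returns canned output)."""
    canned = {"гарантировать": "гарант-ирова-ть"}
    return canned.get(lemma, lemma)


print(analyse("водный", toy_dictionary, stub_predict))
print(analyse("гарантировать", toy_dictionary, stub_predict))
```

Keeping the source label alongside each analysis lets a reader tell curated dictionary entries apart from predicted ones, which may contain errors.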
Errors of automatic annotation are always possible. Please report errors using the "Rate" button. Note that the morphemic structuring of words in the Main corpus may differ from what you are accustomed to (see "Annotation principles").
The word-by-word morphemic annotation that is searchable within the Main corpus does not yet use the neural network mechanism. It also relies on an earlier version of the morphemic dictionary.