All texts in the main corpus are provided with metatextual, morphological, morphemic, syntactic, and semantic annotation.
The metatextual annotation of the main corpus records the title of the text; the date when it was written; the name, year of birth, and gender of the author (if known); the place and date of publication; the source from which the text was taken; its functional sphere; the genre and type of text; the chronotope (for fiction and memoirs); the target audience (for example, its age); the orthography; and the type of morphological annotation. A subcorpus can be selected by any of these parameters.
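Selecting a subcorpus by metadata can be pictured as filtering text records on several fields at once. The sketch below is only an illustration of that idea: the `TextMeta` class, its field names, the sample records, and `select_subcorpus` are all hypothetical and do not reflect the corpus engine's actual data model or interface.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TextMeta:
    # Hypothetical subset of the metatextual fields listed above.
    title: str
    created: int                  # year the text was written
    author_gender: Optional[str]  # None if unknown
    sphere: str                   # functional sphere
    genre: str


def select_subcorpus(texts, **criteria):
    """Return the texts whose metadata satisfy every criterion.

    A criterion is either a plain value (compared for equality)
    or a callable predicate applied to the field value.
    """
    def satisfies(text):
        for field, wanted in criteria.items():
            value = getattr(text, field)
            if callable(wanted):
                if not wanted(value):
                    return False
            elif value != wanted:
                return False
        return True

    return [t for t in texts if satisfies(t)]


# Placeholder records, invented purely for the example.
texts = [
    TextMeta("Text A", 1875, "m", "fiction", "novel"),
    TextMeta("Text B", 1923, "f", "fiction", "short story"),
    TextMeta("Text C", 1931, None, "journalism", "essay"),
]

fiction_1900_1950 = select_subcorpus(
    texts, sphere="fiction", created=lambda year: 1900 <= year < 1950
)
print([t.title for t in fiction_1900_1950])
```

Allowing a predicate as well as a plain value keeps range queries (such as a span of years) and exact matches (such as a genre) in the same call.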
Morphological tagging of Russian texts is carried out by dedicated programs for automatic part-of-speech tagging, grammatical analysis, and lemmatization. Most texts are processed in parallel by two systems, MyStem and Rubic, both specially adapted to texts from different fields, genres, and periods.
The Russian-language MyStem model is based on an electronic grammar dictionary and can generate hypotheses for out-of-vocabulary words. In addition, experts compiled special word-to-lemma mapping lists and rules to improve the annotation of obsolete inflected forms, colloquial variants, and some other word forms that occur frequently in the corpus. Each word form receives every analysis that the system provides, regardless of context. Use the “All tags” mode to search through the MyStem annotations.
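Because every analysis is kept, a query over context-independent annotations plausibly matches a word form as soon as any one of its analyses fits. The sketch below illustrates that reading only: the `ANALYSES` table, the tag names, and `matches_any_analysis` are hypothetical and are not MyStem output or the corpus search interface.

```python
# Hypothetical context-independent analyses; tag names are illustrative.
ANALYSES = {
    "стали": [
        {"lemma": "стать", "pos": "V"},  # verb reading
        {"lemma": "сталь", "pos": "S"},  # noun reading
    ],
    "книгу": [
        {"lemma": "книга", "pos": "S"},
    ],
}


def matches_any_analysis(word_form, pos=None, lemma=None):
    """True if at least one stored analysis satisfies the query."""
    return any(
        (pos is None or analysis["pos"] == pos)
        and (lemma is None or analysis["lemma"] == lemma)
        for analysis in ANALYSES.get(word_form, [])
    )


# The ambiguous form is found by both a noun query and a verb query.
print(matches_any_analysis("стали", pos="S"))
print(matches_any_analysis("стали", pos="V"))
```

Under this assumption, searches over unresolved annotations favour recall: an ambiguous form is returned for each of its possible readings.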
The neural network model Rubic is trained on a representative set of manually annotated texts. For each word form it predicts a single morphological analysis, the one most likely in context (see the “Preferred tags” search mode). Rubic also automatically builds rules that map word forms to lemmas and uses a dictionary compiled by experts as a filter to choose the best match from several hypotheses predicted by the network. If there is no such match, the word form is assigned the lemma predicted as most likely in the given context. Additional rules are used to correct erroneous analyses for combinations of a lemma and a part of speech that occur 40 times or more in the corpus. This makes it possible, in particular, to provide correct analyses for frequent archaic, colloquial, and orthographically distorted forms.
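The dictionary filter described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Rubic's actual implementation: the function `choose_lemma`, the scores, and the tiny dictionary are hypothetical, and the text does not specify how hypotheses are ranked internally.

```python
def choose_lemma(hypotheses, dictionary):
    """Pick a lemma from scored hypotheses using a dictionary as a filter.

    hypotheses: list of (lemma, score) pairs predicted for one word form.
    dictionary: set of lemmas accepted by the expert-compiled dictionary.
    """
    if not hypotheses:
        raise ValueError("at least one hypothesis is required")

    # Rank hypotheses from most to least likely.
    ranked = sorted(hypotheses, key=lambda h: h[1], reverse=True)

    # Prefer the most likely hypothesis that the dictionary confirms.
    for lemma, _score in ranked:
        if lemma in dictionary:
            return lemma

    # No dictionary match: fall back to the most likely prediction.
    return ranked[0][0]


dictionary = {"стекло", "книга"}  # invented sample entries

# The dictionary overrides a higher-scored but unattested lemma.
print(choose_lemma([("стекл", 0.5), ("стекло", 0.4)], dictionary))

# Out-of-vocabulary word: the top prediction is kept.
print(choose_lemma([("гуглить", 0.7), ("гуглит", 0.3)], dictionary))
```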
Texts written in the old orthography are automatically analyzed by both systems; lemmas in such texts are given in the new orthography.
In a small subcorpus of the main corpus (6 million words), manual morphological disambiguation was carried out. Texts were annotated by the taggers DiaLing/AOT and MyStem, after which a team of experts selected the correct combination of part-of-speech and grammatical tags and lemma taking into account the word context and additionally improved the analysis.
The Rubic neural network model also performs dependency parsing of the corpus, predicting one syntactic analysis per sentence. The model builds a dependency tree for each sentence, in which every word is connected by an edge to its syntactic head, except for the root of the tree, the main word of the sentence (usually a verbal predicate). The edges are labeled with the names of syntactic relations. Based on this dependency tree, additional heuristics retrieve the word spans that correspond to constituents: clauses and groups (for example, main and subordinate clauses, or noun phrases). At the moment, it is possible to search for the syntactic relation labels assigned to dependent words as well as for basic constituent types. In the future, we plan to add a full-fledged syntactic search module to the corpus engine.
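One simple way to derive a word span from a dependency tree is to take all tokens dominated by a given head. The sketch below shows that generic technique only; the actual heuristics used in the corpus are not described here, and the head-index encoding, the example parse, and the function `subtree_span` are assumptions made for illustration.

```python
def subtree_span(heads, node):
    """Return (first, last) token indices of the subtree rooted at `node`.

    heads[i] is the index of the syntactic head of token i,
    or -1 if token i is the root of the sentence.
    """
    # Build a head -> dependents map.
    children = {}
    for index, head in enumerate(heads):
        children.setdefault(head, []).append(index)

    # Collect every token dominated by `node`, including `node` itself.
    covered = []
    stack = [node]
    while stack:
        current = stack.pop()
        covered.append(current)
        stack.extend(children.get(current, []))

    return min(covered), max(covered)


# Hand-made example parse ("The boy reads an interesting book").
tokens = ["Мальчик", "читает", "интересную", "книгу"]
heads = [1, -1, 3, 1]  # "читает" is the root

root = heads.index(-1)
start, end = subtree_span(heads, 3)  # subtree headed by "книгу"
print(tokens[root])
print(" ".join(tokens[start:end + 1]))
```

For non-projective trees the span from the first to the last dominated token may include words outside the subtree, so a practical heuristic would need extra checks.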
Every token in the main corpus is annotated with the morphemic structure of the word. The annotation is produced on the basis of a morphemic dictionary and the neural network model NeuroRNC. It is possible to search for specific morphemes and for morpheme types (roots, prefixes, suffixes, inflections), taking morphemic alternations into account.
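A morpheme search that respects alternations can be pictured as mapping each variant of a morpheme to a canonical form before comparison. The sketch below is a toy illustration: the segmentations, the `ALTERNATIONS` table, and `find_words` are hypothetical and are not taken from the morphemic dictionary or from NeuroRNC.

```python
# Hypothetical segmentations: word -> list of (morpheme, type).
SEGMENTED = {
    "рука": [("рук", "root"), ("а", "inflection")],
    "ручка": [("руч", "root"), ("к", "suffix"), ("а", "inflection")],
    "книга": [("книг", "root"), ("а", "inflection")],
}

# Hypothetical alternation table: (type, variant) -> canonical form.
ALTERNATIONS = {("root", "руч"): "рук"}


def find_words(morpheme, kind, with_alternations=True):
    """Return words containing `morpheme` of the given type."""
    hits = []
    for word, morphs in SEGMENTED.items():
        for form, form_kind in morphs:
            if form_kind != kind:
                continue
            # Optionally normalize alternating variants before comparing.
            canonical = form
            if with_alternations:
                canonical = ALTERNATIONS.get((form_kind, form), form)
            if canonical == morpheme:
                hits.append(word)
                break
    return hits


print(find_words("рук", "root"))
print(find_words("рук", "root", with_alternations=False))
```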
The main corpus also features automatic semantic annotation based on a set of discrete semantic characteristics assigned to words in the dictionary.