The Main Corpus

Features of the corpus

The main corpus, which includes texts representing standard Russian, can be subdivided into two parts, each of which has its distinguishing features: modern written texts (from the 1950s to the present day) and earlier texts (from the middle of the 18th to the middle of the 20th century). By default, the search is carried out in both sub-groups. It is possible to choose one of them and add search parameters on the “customize your corpus” page.

Every text included in the main corpus is subject to meta tagging and morphological tagging. Morphological tagging is carried out by software for automated morphological analysis. In a small part of the main corpus (currently around 6 million tokens; this figure is set to increase with time), homonymy is disambiguated by hand and the results of automated morphological analysis are corrected if needed. This part, called the disambiguated subcorpus, serves as a testing ground for a variety of search algorithms and programs of morphological analysis and natural language processing. It can also be used for research in modern Russian morphology that requires particular accuracy. Examples found in this subcorpus are annotated as “disambiguated” (“омонимия снята”).

Modern written texts

The representative corpus of morphologically tagged modern texts is the core of the main corpus. It includes various types of texts representing modern standard (written) Russian:

  • Modern fiction of various genres
  • Modern drama
  • Memoirs and biographies
  • Journalism and literary criticism
  • Academic, popular science and educational texts
  • Religious and philosophical texts
  • Technical texts
  • Business and law texts
  • Day-to-day life texts, including texts not intended for publication (letters, diaries, etc.)

Texts are represented in proportion to their share in real-life usage. For example, the share of fiction (including drama and memoirs) does not exceed 40%.

The book, magazine and newspaper texts included in the Corpus usually come from proof-read electronic versions supplied by their respective publishers, and the texts are used with the publishers' permission.

The search can be limited to modern texts in the Date of creation field of the Customize your corpus page.

Mid-18th to mid-20th century texts

Texts from the middle of the 18th century to the middle of the 20th century included in the Corpus also represent various genres (fiction, scientific texts, journalism, letters), but due to the limited availability of such texts in electronic form or in modern reprints, the proportion of fiction for this period is much higher than for the main corpus. Pre-1918 and émigré texts are given in both historical and modern spelling. Peculiarities of their original orthography preserved in modern academic re-editions are also preserved in the Corpus.

Annotation


All texts included in the main corpus are provided with metadata, morphological, morphemic, syntactic, and semantic annotations.

The metatextual annotation of the main corpus includes information about the title of the text, the date when it was written, the name, year of birth and gender of the author (if known), the place and date of publication, the source from which the text is given, its functional sphere, the genre and type of text, the chronotope of fiction or memoirs, the specificity of the audience (such as age), orthography and the type of morphological annotation. It is possible to select a subcorpus according to all these parameters.

Morphological tagging for Russian is carried out using special programs for automatic part-of-speech and grammatical analysis and lemmatization. Most of the texts are processed in parallel using two systems, MyStem and Rubic, specially adapted to work with texts of different fields, genres, and time of creation.

The Russian-language MyStem model is based on an electronic grammar dictionary and can generate hypotheses for out-of-vocabulary words. Additionally, special word-lemma mapping lists and rules were compiled by experts to improve annotations for obsolete inflection forms, colloquial variants and some other word forms that occur frequently in the corpus. Each word form is assigned as many annotations as the system provides, regardless of context. Use the “All tags” mode to search through the MyStem annotations.
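The idea of context-independent annotation can be illustrated with a minimal sketch, in which a query matches a word form if any one of its stored analyses satisfies it. The names `Analysis`, `ANALYSES` and `matches_any`, as well as the tag labels, are hypothetical and chosen for illustration only; they are not the corpus's actual tag set or search implementation.

```python
from typing import Dict, FrozenSet, Iterable, List, NamedTuple


class Analysis(NamedTuple):
    lemma: str
    pos: str               # part-of-speech label (illustrative, not the corpus tag set)
    grams: FrozenSet[str]  # grammatical features (illustrative labels)


# Hypothetical store: every analysis of a word form is kept, whatever the context.
ANALYSES: Dict[str, List[Analysis]] = {
    "стекло": [
        Analysis("стекло", "S", frozenset({"n", "sg", "nom"})),
        Analysis("стекло", "S", frozenset({"n", "sg", "acc"})),
        Analysis("стечь", "V", frozenset({"praet", "sg", "n"})),
    ],
}


def matches_any(form: str, pos: str, grams: Iterable[str] = ()) -> bool:
    """True if at least one stored analysis has this POS and all requested features."""
    wanted = frozenset(grams)
    return any(a.pos == pos and wanted <= a.grams for a in ANALYSES.get(form, []))


print(matches_any("стекло", "V"))           # True: the verbal reading is stored
print(matches_any("стекло", "S", {"acc"}))  # True: one nominal reading is accusative
print(matches_any("стекло", "A"))           # False: no adjectival reading is stored
```

Under this assumption, a search for verbs would return the form "стекло" even in sentences where it is a noun, which is the trade-off of searching over all analyses.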

The neural network model Rubic is trained on a representative set of manually annotated texts. For each word form, it predicts a single morphological analysis, the one that is most likely in context (see the “Preferred tags” search mode). Rubic also automatically builds rules to map word forms to lemmas and uses a dictionary compiled by experts as a filter to choose the best match from several hypotheses predicted by the network. If there is no such match, the lemma predicted as most likely in the given context is assigned to the word form. Additional rules are used to correct erroneous analyses for combinations of a lemma and a part of speech that occur 40 times or more in the corpus. This, in particular, makes it possible to provide correct analyses for frequent archaic, colloquial and orthographically distorted forms.
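The dictionary-filter step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not Rubic's actual code: the function `choose_lemma`, the (lemma, probability) representation of hypotheses, and the sample dictionary are all hypothetical.

```python
from typing import List, Set, Tuple

# Hypothetical representation: (lemma, probability) pairs scored by the network.
Hypothesis = Tuple[str, float]


def choose_lemma(hypotheses: List[Hypothesis], dictionary: Set[str]) -> str:
    """Return the most probable hypothesis found in the dictionary;
    if none is found, fall back to the most probable hypothesis overall."""
    if not hypotheses:
        raise ValueError("at least one hypothesis is required")
    ranked = sorted(hypotheses, key=lambda h: h[1], reverse=True)
    for lemma, _prob in ranked:
        if lemma in dictionary:
            return lemma
    return ranked[0][0]


# Toy expert dictionary (assumed contents).
dictionary = {"стекло", "стечь"}

# The top hypothesis is not a dictionary word, so the filter picks the second one.
print(choose_lemma([("стекл", 0.5), ("стекло", 0.3)], dictionary))   # стекло

# No hypothesis is in the dictionary, so the most probable one is kept.
print(choose_lemma([("гуглить", 0.7), ("гугл", 0.3)], dictionary))   # гуглить
```

The fallback branch is what lets out-of-dictionary words still receive a lemma instead of being left unanalysed.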

Texts written in the old orthography are automatically analyzed by both systems; lemmas in such texts are given in the new orthography.

In a small subcorpus of the main corpus (6 million words), manual morphological disambiguation was carried out. Texts were annotated by the taggers DiaLing/AOT and MyStem, after which a team of experts selected the correct combination of part-of-speech tag, grammatical tags and lemma, taking the word context into account, and additionally improved the analysis.

The Rubic neural network model also performs dependency parsing of the corpus, predicting one syntactic analysis per sentence. The model builds a dependency tree for a sentence, in which each word is connected by an edge to its syntactic head. The only exception is the root of the tree, which is the main word of the sentence (usually a verbal predicate). The edges are labeled with the names of syntactic relations. Based on this dependency tree, additional heuristics retrieve the word spans corresponding to constituents, that is, clauses and groups (for example, main and subordinate clauses or noun phrases). At the moment, it is possible to search for the syntactic relation labels assigned to dependent words as well as for basic constituent types. In the future, we plan to add a full-fledged syntactic search module to the corpus engine.
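A dependency tree of this kind, and one simple way of deriving a word span from it, can be sketched as below. The `Token` structure, the convention that head 0 marks the root, the relation labels and the `subtree_span` heuristic are all assumptions made for illustration; the corpus's own label set and span heuristics may differ.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Token:
    idx: int   # 1-based position in the sentence
    form: str
    head: int  # idx of the syntactic head; 0 marks the root (assumed convention)
    rel: str   # label of the edge linking the token to its head (illustrative labels)


def children_map(tokens: List[Token]) -> Dict[int, List[int]]:
    """Map each head index to the indices of its direct dependents."""
    kids: Dict[int, List[int]] = {}
    for tok in tokens:
        kids.setdefault(tok.head, []).append(tok.idx)
    return kids


def subtree_span(idx: int, kids: Dict[int, List[int]]) -> Tuple[int, int]:
    """Return (first, last) positions covered by a token and all its descendants.
    The result is a contiguous span only if the subtree is projective."""
    nodes, stack = [idx], [idx]
    while stack:
        for child in kids.get(stack.pop(), []):
            nodes.append(child)
            stack.append(child)
    return min(nodes), max(nodes)


# "Мама мыла новую раму" with a hand-built analysis (not corpus output).
tokens = [
    Token(1, "Мама", 2, "nsubj"),
    Token(2, "мыла", 0, "root"),
    Token(3, "новую", 4, "amod"),
    Token(4, "раму", 2, "obj"),
]
kids = children_map(tokens)
root = next(t for t in tokens if t.head == 0)

print(root.form)                      # мыла
print(subtree_span(4, kids))          # (3, 4): the noun phrase "новую раму"
print(subtree_span(root.idx, kids))   # (1, 4): the whole clause
```

Taking the subtree of a noun yields a candidate noun phrase, and taking the subtree of a predicate yields a candidate clause, which is one plausible reading of how spans can be recovered from head links alone.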

The main corpus is annotated token-by-token with regard to the words' morphemic structures. The annotation is made on the basis of a morphemic dictionary and the neural network mechanism NeuroRNC. It is possible to search for morphemes and their individual types (roots, prefixes, suffixes, inflexions), taking into account the alternations.

The main corpus also features automatic semantic annotation based on a set of discrete semantic characteristics attributed in the dictionary.

Publications

Check out the list of scientific publications on the Main corpus via the link: https://ruscorpora.ru/s/bYxnW. To find other types of publications related to the corpus, use the filters in the "Publications" section.

Updated on 22.07.2024