Offline versions of the RNC
All data published under https://www.ruscorpora.ru/ are available exclusively for non-commercial use for research and educational purposes (in accordance with Article 1274 of the Civil Code of the Russian Federation). They are not intended to be read or viewed, copied, or used in any other way as texts proper: they can only be used as sources of examples (citations) illustrating a particular linguistic phenomenon. When quoting examples obtained from the RNC, please cite the RNC as the source and include the name(s) of the author(s) of the text and its title.
To obtain an offline version of a corpus or a diachronic dataset from the RNC, the corresponding license agreement must be signed, scanned and sent to firstname.lastname@example.org. In your application, please state the purpose for which you intend to use the data.
Offline disambiguated version of the RNC
This corpus contains texts in modern Russian (written after 1950, mostly in the late 20th or early 21st century), one million words in total. Each word form is annotated with a set of manually verified morphological features, which includes the lemma (initial form) and a set of grammatical characteristics. In some cases, several alternative interpretations are given. The corpus is balanced: fiction, academic and journalistic texts, transcripts of oral speech and blogs are represented in roughly equal proportions.
About 1 million words.
Offline version of the Deeply Annotated SynTagRus Corpus
SynTagRus is a corpus of Russian texts supplied with several types of annotation. The corpus includes fiction, popular science and news articles. The texts are stored as XML files. The structure of each file is defined by a hierarchically ordered set of XML elements. The database includes about 1300 texts (almost 1.5 million words, over 100 000 sentences). The morphological annotation includes the lemma (initial form) and the grammatical characteristics of each word. The most important feature is the syntactic annotation of each sentence, represented as a tree structure with words as nodes and syntactic relations as edges. The lexical-functional annotation describes phrases in terms of the lexical functions of the Meaning-Text linguistic model. About 40 000 sentences in the corpus contain such phrases, which amounts to approximately 28% of all sentences. The lexical-semantic annotation addresses polysemy: the senses of all occurrences of 3000 polysemous words appearing in the corpus are distinguished. All texts contain meta-information (author, title, source, etc.) in accordance with the meta-annotation format of the main RNC corpus.
About 1.5 million words.
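The tree representation described above (words as nodes, syntactic relations as edges) can be illustrated with a short Python sketch. The page does not document the XML schema, so the element and attribute names used here (S, W, ID, DOM, LEMMA, FEAT, LINK), the "_root" marker and the relation labels are assumptions chosen for illustration only; check them against the files you actually receive.

```python
import xml.etree.ElementTree as ET

# Hypothetical sentence fragment. Element names, attribute names, the
# "_root" marker and the relation labels are assumptions, not the
# documented schema of the corpus files.
SAMPLE = """
<S ID="1">
  <W ID="1" DOM="2" LEMMA="МАЛЬЧИК" FEAT="S" LINK="predic">Мальчик</W>
  <W ID="2" DOM="_root" LEMMA="ЧИТАТЬ" FEAT="V">читает</W>
  <W ID="3" DOM="2" LEMMA="КНИГА" FEAT="S" LINK="1-compl">книгу</W>
</S>
"""


def read_tree(sentence_xml):
    """Return (root_id, words, edges) for one sentence element.

    words maps a word ID to its form, lemma and grammatical features;
    edges is a list of (head_id, dependent_id, relation) triples.
    """
    sentence = ET.fromstring(sentence_xml.strip())
    words = {}
    edges = []
    root_id = None
    for w in sentence.iter("W"):
        wid = int(w.get("ID"))
        words[wid] = {
            "form": (w.text or "").strip(),
            "lemma": w.get("LEMMA"),
            "feat": w.get("FEAT"),
        }
        head = w.get("DOM")
        if head == "_root":
            root_id = wid  # the root word has no incoming edge
        else:
            edges.append((int(head), wid, w.get("LINK")))
    return root_id, words, edges


root_id, words, edges = read_tree(SAMPLE)
for head, dep, rel in edges:
    print(f"{words[head]['form']} --{rel}--> {words[dep]['form']}")
```

Storing edges as (head, dependent, relation) triples keeps the sketch independent of any particular graph library; the triples can be loaded into one later if needed.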
Diachronic datasets of the RNC
The diachronic datasets cover three periods (1700–1916, 1918–1991, and 1992–2016), which roughly correspond to three historical stages in the development of society and language in the modern era: pre-Soviet, Soviet (including emigrant texts), and post-Soviet.
Each of these periods is represented by a large UTF-8 encoded text file containing the original sentences of the texts in random order. The sentence order is randomized, at the cost of text integrity, because of copyright restrictions. The texts carry neither morphological nor metatext annotation.
The total volume of the datasets is 250 million words.
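Because each period comes as a single large plain-text file, it is usually more practical to stream the file than to load it whole. The following Python sketch counts word frequencies line by line. It assumes one sentence per line and uses a simple regular-expression tokenizer; both are assumptions, since the page does not specify the file layout, and the demo file is a made-up two-sentence sample.

```python
import re
import tempfile
from collections import Counter
from pathlib import Path

# Simple tokenizer: runs of word characters (covers Cyrillic letters).
WORD = re.compile(r"\w+")


def word_frequencies(path):
    """Stream a UTF-8 text file and count lowercased word tokens."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:  # reads one line at a time, not the whole file
            counts.update(token.lower() for token in WORD.findall(line))
    return counts


# Demo on a tiny invented sample; the layout (one sentence per line)
# is an assumption about the dataset files.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_text("Он пришёл домой.\nДомой он не пришёл.\n", encoding="utf-8")
    freq = word_frequencies(sample)

print(freq.most_common(3))
```

Since the sentences are shuffled, statistics that do not depend on context beyond the sentence (word or n-gram frequencies per period, for example) are the natural use for these files.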
Multilingual dataset of the RNC
The multilingual dataset includes the entire multilingual subcorpus of the RNC Parallel Corpus as of autumn 2021. The dataset contains 12 works of fiction and their translations (each original text is supplied with 10 to 25 translations into a variety of languages). It includes classics of world literature and modern bestsellers, such as The Master and Margarita, The Little Prince, or The Da Vinci Code. The dataset is a single UTF-8 encoded JSON file containing sentence tuples: aligned paragraphs in different languages collected from the original texts and their translations. For copyright reasons, the order of the paragraphs was randomized. The dataset has been stripped of linguistic annotation and meta-annotation. Languages are tagged with ISO 639-1 codes. A distinctive feature of the dataset is that it consists solely of fiction (narratives with dialogue), while most other multilingual datasets contain news articles, business documents or subtitles.
About 5 million words.
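As an illustration of working with aligned paragraphs tagged by ISO 639-1 codes, the Python sketch below extracts paragraph pairs for two languages. The page does not describe the internal layout of the JSON file, so the structure assumed here (a list of objects, each mapping a language code to one paragraph) is hypothetical, and the sample sentences are invented rather than taken from the dataset.

```python
import json

# Hypothetical layout: a list of tuples, each represented as an object
# mapping an ISO 639-1 code to a paragraph. The real file may differ.
SAMPLE = json.dumps(
    [
        {"ru": "Было утро.", "en": "It was morning.", "fr": "C'était le matin."},
        {"ru": "Он молчал.", "en": "He was silent."},
    ],
    ensure_ascii=False,
)


def aligned_pairs(records, src, tgt):
    """Yield (source, target) paragraphs for tuples that have both languages."""
    for record in records:
        if src in record and tgt in record:
            yield record[src], record[tgt]


records = json.loads(SAMPLE)
ru_fr = list(aligned_pairs(records, "ru", "fr"))
ru_en = list(aligned_pairs(records, "ru", "en"))
print(len(ru_fr), len(ru_en))
```

Skipping tuples that lack one of the two languages matters here because, according to the description above, each original has a different set of translations.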