Russian National Corpus

26.04.2024

A new section is now available on the Corpus website. It describes the RNC neural network models used for annotating words and texts within the Corpus.

Users have access to the following tools:

the tokenizer
vector space models for finding word associates, customized for 7 domains
models for morphemic annotation
models for annotating genre, topic, and type of text

The new section will be useful for everyone who is interested in natural language processing and wants to learn more about the machine learning technologies used in the RNC. Users can consult descriptions of the models or download them for their own use. Before downloading a model, please read the license agreement and accept its terms.
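As a rough illustration of what "searching word associates" in a vector space means, the sketch below ranks words by cosine similarity to a query word. It is a minimal sketch, not the RNC models themselves: the three-dimensional vectors in `TOY_VECTORS` and the helper names `cosine` and `associates` are hypothetical, and real models use vectors with hundreds of dimensions.

```python
import math

# Hypothetical toy vectors for illustration only; not taken from any RNC model.
TOY_VECTORS = {
    "мама": [0.9, 0.1, 0.0],
    "папа": [0.8, 0.2, 0.0],
    "корпус": [0.0, 0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def associates(word, vectors, k=3):
    """Return the k words whose vectors are closest to the vector of `word`."""
    target = vectors[word]
    scored = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

if __name__ == "__main__":
    # The nearest neighbour of "мама" in this toy space is "папа".
    print(associates("мама", TOY_VECTORS, k=1))
```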

26.04.2024

In April, the Old East Slavic corpus was considerably upgraded. It now features new types of search results, such as Frequency, Statistics, and n-grams. Using the Frequency feature, users can build frequency lists of tokens and constructions. For example, one can check which nouns are coordinated most often in the corpus of early medieval texts (‘Boris and Gleb’, ‘fear and trembling’, and others). The query results can be sorted by context. Frequency dictionaries are available when customizing a subcorpus, and they can be compared to the lexical frequencies of the whole corpus.

The arrival of new functionality expands the possibilities of using the corpus and automates routine processes that previously took considerable time.
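The idea behind frequency lists and n-grams can be sketched in a few lines. This is a minimal sketch on a made-up token list, not the Corpus's own implementation; the helper names `ngrams`, `frequency_list`, and `ipm` are hypothetical, and `ipm` (instances per million tokens) is just one common way to make subcorpus and whole-corpus frequencies comparable.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

def frequency_list(items, top=None):
    """Return (item, count) pairs sorted by descending count."""
    return Counter(items).most_common(top)

def ipm(count, total_tokens):
    """Relative frequency in instances per million tokens."""
    return count * 1_000_000 / total_tokens

if __name__ == "__main__":
    # Made-up sample text for illustration only.
    tokens = "fear and trembling and fear and trembling".split()
    print(frequency_list(tokens, top=1))             # [('and', 3)]
    print(frequency_list(ngrams(tokens, 3), top=1))  # [(('fear', 'and', 'trembling'), 2)]
```

Normalizing counts with something like `ipm` matters when a subcorpus is much smaller than the whole corpus, because raw counts are then not directly comparable.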

15.04.2024

We continue to roll out to other corpora the functionality already available in the advanced corpora, such as Main, Media, and Learning. An improved version of the “From 2 to 15” corpus is now available to users of the RNC. All the texts within the corpus feature resolved grammatical homonymy and syntactic annotation. Syntactic relations search and collocation search are now available, as well as new output types such as frequency, n-grams, and statistics.

The Word at a Glance function has been updated, and new types of sorting by context have been added.
In Word at a Glance you can see that the words мама 'mom' and папа 'dad' are used much more often in texts for children aged 7-8, while the words бабушка 'grandma' and дедушка 'grandpa' have an equal frequency rating for children aged 7-8 and for teenagers aged 14-15.

The bar next to each fragment, which indicates the age of readers who should be able to understand it, is now clickable. Clicking it shows the calculated classical readability indices: Flesch-Kincaid Index, Coleman-Liau Index, Automatic Readability Index, Simple Measure of Gobbledygook, Dale-Chall readability formula.
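Two of these indices, the Automatic Readability Index and the Coleman-Liau Index, rely only on character, word, and sentence counts, so they are easy to sketch. The code below uses the classical coefficients published for English; this is an assumption for illustration, since the Corpus may use coefficients adapted for Russian. The tokenization (regular expressions for words and sentence ends) and the function names are likewise hypothetical simplifications.

```python
import re

def text_counts(text):
    """Return (characters in word tokens, word count, sentence count)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text)
    letters = sum(len(w) for w in words)
    return letters, len(words), len(sentences)

def automated_readability_index(text):
    """ARI with the classical English coefficients (assumed, not RNC-specific)."""
    letters, words, sentences = text_counts(text)
    return 4.71 * letters / words + 0.5 * words / sentences - 21.43

def coleman_liau_index(text):
    """Coleman-Liau with the classical English coefficients (assumed)."""
    letters, words, sentences = text_counts(text)
    letters_per_100 = letters / words * 100
    sentences_per_100 = sentences / words * 100
    return 0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8

if __name__ == "__main__":
    sample = "The cat sat. The dog ran."
    print(text_counts(sample))  # (18, 6, 2)
    print(round(automated_readability_index(sample), 2))
    print(round(coleman_liau_index(sample), 2))
```

Very short, simple texts such as the sample can produce negative scores; the indices are meant for longer passages.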

15.04.2024

In anticipation of the 20th anniversary of the National Corpus, we have significantly updated the publications page on our website. The list of publications about the Corpus has grown roughly fivefold. The section now includes both academic articles and other types of publications, such as interviews, instructions, and social media posts.

The publications page now has advanced functionality: you can find a publication about the Russian National Corpus using the search bar or the filters on the right.

By default, the most popular filters are shown to the user. To see all available filters on the publications page, click "Show all". Combining multiple filters narrows the search and allows publications to be selected using multiple criteria.

Some publications can be downloaded by clicking on the icon to the right of the title. Other publications open in a separate window. You can share the list of selected publications by clicking on the "Copy link" button.

The Russian National Corpus is a representative collection of texts in Russian, containing more than 2 billion tokens and equipped with linguistic annotation and search tools.
