The Russian National Corpus is a representative collection of texts in Russian, containing more than 2 billion tokens and equipped with linguistic annotation and search tools.
News
The Russian National Corpus is a powerful tool for analyzing and researching language. It contains millions of texts that allow its users to better understand the language in all the diversity of its forms. One of the most important aspects of working with the corpus is analyzing statistical data.
The summary statistics of the RNC are available from the main page. This section contains data on the size of the corpora included in the RNC (texts, sentences, and tokens), as well as tables showing the distribution of texts in the Main corpus by type and other metatextual parameters.
By clicking on a corpus name in the table, you can navigate to the statistics in the Corpus portrait of the selected corpus. You can also navigate to the corpus statistics from the query form by clicking on the icon (i). Corpus statistics are currently available for the Main, Educational, and Media corpora, some historical corpora, as well as “Russian Classics” and “From 2 to 15”.
In corpora with advanced statistics, you can compare a customized subcorpus with the entire corpus. To view the comparison, click on the icon (i) in the subcorpus header.
The parallel corpus was expanded by 3 million tokens. Half of this amount is accounted for by English-language non-fiction texts (popular science and journalistic). In addition, the German and Spanish language pairs have been updated, mainly with works of fiction.
In the three language pairs that feature field transcripts of spoken texts (Veps, Karelian, and Khakas), subcorpus selection by dialect is available.
For users who are just getting acquainted with the Corpus, the "Features Overview" is available on the main page.
In October, we enhanced this service by adding new widgets and making existing widgets more informative. Now, the "Features Overview" is common across all RNC corpora.
A new text widget has been introduced that allows users to familiarize themselves with the basic terms used in the RNC interface, learn how to start a search, understand the different types of searches available, and find out where to read more about them.
Lemma and tags search, exact search, and collocation search now yield results only from the Main Corpus.
The "Random Poem" widget now displays not only the poem itself but also its title, author, and date of creation.
The names of the corpora in the widget headers are now clickable. By clicking on the link, users will be directed to the "Portrait of the Corpus," where they can explore its structure, learn more about the creators of the corpus, and read publications about it.
The collections in the Accentological and Spoken corpora were updated. We added transcripts of expert talks, oral recollections, and everyday dialogic speech. These texts were recorded in different regions, including Moscow, Tomsk, and Voronezh Oblasts and the Republics of Buryatia and Mari El.
We would like to thank the following for collecting and processing the texts: students and staff of the Voronezh State University, students of the Lomonosov Moscow State University, Grigori Korotkikh (Ilshat association, Tomsk), and Egor Kashkin (Group for the Study of Contact Interaction of Russian with Indigenous Languages of Russia, Vinogradov Russian Language Institute).
The size of the Spoken Corpus is 14.8 million tokens; the total size of the Accentological corpus, including naive poetry, is 134.8 million tokens.
In both corpora, it is now possible to select texts by number of word forms. In the subcorpus selection form, regions in the Spoken Corpus are now grouped by country for easier searching.