The Russian National Corpus is a representative collection of Russian texts containing more than 2 billion tokens and supplied with linguistic annotation and search tools.
Search in corpora
News
The Russian MultiPARC has been expanded and comprises almost 300 thousand tokens. It now features Chekhov's play “Three Sisters” as staged by four different theaters: the Gorky Moscow Art Theater, the Maly Theater, the Fomenko Workshop, and the Sovremennik Theater.
The Russian MultiPARC provides an opportunity for comparative study of the same phrase uttered by different speakers in the same circumstances. Comparing different utterances of the same phrase makes it possible to determine which of its intonational, structural, phonetic, and gestural features are obligatory and reproduced by all speakers, and which are unique or accidental.
Materials on the structure of the corpus and its functions can be found here.
The Russian National Corpus is a powerful tool for analyzing and researching language. It contains millions of texts that allow its users to better understand the language in all the diversity of its forms. One of the most important aspects of processing the corpus is analyzing statistical data.
The summary statistics of the RNC are available from the main page. This section contains data on the size of the corpora included in the RNC (texts, sentences, and tokens), as well as tables showing the distribution of texts in the Main corpus by type and other metatextual parameters.
By clicking on a corpus name in the table, you can navigate to the statistics in the Corpus portrait of the selected corpus. You can also navigate to the corpus statistics from the query form by clicking on the icon (i). Corpus statistics are currently available for the Main, Educational, and Media corpora, some historical corpora, as well as “Russian Classics” and “From 2 to 15”.
In corpora with advanced statistics, you can compare a customized subcorpus with the entire corpus. To view the comparison data, click on the icon (i) in the subcorpus header.
The parallel corpus was expanded by 3 million tokens. Half of this amount is accounted for by English-language non-fiction texts (popular science and journalistic). In addition, the German and Spanish language pairs have been updated, mainly with works of fiction.
In the three language pairs that feature field transcripts of spoken texts, Veps, Karelian, and Khakas, subcorpus selection by dialect is available.
For users who are just getting acquainted with the Corpus, the "Features Overview" is available on the main page.
In October, we enhanced this service by adding new widgets and making existing widgets more informative. Now, the "Features Overview" is common across all RNC corpora.
A new text widget has been introduced that allows users to familiarize themselves with the basic terms used in the RNC interface, learn how to start a search, understand the different types of searches available, and find out where to read more about them.
Lemma and tags search, exact search, and collocation search now yield results only from the Main Corpus.
The "Random Poem" widget now displays not only the poem itself but also its title, author, and date of creation.
The names of the corpora in the widget headers are now clickable. By clicking on the link, users will be directed to the "Portrait of the Corpus," where they can explore its structure, learn more about the creators of the corpus, and read publications about it.