RNC News

The Panchronic Corpus has been created as part of the RNC. It combines three historical corpora (the Old East Slavic corpus, the Birchbark letters corpus and the Middle Russian corpus) with the Main corpus. Taken together, the Panchronic Corpus covers a thousand years of Russian language history, from the 1020s to the 2020s. Within the Panchronic Corpus you can build a single query and get results across this entire chronological range.

For this purpose, we have unified the presentation of lexical, orthographic and semantic markup. A lemma can be queried in its Early Old East Slavic (съвѣдѣтель), Middle Russian (свѣдѣтель or свидѣтель) or Modern Russian (свидетель) form; each of these queries returns both historical and modern examples. Similarly, word forms can be specified in different spellings. Historical texts have been given lexical-semantic annotation.

Concordances and frequency charts for all ten centuries are now available to the user for such queries as "preposition po with locative case", "history of the noun zabava", "distribution of verbs of motion with abstract nouns as subjects", "proper names in -slav".

The corpus of regional media is now searchable for collocations. Collocations are combinations of words that occur together more often than would be expected by chance, so this search mode takes a statistical approach. The collocations are ranked using the following statistical measures: Dice, log-likelihood, t-score, MI3 and an aggregated measure (the geometric mean of the t-score and MI3 measures).
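As an illustration of what such measures look like, here is a minimal Python sketch based on the common textbook definitions of Dice, t-score and MI3. The exact formulas used by the RNC are not stated here, so the function names, the argument names (`f_xy` for the co-occurrence frequency, `f_x` and `f_y` for the word frequencies, `n` for the corpus size) and the handling of non-positive scores in the aggregated measure are all assumptions. Log-likelihood is left out to keep the sketch short.

```python
import math


def dice(f_xy, f_x, f_y):
    """Dice coefficient: share of the two words' occurrences that are joint."""
    return 2 * f_xy / (f_x + f_y)


def t_score(f_xy, f_x, f_y, n):
    """t-score: observed minus expected co-occurrences, scaled by sqrt(observed)."""
    if f_xy <= 0:
        raise ValueError("f_xy must be positive")
    expected = f_x * f_y / n
    return (f_xy - expected) / math.sqrt(f_xy)


def mi3(f_xy, f_x, f_y, n):
    """MI3: mutual information with the co-occurrence frequency cubed."""
    if f_xy <= 0:
        raise ValueError("f_xy must be positive")
    return math.log2(f_xy ** 3 * n / (f_x * f_y))


def aggregated(f_xy, f_x, f_y, n):
    """Geometric mean of t-score and MI3.

    Assumption of this sketch: if either score is not positive, the
    geometric mean is undefined and 0.0 is returned instead.
    """
    t = t_score(f_xy, f_x, f_y, n)
    m = mi3(f_xy, f_x, f_y, n)
    if t <= 0 or m <= 0:
        return 0.0
    return math.sqrt(t * m)


if __name__ == "__main__":
    # Hypothetical counts: the pair occurs 10 times, the words 100 and 50
    # times, in a corpus of 10,000 tokens.
    print(dice(10, 100, 50))
    print(t_score(10, 100, 50, 10_000))
    print(mi3(10, 100, 50, 10_000))
    print(aggregated(10, 100, 50, 10_000))
```

Measures of this kind reward different things: the t-score favours frequent pairs, while MI3 and Dice are more sensitive to how exclusively the two words occur together, which is one reason to combine them.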

For more information about the new functionality, see here.

The parallel corpus now contains 168.8 million tokens. The Czech, German, English, French, and Spanish language pairs have been expanded with new texts.

The Church Slavonic corpus has been updated and now comprises 5.2 million tokens. It partially includes the "Green Menaion" edition of 2002, which uses the civil (modernized) orthography. The Church Slavonic corpus now features more detailed metatextual annotation. All the texts are annotated by date of publication. Texts of the Modern era are provided with dates and authorship, and the recent liturgical texts (from the 18th century onwards) also feature information about their drafting and approval.

The Middle Russian corpus has been expanded to 8.8 million tokens. Among the added texts are the volume of the "Library of Literature of Old Rus" dedicated to the 17th century (prose stories and songs), the earliest texts from the "Letters and Papers of Peter the Great", and the 16th-century "Embassy book on relations with the Crimean Khanate". The morphological annotation of the texts previously included in the corpus has been corrected and updated.

The search interface for the media corpora, both national and regional, has been updated. Media corpora are now suggested when the «Feature overview» function is activated, and their descriptions in Russian and English have been redesigned and updated.

The following changes have been made to the new interface of ruscorpora.ru:

On the home page, by clicking on the «All corpora» link, you can now open a full list of 38 corpora (including all the bilingual tiers of the parallel corpus, all the historical corpora, etc.). You can go to the search form for any corpus by clicking on its name.
The «Statistics» page also has a full list of corpora with data on the number of texts, sentences, and tokens.

The search and subcorpus selection forms for all corpora transferred to the new interface have been improved. The «Lemmas and tags» search form is expanded by default; if desired, the user may expand the «Exact search» query bar. The lemma entry field is displayed first in the query form. When selecting a subcorpus, an option is provided to select the date range of the corpus release.

Using the menu on the Search button, the user can now select their preferred type of output (concordance, KWIC, graphs, n-grams). The user's choice is remembered and used by default from then on.

When a word is clicked, «Similar words» are displayed in its pop-up window. These are words that are semantically closely related to the word in question and are used in similar contexts. The closeness coefficient shown in brackets is calculated using distributional semantics models. The models are built on the Main corpus of the RNC and provided by the RusVectōrēs project. Read more about this experiment here.
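To give an idea of how a closeness coefficient of this kind can be obtained, here is a minimal Python sketch that ranks words by cosine similarity between their vectors. This is an assumption for illustration: cosine similarity is the usual measure in distributional semantics, but the measure actually used is not specified here. The three-dimensional vectors and the `TOY_VECTORS` table are invented; real models use hundreds of dimensions learned from the corpus.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, top_n=3):
    """Return up to top_n (word, similarity) pairs, most similar first."""
    target = vectors[word]
    scored = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Invented toy vectors, for illustration only.
TOY_VECTORS = {
    "свидетель": [0.9, 0.1, 0.3],
    "очевидец": [0.8, 0.2, 0.4],
    "печь": [0.1, 0.9, 0.2],
}

if __name__ == "__main__":
    for other, score in most_similar("свидетель", TOY_VECTORS):
        print(f"{other} ({score:.2f})")
```

With these toy vectors, очевидец is ranked above печь for the query свидетель, because its vector points in nearly the same direction.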

It is planned to gradually transfer the other corpora to the new interface and platform. Feel free to use the new version of the site and report any bugs you notice.

The accentological corpus has been updated and now contains 133.8 million tokens.
The spoken corpus has been updated and now contains 13.9 million tokens.

The search interface of the Main corpus has been significantly updated. The RNC team is committed to making corpus search meet modern standards, and we take user feedback and usability considerations into account.

For users who are just getting acquainted with the new interface or with the corpus itself, the start page has a new feature called "Feature Overview". By entering a word or phrase, the user can see which search queries and results are available in the RNC, learn about possible errors in building a query, and navigate directly to the search.

The interface of the Main corpus has undergone the following changes:

Within the lemma and tags search form of the Main corpus, the query constructor blocks that specify the features of separate Words are now situated from left to right, rather than from top to bottom. This allows the user to add as many Words as they want. For each Word only the features the user needs for their query can be specified. The set of features available for query within the main corpus now has a separate "word form" field that allows for specifying exact tokens.

As features are specified, a search formula combining the specified values appears at the top of the pop-up window. When text attributes are being specified, the feature lists now automatically take into account changes in text annotation.

Subcorpora can now be customized both before and after the query, and instead of pop-ups with multiple features for fiction and nonfiction texts, more compact lists have been designed.

Both query and subcorpus parameters are automatically saved and can be edited at any time. 

The search results page displays all the query parameters and subcorpus parameters (if the latter are specified). All the settings and sorting method choices are now displayed at the top of the page and saved in the user's browser. 

This list of changes is far from complete. You can read more about the changes in the user manual.

The internal structure of the system has also changed significantly. The main corpus has been moved to a new corpus platform, developed within the framework of the grant 075-15-2020-793. The corpus platform, the corpus configuration, and the user interface are now separate parts of the RNC that are nevertheless linked via an API. 

A gradual migration of the remaining corpora to the new interface and platform is under way.
Feel free to use the new version of the site and report any bugs you find.

Our large corpora now feature new layers of annotation built using neural network methods, namely lemmatization and grammatical annotation with automatic disambiguation, and automatic syntactic parsing. This annotation is searchable within the Regional and international media corpus; at the next stage it will become available for the Main and Media corpora.
Morphological homonyms are automatically disambiguated throughout the regional corpus: for example, the noun печь is now tagged differently from the verb печь, and the dative case is marked separately from the prepositional case. The user can search for syntactic parameters such as different types of multiclausal sentences, clauses, complements, copulas, vocatives, and many others. The syntactic annotation within the Regional corpus is organized differently from that of the separate Syntax corpus and is more strongly oriented towards constituent structure.
Feel free to use the new search features and report any errors you notice to us.

The Syntax corpus has been significantly updated with information about the texts, namely the gender of the author, the topic and type of the text, its source, and the date when the document was annotated and added to the corpus. For sentences with invariable multiword units (such as потому что or по меньшей мере), two variants of the sentence structure are shown, treating these multiword units either as a single token (that is, a single structural node) or as multiple ones. The size of the corpus has reached 1.5 million tokens.

The Parallel Corpus has been updated. The Czech-Russian part includes materials from modern Czech media, as well as fiction and journalism of the 19th-21st centuries. The French-Russian part features fiction and academic texts. The overall size of the parallel corpus is 166 million tokens.

Since August, the new version of the Russian National Corpus is the only one available for searching the entire corpus. The old version of the corpus is closed.

The Russian and English-Russian multimedia parallel corpora have been improved, and a number of minor errors in these corpora have been fixed.

More search results can now be downloaded in Excel format from the Main and Media corpora. Up to 5,000 examples can be saved to an Excel table, no matter how the search results are customized.

In the Multimedia corpus, detailed querying of gestures (specifying active vs. passive organs) has been enabled, and some other search errors have been fixed.

The main corpus has reached the size of 375 million tokens. It has been updated with new texts, including but not limited to: diaries and memoirs of the 19th-21st centuries from the «Prozhito» project; pre-revolutionary fiction, journalism and private letters in both old and modern orthography, including mass literature; post-1917 and contemporary prose; a collection of tourist guides; a collection of different academic genres (abstracts, programs, textbooks, problems); and a collection of technical guides and instructions.

The Old East Slavic corpus is now sortable by date, including the date of the manuscript, and by genre.

The RNC website has been redesigned. The start page and the pages with general information on the Corpus are now displayed in a new interface. The project description has been revised and updated. Current information on the structure and composition of the subcorpora and other pages is now available. A FAQ section has been added, explaining the main features of the Corpus.

The English version of the site has also been partially updated. The new website is fully adapted for mobile devices.

The search query and search results pages have not been redesigned yet. Gradually, all of them will switch to the new interface. Please use the new version of the site and feel free to report any errors you notice.

The Old East Slavic corpus has been updated and now contains 655 thousand tokens. It includes texts of the 11th-14th centuries, representing a variety of genres. Among them are such famous works as the Lives of Boris and Gleb, The Testament of Vladimir Monomakh and The Tale of Igor's Campaign, as well as other hagiographic, didactic and canonical texts. A collection of Old Novgorod business documents (gramoty), both on parchment and paper, has been added. The Old East Slavic metatextual information now contains the date of the text and the date of the surviving copy.

The corpus of birchbark letters is now a parallel corpus: it presents original texts aligned with their translations into Russian and English.

The poetry corpus has also been updated and now contains 13 million tokens. The update includes poems by A. Vertinsky, G. Sapgir and others.

The parallel corpus now contains almost 163 million words. It has been updated with two new language pairs: Portuguese-Russian and Romanian-Russian. The Finnish-Russian text collection has been significantly expanded and now includes translations of fiction and journalistic texts, as well as a corpus of international treaties (we thank Mikhail Mikhailov, who provided the texts). The collections of English and German texts in Russian translation have also been expanded.

Within the spoken corpus, a new search field 'Region' is now available.
Within the Old East Slavic corpus, it is now possible to search homonyms by semantics. In the Middle Russian corpus, a suggestion list has been attached to the Lemma field.