RNC News

The RNC sums up the results of 2022. There have been many changes this year: the size of the whole Corpus reached 1.5 billion tokens, two new subcorpora were inaugurated (Panchronic and "From 2 to 15"), the corpus of birchbark letters went parallel, and the regional corpus is now disambiguated and features collocation and frequency tools. The old version of the RNC is now closed. The RNC is also migrating to a new interface. All the changes are shown in more detail in the picture.

The Birchbark letter corpus is updated with the texts of archaeological findings from the year 2021, published in 2022. These are the new documents from Veliky Novgorod and Staraya Russa, as well as the very first birchbark letter discovered in Pereyaslavl Ryazansky (modern-day Ryazan). 

A new corpus "From 2 to 15" featuring 75 original and translated texts in prose, read by modern children and teenagers, is inaugurated within the RNC. The main distinguishing feature of the new corpus is the automatic annotation of text fragments according to the age of readers that should be able to understand these fragments. The model is being tested, so mistakes can still appear in this annotation.

The Educational corpus interface has been substantially updated. The corpus is now available with the Get overview feature, and its description has been updated.

A new section, "Corpus-based exercises", has been developed. It presents exercises based on the Educational Corpus and other corpora within the RNC. The exercises cover different sections of the school Russian course and are designed for independent work in the classroom and at home, as well as for testing. We plan to develop the section further and add new exercises. Corpus users can also take part in the process: we invite teachers and professors to use the corpus to compose their own assignments and exercises. Feel free to send us your best ideas at info@ruscorpora.ru. We will publish them in the exercises section.

Each corpus within the RNC has its own Corpus Portrait. The Corpus Portrait is a tool that allows an RNC user to analyze the characteristics of a given corpus and assess whether it is suitable for their research or teaching needs. At this stage, the Corpus Portrait includes:

* a description of the corpus

* a frequency dictionary (only in the Regional Media Corpus)

All the RNC corpora have tags on a meta-corpus level, which makes it possible to categorize them by historical period, text type, presence of specific annotation, etc.

If a subcorpus has been customized, users also have access to the respective "Subcorpus Portrait" function. With this tool, by clicking the (i) link in the header of the corpus, one can see the list of selected texts and compare the statistical characteristics of the corpus to the respective parameters of the customized subcorpus. For example, one can compare the frequency dictionary of the regional corpus to those of subcorpora selected within it.

In 2023, more statistics will appear in the Corpus and Subcorpus Portraits.

The Panchronic Corpus has been created as part of the RNC. It combines three historical corpora (the Old East Slavic corpus, the Birchbark letters corpus and the Middle Russian corpus) with the Main corpus. Taken together, the Panchronic Corpus covers a thousand years of Russian language history, from the 1020s to the 2020s. Within the Panchronic Corpus you can build a query and find results across this entire chronological range at once.

For this purpose, we have unified the presentation of lexical, orthographic and semantic markup. A lemma can be queried in Early Old East Slavic (съвѣдѣтель), Middle Russian (свѣдѣтель or свидѣтель) or Modern Russian (свидетель): both historical and modern examples can be found for each of these queries. Similarly, word forms can be specified in different spellings. Historical texts have been given lexical-semantic annotation.

Concordances and frequency charts for all ten centuries are now available to the user for such queries as "preposition po with locative case", "history of the noun zabava", "distribution of verbs of motion with abstract nouns as subjects", "proper names in -slav".

The corpus of regional media is searchable for collocations. This search mode uses a statistical approach: collocations are combinations of words that occur together more often than would be expected by chance. The statistical measures used to calculate collocations are Dice, log-likelihood, t-score, MI3 and an aggregated measure (the geometric mean of the t-score and MI3 measures).
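The measures named above can be illustrated with a minimal sketch. It uses the common textbook definitions of these measures; the exact formulas used by the RNC are not given in this text, so the function name, the parameter names and the formulas below are assumptions for illustration only.

```python
import math

def collocation_scores(f_xy, f_x, f_y, n):
    """Association scores for a word pair (hypothetical helper).

    f_xy: co-occurrence count of the pair, f_x and f_y: counts of
    each word, n: corpus size in tokens. Textbook definitions are
    assumed here, not the RNC's actual implementation.
    """
    expected = f_x * f_y / n  # co-occurrences expected by chance
    dice = 2 * f_xy / (f_x + f_y)
    t_score = (f_xy - expected) / math.sqrt(f_xy)
    mi3 = math.log2(f_xy ** 3 / expected)

    # Log-likelihood (G2) over the 2x2 contingency table.
    observed = [f_xy, f_x - f_xy, f_y - f_xy, n - f_x - f_y + f_xy]
    rows = [f_x, n - f_x]
    cols = [f_y, n - f_y]
    expected_cells = [r * c / n for r in rows for c in cols]
    log_likelihood = 2 * sum(
        o * math.log(o / e)
        for o, e in zip(observed, expected_cells)
        if o > 0
    )

    # Geometric mean of t-score and MI3; defined only when both are positive.
    if t_score > 0 and mi3 > 0:
        aggregated = math.sqrt(t_score * mi3)
    else:
        aggregated = 0.0

    return {
        "dice": dice,
        "log_likelihood": log_likelihood,
        "t_score": t_score,
        "mi3": mi3,
        "aggregated": aggregated,
    }

# Example with made-up counts: the pair occurs 10 times,
# the words 20 and 30 times, in a corpus of 1000 tokens.
scores = collocation_scores(10, 20, 30, 1000)
```

A pair that co-occurs exactly as often as chance predicts gets a t-score and log-likelihood of zero under these definitions, which is why higher values point to a stronger collocation.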

For more information about the new functionality, see here.

The parallel corpus counts 168.8 million tokens. The Czech, German, English, French, and Spanish language pairs have been expanded with new texts.

The Church Slavonic corpus is updated and comprises 5.2 million tokens. It partially includes the "Green Menaion" edition of 2002, using the civil (modernized) orthography. The Church Slavonic corpus features a more detailed metatextual annotation. All the texts are annotated by date of publication. Texts of the Modern era are provided with dates and authorship, and the recent (beginning with the 18th century) liturgical texts also feature information about their drafting and approval.

The Middle Russian corpus has been expanded to 8.8 million tokens. Among the added texts are a volume of the "Library of Literature of Old Rus" dedicated to the 17th century (prose stories and songs), the earliest texts from the "Letters and Papers by Peter the Great", as well as the 16th-century "Embassy book on relations with the Crimean Khanate". The morphological annotation of the texts previously included in the corpus has been corrected and updated.

The search interface for the media corpora, both national and regional, has been updated. Media corpora are now suggested when the «Feature overview» function is activated, and their descriptions in Russian and English have been redesigned and updated.

The following changes have been made to the new interface of ruscorpora.ru:

On the home page, by clicking on the «All corpora» link, you can now open a full list of 38 corpora (including all the bilingual tiers of the parallel corpus, all the historical corpora, etc.). You can go to the search form for any corpus by clicking on its name.
The «Statistics» page also has a full list of corpora with data on the number of texts, sentences, and tokens.

The search and subcorpus selection forms for all corpora transferred to the new interface have been improved. The «Lemmas and tags» search form is expanded by default; if desired, the user may expand the «Exact search» query bar. The lemma entry field is displayed first in the query form. When selecting a subcorpus, an option is provided to select the date range of the corpus release.

Using the menu on the Search button, the user can now select their preferred type of output (concordance, KWIC, graphs, n-grams). The user's choice is remembered and used by default from then on.

Clicking a word opens a pop-up window that displays «Similar words». These are words that are semantically close to the word in question and are used in similar contexts. The closeness coefficient in brackets is calculated using a distributional semantics model. The model is built on the main corpus of the RNC and provided by the RusVectōrēs project. Read more about this experiment here.
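As a rough illustration of how such a closeness coefficient can be obtained, the sketch below computes cosine similarity between two word vectors. Cosine similarity is a common choice for distributional models, but this text does not state which measure is actually used, so treat it as an assumption. The function name and the toy vectors are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional vectors (made up); real models use hundreds of dimensions.
vec_a = [0.9, 0.1, 0.3]
vec_b = [0.8, 0.2, 0.4]
closeness = cosine_similarity(vec_a, vec_b)
```

Values close to 1 indicate words that occur in very similar contexts, while values near 0 indicate unrelated contexts.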

It is planned to gradually transfer the other corpora to the new interface and platform. Feel free to use the new version of the site and report any bugs you notice.

The accentological corpus has been updated and has reached 133.8 million tokens.
The spoken corpus has been updated and now contains 13.9 million tokens.

The search interface of the Main corpus has been significantly updated. The RNC team is committed to making the corpus search meet modern standards. We take into account user feedback and usability parameters.

For users who are just getting acquainted with the new interface or with the corpus itself, the start page has a new feature called "Feature Overview". By entering a word or phrase, the user can explore the search queries and results available in the RNC, learn about possible errors in building a query, and navigate directly to the search.

The interface of the Main corpus has undergone the following changes:

Within the lemma and tags search form of the Main corpus, the query constructor blocks that specify the features of separate Words are now situated from left to right, rather than from top to bottom. This allows the user to add as many Words as they want. For each Word only the features the user needs for their query can be specified. The set of features available for query within the main corpus now has a separate "word form" field that allows for specifying exact tokens.

As features are specified, a search formula combining the specified values appears at the top of the pop-up window. When text attributes are specified, the feature lists now automatically take into account changes in text annotation.

Subcorpora can now be customized both before and after the query, and instead of pop-ups with multiple features for fiction and nonfiction texts, more compact lists have been designed.

Both query and subcorpus parameters are automatically saved and can be edited at any time. 

The search results page displays all the query parameters and subcorpus parameters (if the latter are specified). All the settings and sorting method choices are now displayed at the top of the page and saved in the user's browser. 

This list of changes is far from complete. You can read more about the changes in the user manual.

The internal structure of the system has also changed significantly. The main corpus has been moved to a new corpus platform, developed within the framework of the grant 075-15-2020-793. The corpus platform, the corpus configuration, and the user interface are now separate parts of the RNC that are nevertheless linked via an API. 

A gradual migration of the remaining corpora to the new interface and platform is under way.
Feel free to use the new version of the site and report any bugs you find.

Our large corpora now feature new layers of annotation built using neural network methods, namely lemmatization and grammatical annotation with automatic disambiguation, and automatic syntactic parsing. This annotation is searchable within the Regional and international media corpus; at the next stage it will become available for the Main and Media corpora.
Morphological homonyms are automatically disambiguated throughout the regional corpus: for example, the noun печь is now tagged differently from the verb печь, and the dative case is marked separately from the prepositional case. The user can search for syntactic parameters such as different types of multiclausal sentences, clauses, complements, copulas, vocatives, and many others. The syntactic annotation within the Regional corpus is organized differently from that of the separate Syntax corpus and is more strongly oriented towards the syntax of constituents.
Feel free to use the new search features and report any errors you notice.

The Syntax corpus has been significantly updated with information about the texts, namely the gender of the author, the topic and type of the text, its source, and the date when the document was annotated and added to the corpus. For sentences with invariable multiword units (such as потому что or по меньшей мере), two variants of sentence structure are shown, featuring these multiword units either as a single token (a single structural node) or as multiple ones. The size of the corpus has reached 1.5 million tokens.

The Parallel Corpus is updated. The Czech-Russian part includes materials from modern Czech media, as well as fiction and journalism of the 19th-21st centuries. The French-Russian part features fiction and academic texts. The overall parallel corpus size is 166 million tokens.