RNC News

This coming Sunday, December 10, from 10:00 to 18:00 Moscow time, technical maintenance will be carried out on our servers.

As a result, the website may be briefly unavailable; interruptions will last no more than 1 hour.

The speech collections within the Accentological and Spoken corpora have been expanded. Transcripts of academic and political talks, TV and radio broadcasts, personal oral history, and everyday dialogic speech have been added. The Spoken corpus now amounts to 14 million tokens, and the overall size of the Accentological corpus, including the naive poetry collection, is 134.8 million tokens.

The Parallel corpus has been expanded by 3 million tokens. New texts have been added to the Czech, English, French, German, Portuguese, and Spanish pairs with Russian. In particular, the English-Russian tier was updated with a collection of transcripts of public TED Talks, while the Portuguese-Russian subcorpus has almost doubled in size and now also includes texts created in Portuguese-speaking Africa.

In the Social Networks corpus, genres are now marked automatically for all texts. Users can select one or more genres from the list. Several new genres have been identified, such as picture captions.
Properties generated by NeuroRNC are marked with a special icon. If you notice an inaccuracy or error, feel free to report it using the “Report an Error” button in the same window.

You can define a subcorpus within the Regional Media corpus by specifying date intervals accurate to the day. For example, you can explore the use of the word милиция.

Within the same corpus, graphs are now plotted using days, months, or years as the unit of measurement. The default unit is now the month, and you can switch between days, months, and years on the charts. The option is available in the search results and in the Get overview, Compare queries, and Word at a glance functions.
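To illustrate what switching the unit of measurement means for a graph, here is a minimal Python sketch that groups dated hits into day, month, or year buckets. It is not the RNC implementation: the function names (`bucket`, `counts_by_unit`) and the sample dates are hypothetical and used only for illustration.

```python
from collections import Counter
from datetime import date


def bucket(d: date, unit: str) -> str:
    """Return the label of the time bucket that date d falls into."""
    if unit == "day":
        return d.isoformat()                      # e.g. "2010-03-15"
    if unit == "month":
        return f"{d.year:04d}-{d.month:02d}"      # e.g. "2010-03"
    if unit == "year":
        return f"{d.year:04d}"                    # e.g. "2010"
    raise ValueError(f"unknown unit: {unit}")


def counts_by_unit(hit_dates, unit="month"):
    """Count hits per bucket; "month" mirrors the default unit described above."""
    counts = Counter(bucket(d, unit) for d in hit_dates)
    return dict(sorted(counts.items()))


# Hypothetical publication dates of texts that contain the query word.
hits = [date(2010, 3, 1), date(2010, 3, 15), date(2010, 4, 2), date(2011, 1, 9)]

print(counts_by_unit(hits))           # grouped by month
print(counts_by_unit(hits, "year"))   # the same hits grouped by year
```

The same hits produce a finer or coarser curve depending on the unit, which is why a trend visible by year can look noisy by day.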

Since December 2023, two registration and authorization options have been available: directly on the website and via Yandex ID.

A small fraction of users registered on the RNC via ORCID.org. Due to changes in the legislation of the Russian Federation that have come into effect, this authorization option is no longer available. We apologize for any inconvenience caused to users who previously registered via ORCID.org. Please register again to access the advanced functionality of the RNC.

The Word at a glance service is now available not only in the Main corpus but in all corpora that use the new interface.

Accessing the Word at a glance service has become easier. The header of each corpus now has a button that opens the service for that particular corpus.

Don't forget that there are other ways to access the Word at a glance service:

  • From the homepage of the RNC you can go to Word at a glance within the Main corpus, find a word, and then switch to any other corpus to see the service for the same word in that corpus.
  • A link to the Word at a glance service is also available in the search results, in the word-by-word annotation pop-up window.
  • The Word at a glance service can be accessed via direct links. Try it for the word слово within the Old East Slavic corpus and within the Parallel Russian-Chinese corpus.

The Multimedia corpus has been expanded to 5.8 million words. The corpus features new collections of public and non-public speech recorded in different regions, as well as collections of TV publicity and theatrical speech.

Graphs are one of the most sought-after tools for analyzing search results in a corpus. To help you draw well-grounded conclusions from them, we have added several auxiliary tools to the graphs.

Using the "windows" that display dates and frequencies in the graphs, you can zoom in or out on parts of the graph and move along the axes. This is useful when you want to look at a narrower time or frequency range within a larger amount of data.

In the Main, Media, and Educational corpora, "warming stripes" below the graphs show the number of texts in which examples are found. The color intensity of the stripes signals that a change in the shape of the graph does not necessarily reflect a change in how often the word is used; it may instead be due to the small number of texts found. In such cases, you may plot the graph without smoothing to verify your findings.

A graph can now be downloaded as a high-resolution image.
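The following Python sketch shows why a smoothed curve should be read together with the number of texts behind it. The source does not say how the RNC smooths its graphs, so the centered moving average, the threshold of 5 texts, and all sample values here are assumptions made purely for illustration.

```python
def moving_average(values, window=3):
    """Centered moving average; at the edges, only the existing part of the window is used."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def low_support_points(text_counts, min_texts=5):
    """Indices of periods backed by fewer than min_texts texts (hypothetical threshold)."""
    return [i for i, n in enumerate(text_counts) if n < min_texts]


# Hypothetical data: relative frequency per period and the number of texts per period.
freq = [10.0, 12.0, 40.0, 11.0, 9.0]
texts = [50, 48, 2, 51, 47]

smoothed = moving_average(freq)
print(smoothed)                   # the spike at index 2 is spread over its neighbours
print(low_support_points(texts))  # periods where the raw value rests on very few texts
```

In this made-up series the peak comes from a period with only two texts, yet smoothing raises the neighbouring periods as well. Comparing the raw and smoothed values against the text counts makes such cases visible.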

As a reminder, two weeks ago we updated the Educational corpus and added state-of-the-art tools to it.

We had so many updates that they did not all fit in the announcements of the previous release, so we are continuing to keep you posted.

Meet the updated RNC School page. Here you will find useful information on how to use the Russian National Corpus in Russian language and literature lessons and for independent work at school and at home.

We have updated the section with materials for teachers, as well as a collection of exercises, including complex ("olympiad") assignments.

The Word at a glance service within the Educational corpus has been supplemented with the Morphemic structure widget. In other RNC corpora, morphemic structuring is based on a morphemic analysis dictionary specially developed for the corpus. That structuring is intended for researchers and may not coincide with the analysis taught at school.

A special version of the widget is implemented in the Educational corpus. Here the morphemic structure of a word is determined in accordance with the practice of morphemic analysis at secondary school and is based on Alexander Tikhonov's Morpheme Orthographic Dictionary (2002), which contains about 100 thousand lexemes. For words not included in the dictionary, the morphemic structure is not determined. For more details on the morphemic annotation within the RNC, see the description.

The Russian National Corpus now has a brand-new historical subcorpus of East Slavic Epigraphy (Inscriptions).

This corpus includes 663 morphologically annotated, mainly brief texts dating back to the 11th-15th centuries from the territories of modern Ukraine, Russia, and Belarus, as well as texts found outside Eastern Europe, for example in Germany, France, or Turkey. These are inscriptions on the walls of churches, on stones, and on objects found during excavations or stored for many centuries. They are a valuable source both on the history of everyday Old East Slavic speech and on the existence of ecclesiastical, literary, and folklore texts. From each text users can navigate to the epigraphica.ru database, which provides more details about the text along with photographs.

The Old East Slavic corpus has been updated and now exceeds 800 thousand tokens. The new texts are several dozen official documents dating back to the 12th-14th centuries: princely laws and charters from Novgorod, Polotsk, Ukraine, and the Grand Duchy of Lithuania. For the first time the early corpus includes the ancestors of such modern Russian words as блистать, больница, великолепие, доход, дружить, наслаждение, околица, простоволосый, ящерица.

The Russian and English-Russian parallel MultiPARCs have moved to the new interface. The Word at a glance and Get overview features are now available.

To select your own subcorpora and use our statistical services, you need metatextual annotation. The RNC already contains more than 6 million texts and is constantly growing. Marking up such a large amount of data manually is becoming less and less feasible, so we are developing neural annotation services (NeuroRNC). Today we present new results in this field.

Keywords in the texts of the Regional Media corpus are annotated automatically using an adapted rutermextract model. A keyword can consist of a single token (праздник, переломы) or a two-word combination (таяние снега). A single-token query (сообщество) yields both exact matches and two-word combinations containing this word (католическое сообщество).
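The matching behavior described above can be sketched in a few lines of Python. This is only an illustration of the rule "a single-token query matches the keyword itself and two-word keywords that contain it". The function name `keyword_matches` is hypothetical, and the sketch compares word forms exactly, whereas the actual search may normalize words differently.

```python
def keyword_matches(query: str, keyword: str) -> bool:
    """True if the single-token query equals one of the tokens of the keyword."""
    return query.lower() in keyword.lower().split()


# Keywords taken from the examples in the text above.
keywords = ["праздник", "переломы", "таяние снега", "сообщество", "католическое сообщество"]

found = [k for k in keywords if keyword_matches("сообщество", k)]
print(found)  # the exact match plus the two-word combination containing the query
```

Splitting the keyword into tokens, rather than testing for a substring, keeps the query from matching inside longer unrelated words.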

In the Social Networks corpus, genres are automatically labeled for the main corpus texts. The RuRoBERTa model, fine-tuned on the corpus texts, is used for annotation. One or more genres can be selected from a list, e.g., recommendations and advice.

In the text information, the fields filled in by NeuroRNC are marked with a special icon. The same pop-up window contains a "Report a bug" button. Please let us know about any inaccuracies or errors in the assigned keywords and genres.