RNC News

In the Word at a glance service, the morphemic structure of each word is visualized: prefixes, roots, suffixes and endings are highlighted with the geometric symbols used in school teaching of Russian. The word structure annotation is based on a morphemic dictionary developed specially for the corpus. For lemmas absent from the morphemic dictionary, automatic annotation is added by the NeuroRNC algorithm. Please note that the morphemic structuring of words may differ from what you are accustomed to (see "Principles of annotation").

Errors of automatic annotation are always possible. Please report errors using the "Rate" button.

The multilingual parallel corpus is available in the new interface, as well as within the Word at a glance and Get overview services. Now all the parallel corpora are available in the new interface.

For the Old East Slavic corpus, the Word at a glance service and Word frequency widget are available.

The Poetry corpus has been expanded by 400,000 word uses. In particular, new texts by twentieth-century poets have been added, as well as a large collection of Russian translations of ancient poetry, including hexametric versions of the "Iliad", the "Aeneid" and Horace's "Satires".

All the parallel bilingual corpora are now available in the new interface.

The interface of the Old East Slavic corpus has been substantially updated, and the corpus is now connected to the Overview feature. Subcorpus selection within the Old East Slavic corpus is now available on a separate page, where you can select from a list one or more Slavic literary monuments to be searched.

In the collocations search, the user can now specify syntactic links. For example, if a user specifies решение 'solution' as the key, "verb" as a grammatical feature of the collocate, "object" as a syntactic role, and the second word as a dependent, they can find out what is most often done with solutions (принять 'accept', согласовать 'agree', etc.). The table with the search results shows the 100 most frequent collocations with this syntactic relationship. For each of these collocations you can access a list of examples by clicking on the link.
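The kind of query described above can be illustrated with a minimal Python sketch. It is not the RNC implementation: the `Edge` tuple, the relation labels `obj` and `nsubj`, and the toy data are all assumptions made for illustration. The sketch counts verbs that govern the key word решение as their object and returns the most frequent ones.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple, Tuple


class Edge(NamedTuple):
    """One dependency link (hypothetical format, not the RNC data model)."""
    head_lemma: str
    head_pos: str
    dep_lemma: str
    relation: str


def top_collocates(edges: Iterable[Edge], key: str, collocate_pos: str,
                   relation: str, limit: int = 100) -> List[Tuple[str, int]]:
    """Count heads of the given part of speech that govern `key` via `relation`."""
    counts = Counter(
        e.head_lemma
        for e in edges
        if e.dep_lemma == key
        and e.head_pos == collocate_pos
        and e.relation == relation
    )
    # The news item mentions the 100 most frequent collocations, hence limit=100.
    return counts.most_common(limit)


# Toy data invented for the example.
edges = [
    Edge("принять", "VERB", "решение", "obj"),
    Edge("принять", "VERB", "решение", "obj"),
    Edge("согласовать", "VERB", "решение", "obj"),
    Edge("принять", "VERB", "мера", "obj"),      # different dependent: ignored
    Edge("быть", "VERB", "решение", "nsubj"),    # different relation: ignored
]

result = top_collocates(edges, key="решение", collocate_pos="VERB", relation="obj")
print(result)
```

On this toy data, принять comes first with two occurrences and согласовать second with one.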

Users of the Main corpus can now get frequency dictionaries by major parts of speech: nouns, adjectives, verbs, and adverbs. The same selection is available in the subcorpus frequency dictionary. You can now specify the part of speech when comparing the most frequent lemmas of your selected subcorpus with the frequency dictionary of the whole corpus.

The parallel corpora have started migrating to the new interface. As of the end of April, the following corpora are available in it:

For each bilingual pair, within the search form you can select any of three options: exact forms search, lexico-grammatical search or bilingual search. An important innovation is that in the new interface, the bilingual search is available on the main search page rather than on a new one. Queries in Russian and other languages are entered in two different query forms. The search results are formatted in two columns. This layout is already familiar to the users of the Birchbark letters corpus. On the left you see the original, and on the right, all the available translations.

This year, the RNC actively collaborated with Total Dictation, an annual educational event that unites people who speak Russian and strive to write correctly. 

On the day of the dictation, Vladimir Plungian shared his thoughts on why the RNC is necessary for both linguists and non-linguists, how it changes, and which years were the most productive in the history of the Corpus. Watch the recording of the conversation; it's informative and exciting.

The Old East Slavic corpus features fourteen new texts with a total size of 120,000 tokens, including such famous works of Old East Slavic literature as "Sermon on Law and Grace", "Daniel Zatochnik's Prayer", "Kyivan Cave Patericon", and the Slavic translation of "The Life of Basil the Younger". The corpus now includes different textual versions of several works, such as "Tale of Bygone Years", "Life of Theodosius", and the cycle on Boris and Gleb. More than 1,000 Old East Slavic lexemes have been added to the corpus, including the ancestors of such Russian words as выискивать, известие, избранник, пчелка, невежественный, стремглав, умышлять.

We continue to update the «Word at a Glance» feature. Now you can see the "Similar Words" cloud and Word frequency in the Middle Russian corpus and the "Similar Words" cloud in the Birchbark letter corpus.

Beta testing of the «Similar words» cloud within the «Word at a Glance» option continues. Thanks to your feedback we were able to improve the vector model that looks for similar words. We look forward to new feedback on the «Similar words» clouds in the Main and Regional corpora and to your reaction to the «Similar words» cloud in the Middle Russian corpus. You can leave feedback by clicking the «Rate» button next to the feature.

The five examples in the «Word at a Glance» feature are now selected at random, which means that with each new viewing of the Word at a Glance feature there is a chance to see something new.

Word Portrait has been updated: 

  • Sketches and "Similar Words" are now available in the Regional corpus as well as in the Main corpus
  • Word frequency information has been added
  • The most frequently occurring part of speech for the target word is now displayed in the portrait (e.g. noun is displayed first for query печь, and verb is displayed first for query стать)

A new corpus of Social networks has appeared within the RNC. It features more than 160 million word uses dating from 2007 onwards. All texts are taken from open sources: VK, Telegram, Livejournal, Liveinternet, Blogspot. "Social networks" are defined as broadly as possible, including blog posts and messengers. Language in social networks is the most dynamic and free from regulatory restrictions. It reflects changes in vocabulary (including slang), semantic evolution, developments in grammar, and typical mistakes.

The Dialect corpus interface has been substantially updated, and the corpus is connected to the "Overview" feature. Metatext markup has been edited (in particular, the selection of geographical locations has been improved). Within the Dialect corpus, multimedia clips can be played directly in the search results.

We are launching a β-version of the search for user manuals, corpus descriptions, announcements, and other materials available on the RNC site. The current version of the Site Search has some limitations; please read the description.

The Word Portrait functionality in the main corpus has been improved and expanded:

The new Sketches section allows the user to understand how a word interacts with other words in the language. This interaction is defined through compatibility (collocations) with words of different parts of speech. It takes into account the various syntactic functions of a word in a sentence, which cover the main domains where a given word "works" in the language. One can explore what respect is like in Russian and what one can do with it. Another query explores the compatibility of the word bring in Russian: whereas one usually brings abstract notions (as far as the texts show us), the most frequent subjects of bringing are quite material.
For nouns, adjectives, verbs, and adverbs, up to 10 of the most closely related words in each sketch are displayed. For other parts of speech, sketches are not shown.

The Similar Words section now uses its own model to search for semantic associates, trained on actual texts from the main corpus of the RNC. The new model allowed us to reduce the number of errors. However, because the selection of similar words is fully automatic, errors (e.g. non-existent word forms) can still occur.

To see all the information on a given word, you can now use the Word at a glance functionality. As of today, the Word Portrait includes:

  • grammatical and semantic properties of the word
  • Similar words (β, only in the main corpus)
  • word usage examples in the corpus
  • distribution of examples by year and by type of text

For quick access to the Word Portrait and other corpus features as well as to the User's Guide you can now use buttons on the main page of ruscorpora.ru.

The output view Frequency has been improved: 

  • The "Contexts" column has been added
  • Grouping of results can be either disabled or applied to some words only. Users can retrieve combinations of words with any distance between them (within the distance specified in the original query). Some of the words can be grouped by lemma, word form or grammatical features, and the remainder is retrieved without grouping. For example, for the query красивый ('beautiful') + any noun, one can get the frequency distribution of all nouns found in the search results as well as the overall frequency of the combination with any noun
  • The size of the downloaded table with "raw" data can reach 5000 lines

The frequency dictionary of a subcorpus as compared to the entire corpus can be sorted by differences in lemma ranks. The lemmas that are found only in the subcorpus list of the 500 most frequent items are given first, followed by those with the highest gain in frequency rank relative to the whole corpus. For example, the frequency dictionary of texts written by women can be sorted in such a way that it starts with characteristic lemmas like девочка ('girl'), стараться ('try'), проблема ('problem'), искусство ('art'), etc.
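The ordering described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the exact RNC ranking procedure is not described here, and the function names and toy frequencies are hypothetical. Lemmas that appear only in the subcorpus top list come first, followed by shared lemmas ordered by how many rank positions they gain in the subcorpus.

```python
from collections import Counter
from typing import Dict, List


def top_ranks(freqs: Dict[str, int], top_n: int) -> Dict[str, int]:
    """Map each of the top_n most frequent lemmas to its 1-based rank."""
    ordered = Counter(freqs).most_common(top_n)
    return {lemma: rank for rank, (lemma, _) in enumerate(ordered, start=1)}


def sort_by_rank_gain(sub_freqs: Dict[str, int], corpus_freqs: Dict[str, int],
                      top_n: int = 500) -> List[str]:
    """Order subcorpus lemmas by rank difference against the whole corpus."""
    sub = top_ranks(sub_freqs, top_n)
    whole = top_ranks(corpus_freqs, top_n)
    # Lemmas present only in the subcorpus top list, kept in subcorpus rank order.
    only_in_sub = [lemma for lemma in sub if lemma not in whole]
    # Shared lemmas: largest gain (corpus rank minus subcorpus rank) first.
    shared = sorted(
        (lemma for lemma in sub if lemma in whole),
        key=lambda lemma: whole[lemma] - sub[lemma],
        reverse=True,
    )
    return only_in_sub + shared


# Toy frequencies invented for the example; top_n is reduced to 3 to keep it small.
sub_freqs = {"девочка": 50, "и": 40, "проблема": 30, "в": 20}
corpus_freqs = {"и": 900, "в": 800, "проблема": 100, "девочка": 10}

ordering = sort_by_rank_gain(sub_freqs, corpus_freqs, top_n=3)
print(ordering)
```

Here девочка is listed first because it reaches the top three only in the subcorpus; проблема keeps its rank and и loses one position, so they follow in that order.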

A new corpus, "Russian classics", is available. It includes poetic, prosaic, journalistic and epistolary works from representative academic editions of Russian classical writers of the 19th and early 20th centuries: Pushkin, Baratynsky, Gogol, Tolstoy, Turgenev, Chekhov and others. A significant part of these texts is also included in the Main or Poetry corpus. Currently the corpus is in beta version ("Russian classics β"). New authors and works are to be added later. The size of the corpus is more than 17.5 million tokens.

The interface of the Birchbark letter corpus has been substantially updated, and the corpus is connected to the Overview feature. Early Old East Slavic lemmas are available for search (not only слати, but also сълати 'send'). An important innovation is that the original and translated texts are now shown in two columns, and a translation (Russian or either of the two English variants) can be chosen to be displayed.

The functionality of the main corpus has been significantly upgraded. It now features lexical and grammatical annotation with automatic homonymy resolution, as well as automatic syntactic annotation, so grammatical homonyms within the main corpus are disambiguated. The corpus is also searchable by syntactic parameters such as types of compound sentences, predicative phrases (clauses), complements, copulas, and many others. With this new annotation, all the functions that appeared earlier in the corpus of regional media are now available in the main corpus: Searching for collocations, Frequency dictionary, and the Frequency type of search results.

In addition, the main and newspaper corpora are now searchable for lemmas and word forms using regular expressions (β-version). They feature corpus and subcorpus statistics: overall size in texts and words, a geographical map (for the Regional media corpus only) and charts of metatextual attributes. These functions allow users to compare a given subcorpus with the bulk of the corpus, including visually.
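As a rough illustration of what a regular-expression search over lemmas does, here is a minimal Python sketch. It uses Python's own `re` module, whose syntax may differ from the dialect accepted by the corpus search form; the lemma list and the pattern are chosen only for the example.

```python
import re

# A small lemma list for illustration (not a corpus export).
lemmas = [
    "выискивать", "известие", "избранник", "пчелка",
    "невежественный", "стремглав", "умышлять",
]

# Match whole lemmas that start with the prefix "из".
pattern = re.compile(r"из\w+")

matches = [lemma for lemma in lemmas if pattern.fullmatch(lemma)]
print(matches)
```

Because `fullmatch` requires the whole lemma to match, выискивать is not returned even though it contains the sequence "ис"/"из"-like material further inside the word; only известие and избранник match.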

The interface of the Church Slavonic corpus has been substantially updated, and the corpus is connected to the Overview feature.

The multimedia corpus now contains 5.7 million tokens.
The parallel corpus now contains 168 million tokens. It features four new language pairs with Russian, namely two larger South Slavic subcorpora, Serbian and Slovene, as well as two smaller pilot corpora of Korean and Hindi, both coming with transliteration and dictionary support. The Korean and Hindi tiers include aligned poetic texts, a new feature within the parallel corpus. The Czech and Spanish language pairs have also been updated.

The interface of the Middle Russian corpus has been significantly updated; the corpus is connected to the Overview feature.

The Regional corpus has a new type of output: Frequency. With it, the statistical distribution of search results by lemmas, word forms and sets of grammatical features can be analyzed. The frequency is calculated on automatically disambiguated texts over a random subsample of 1 million search results. Users can set the confidence level in order to compare the confidence intervals of the frequencies.
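To show how a confidence level affects a frequency estimate taken from a random subsample, here is a minimal Python sketch. The RNC does not state here which interval it uses; the Wilson score interval below is an assumption chosen for illustration, and the counts are invented.

```python
from math import sqrt
from statistics import NormalDist
from typing import Tuple


def wilson_interval(hits: int, sample_size: int,
                    confidence: float = 0.95) -> Tuple[float, float]:
    """Wilson score interval for the share of `hits` in a random sample."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = hits / sample_size
    denom = 1 + z * z / sample_size
    centre = (p + z * z / (2 * sample_size)) / denom
    half_width = z * sqrt(
        p * (1 - p) / sample_size + z * z / (4 * sample_size ** 2)
    ) / denom
    return centre - half_width, centre + half_width


# Invented counts: a lemma accounts for 120 of 1,000 sampled search results.
low95, high95 = wilson_interval(120, 1000, confidence=0.95)
low99, high99 = wilson_interval(120, 1000, confidence=0.99)
print(f"95%: [{low95:.4f}, {high95:.4f}]")
print(f"99%: [{low99:.4f}, {high99:.4f}]")
```

Raising the confidence level widens the interval. Two frequencies can be treated as clearly different only when their intervals at the chosen level do not overlap.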

The Dialect corpus has been updated and now contains 604 thousand tokens.
SynTagRus has grown by 30 thousand tokens.

The corpus and subcorpus frequency dictionaries now feature 500 top lemmas rather than 100.