RNC News

Syntactic relations can now be searched in the Lemmas and tags search form in the Main and Regional corpora. You can specify both the syntactic features of a word and a syntactic relation between any two words. The Syntactic relation field appears in the Lemmas and tags search form when the query contains more than one word, starting from the second word. To see it, click “Add condition”. For example, the new functionality makes it possible to find out which educational institutions' students are mentioned most frequently in the Main corpus.
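To illustrate what a query on a syntactic relation between two words amounts to, here is a minimal sketch in Python. It is not the RNC's implementation or API: the `Token` structure, the `find_dependents` function, and the toy parse are all assumptions made for illustration, with relation labels following Universal Dependencies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    idx: int     # 1-based position in the sentence
    lemma: str
    head: int    # idx of the syntactic head, 0 for the root
    deprel: str  # Universal Dependencies relation label


def find_dependents(sentence, head_lemma, deprel):
    """Return lemmas of tokens attached to `head_lemma` by relation `deprel`."""
    by_idx = {t.idx: t for t in sentence}
    return [
        t.lemma
        for t in sentence
        if t.deprel == deprel
        and t.head in by_idx
        and by_idx[t.head].lemma == head_lemma
    ]


# Hypothetical, hand-built parse of "студенты университета сдали экзамен"
# ("the university's students passed the exam").
sentence = [
    Token(1, "студент", 3, "nsubj"),
    Token(2, "университет", 1, "nmod"),
    Token(3, "сдать", 0, "root"),
    Token(4, "экзамен", 3, "obj"),
]

# Which nouns depend on "студент" as nominal modifiers?
dependents = find_dependents(sentence, "студент", "nmod")
```

Counting such dependents over many sentences would give the kind of frequency ranking mentioned in the example above.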

Please note that syntactic annotation in the RNC comes in two different formats: the SynTagRus format in the Syntactic corpus and the Universal Dependencies format in the Main, Educational, and Regional corpora. When you switch between corpora with different syntactic annotation, the syntactic parameters of the search query are not preserved.

For more information about the syntactic annotation in the RNC, see the Syntactic annotation page.

Following tradition, on the last working day of the outgoing year, the RNC team reflects on its achievements and introduces the latest additions to the Corpus.

In 2023, we did a lot of new things: we introduced a new service called “Word at a glance”, released new models for automatic annotation named NeuroRNC, completely overhauled our user interface, and added new corpora and new tools for analysis and visualization.

We hope that everyone will discover tools here that enhance their experience and productivity when working with the RNC. May the New Year bring you many interesting discoveries!

Wishing you a prosperous New Year,

The RNC team

 

In December 2023, our team completed a large-scale project to migrate the RNC website to a new user interface. The project started in 2022, and users first saw the newly designed main page of https://ruscorpora.ru in May of last year. Throughout the project, we progressively rolled out the new search interface for all corpora and introduced various innovations and improvements to help our users complete their routine tasks faster and more efficiently.

Some of the key features of the new interface include:

1. Accessibility from mobile devices and the option to switch to the English version.

2. Features overview service for introducing the new interface to new audiences and highlighting the latest innovations.

3. Portraits of corpora, subcorpora, and words at a glance, providing users with diverse perspectives on the information.

4. Extensive visualization capabilities to visually represent complex information.

5. Quick access to standard tasks, such as shortcuts to main functionalities from the main page, information about the query and subcorpus in the page header, saving user preferences, short links to share results, and more.

For further details on these and other tasks that the new interface solves, read the article.

The Syntactic corpus (SynTagRus) is now available in the new interface!

Two search modes are available: Exact forms and Lemmas and tags. In the Lemmas and tags search form, a compound field “Syntactic relation” has been added, where the user can specify which word the current word is related to, select its role (depends/controls), and choose the type of relation. In the compound field “Lexical function”, the user can specify which word and which lexical function the current word is related to, select its role in the relation (argument/value), and choose the functional word. For example, if you set the lemma вести as the first word in the Lemmas and tags search and, for the second word, select the lexical function OPER1 with the role “argument”, you will see what one can вести (прием, переговоры, кампанию, дневник).

Two types of output are available in the corpus, concordance and KWIC. By clicking on the icon “Show the structure” or “Show structure with multiword expressions represented word by word”, the user can view the syntactic structure of a sentence in the form of a dependency tree.

The morphological and syntactic annotation of the Syntactic corpus differs slightly from the basic morphological and syntactic standard of the RNC. More details on the types of annotation can be found in the Corpus portrait and the Types of Annotation section.

The Syntactic corpus has also been expanded by 28 thousand tokens.

The Panchronic corpus now takes into account recent additions to its constituent corpora, namely the Old East Slavic corpus and the Birchbark letters corpus. It also includes all the inscriptions from the new corpus of East Slavic inscriptions. The lemma analysis of the Middle Russian texts within the Panchronic corpus has been corrected and updated (about 3000 new lexemes). The table of correspondences between lemmas and grammatical features of different historical periods has been corrected and supplemented with new data. These correspondences now take parts of speech into account (for example, only the modern verb напасть, but not the homonymous noun, has the historical lemma напасти). In addition, it is now possible to customize a subcorpus within the Panchronic corpus by the genre category of the text: literary, ecclesiastical, everyday, business, or educational (one text can have several categories). This is important for studying the evolution of vocabulary and of grammatical parameters that strongly depend on genre.
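The part-of-speech-aware correspondence described above can be pictured as a lookup keyed by both lemma and part of speech. The following is only an illustrative sketch: the table name, the function, and the `VERB`/`NOUN` tags are assumptions, and the single entry is the example given in the text.

```python
# Hypothetical correspondence table keyed by (modern lemma, part of speech).
HISTORICAL_LEMMAS = {
    ("напасть", "VERB"): "напасти",
}


def historical_lemma(lemma, pos):
    """Return the historical lemma for a modern lemma and POS, or None."""
    return HISTORICAL_LEMMAS.get((lemma, pos))


verb_match = historical_lemma("напасть", "VERB")
noun_match = historical_lemma("напасть", "NOUN")  # homonymous noun: no entry
```

Keying on the pair rather than on the lemma alone is what keeps the homonymous noun from being mapped to the verb's historical lemma.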

The Regional corpus has been enlarged to 35.5 million tokens. It now includes texts from five new newspapers and a large collection of Voronezh Oblast media prepared by the staff of Voronezh State University. In the newly added texts, grammatical homonymy was resolved and syntactic annotation was added. Keywords for the texts were generated using the NeuroRNC language model.

The Poetry corpus now contains over one hundred thousand poems; the corpus has grown by half a million tokens and is now close to 14 million. The works of ten poets have been added to the corpus. These include three volumes of poems by Samuil Marshak (including translations) and collections of poems by Bulat Okudzhava, Inna Lisnyanskaya, Yuri Kublanovsky, Timur Kibirov, and others.

All intellectual property used in the Corpus is available only for non-commercial use for research and teaching purposes. However, some users try to collect the whole Corpus by downloading output results, rather than using it as a source of examples of linguistic phenomena.

We want to limit the possibility of misuse of the Corpus, so we have changed some of the rules. Users who are not logged in can now download no more than 1000 examples.

If you want to download more examples, you need to log in to the Corpus. For logged-in users, the limit remains the same.

We would also like to remind you that offline versions of the Main and Syntactic corpora, as well as multilingual and diachronic datasets, are available. Read more about how to get them in the article Offline versions of the RNC.

On New Year’s Eve, we would like to give our users a gift and invite you to the Corpus Museum, which reconstructs the RNC interface of 2003!

At that time the Russian National Corpus included 20 million words. Simple search (form search) and advanced search (lemmas and tags search) were available in the corpus. A large group of linguists from Moscow, St. Petersburg and other Russian scientific centres took part in the development of the Russian National Corpus.

One of the people who inspired and created the Corpus was Ilya Segalovich (1964–2013), co-founder and Chief Technology Officer at Yandex. Ilya developed the Corpus's original simple interface, which can now be used to search the modern version of the Main corpus.

This coming Sunday, December 10, from 10:00 to 18:00 Moscow time, technical maintenance will be carried out on our servers.

As a result, the website may be briefly unavailable; interruptions will last no more than one hour.

Speech collections within the Accentological and Spoken corpora were expanded. Transcripts of academic and political talks, TV and radio broadcasts, personal oral history, and everyday dialogic speech have been added. The Spoken corpus now amounts to 14 million tokens, and the overall size of the Accentological corpus, including the naive poetry collection, is 134.8 million tokens.

The Parallel corpus was expanded by 3 million tokens. New texts were added for the language pairs of Russian with Czech, English, French, German, Portuguese, and Spanish. In particular, the English-Russian subcorpus was updated with a collection of transcripts of public TED Talks, while the Portuguese-Russian subcorpus has almost doubled in size and now also includes texts created in Portuguese-speaking Africa.

In the Social Networks corpus, genres are now marked automatically for all texts. Users can select one or more genres from the list. Several new genres have been identified, such as picture captions.

Properties generated by NeuroRNC are marked with a special icon. If you notice an inaccuracy or error, feel free to report it using the “Report an Error” button in the same window.

You can define a subcorpus within the Regional Media corpus using date intervals accurate to the day. For example, you can explore the use of the word милиция (“militia”) over a chosen period.

Within the same corpus, graphs are now plotted using days, months, or years as units of measurement. The default unit is now the month. You can switch between days, months, and years on charts. The option is available in the search results and in the Get overview, Compare queries, and Word at a glance functions.
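To make the effect of switching units concrete, here is a minimal Python sketch of grouping dated occurrences by day, month, or year. It is not how the RNC builds its charts: the function name, the unit names, and the dates are invented for illustration.

```python
from collections import Counter
from datetime import date

# strftime patterns for each unit of measurement (assumed unit names).
FORMATS = {"day": "%Y-%m-%d", "month": "%Y-%m", "year": "%Y"}


def bucket_counts(dates, unit="month"):
    """Count occurrences per day, month, or year; month is the default."""
    fmt = FORMATS[unit]
    return Counter(d.strftime(fmt) for d in dates)


# Toy publication dates of texts containing a search hit (made-up data).
hits = [date(2011, 2, 27), date(2011, 2, 28), date(2011, 3, 1), date(2012, 3, 1)]

by_month = bucket_counts(hits)
by_year = bucket_counts(hits, "year")
by_day = bucket_counts(hits, "day")
```

The same list of hits yields a coarser or finer curve depending on the unit, which is why day-level dating of texts matters for short-term changes in usage.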