RNC News

05.03.2024

In February, we significantly improved the National media corpus.

The corpus gained new texts totaling 49.6 million tokens. These are print media from the 1990s ("Nezavisimaya Gazeta", including weekly supplements, "Moskovsky Komsomolets", and "St. Petersburg Vedomosti").

In all the texts of the corpus, grammatical homonymy has been resolved automatically and syntactic relations have been annotated (the Syntactic relation field appears from the second token onward after clicking "Add condition"). As a result, all the latest functions already available in the Main and Regional media corpora, such as search by syntactic relations and properties, collocation search, the frequency dictionary, and the frequency of query results, are now also available in the National media corpus, the largest of the three.

The RNC Media corpus is now the world's largest Russian online corpus with the ability to search by syntactic relations!

In the subcorpus form, it is now possible to select texts by topic and type. These fields are annotated with a RuRoBERTa model further trained on the Regional corpus data. Fields in the subcorpus form and in the text information whose values were generated by NeuroRNC are marked with a special icon. Automatic annotation may contain errors, so the text information pop-up window has a "Report an error" button. Please inform us of any inaccuracies or errors in the assignment of topics and types.

13.02.2024

The Russian Classics corpus was expanded by more than 1 million tokens. The complete works of Alexander Radishchev and Ivan Krylov were added, as well as some texts by authors already represented in the corpus that had been omitted from the previous release. Diachronic graphs are available, queries can be compared, and a subcorpus can be customized by both genre and date. Search results can now be sorted by date of creation, by author, and by genre.

13.02.2024

The Corpus portrait now features diachronic statistics for the Main and Regional corpora. The distribution of corpus size and metatextual parameters by creation date is shown on a graph. Within the Regional corpus, the distribution of texts by country and region is also available diachronically.

To see the diachronic graphs, click the (i) button in the corpus header, select Statistics, and navigate to Diachronic statistics.
The user may specify smoothing, date range, and distribution (axis step); the choice is applied at once to all the graphs on the page.
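
As a rough illustration of what a smoothing setting does to a diachronic graph, the sketch below applies a centered moving average to yearly counts. The function name `smooth`, the meaning of `window`, and the sample counts are assumptions made for illustration only; the smoothing actually used by the RNC may work differently.

```python
def smooth(counts_by_year, window):
    """Average each year's value with up to `window` neighbouring years on each side.

    Years missing from the input are simply skipped (an assumption of this sketch).
    """
    smoothed = {}
    for year in sorted(counts_by_year):
        neighbours = [
            counts_by_year[y]
            for y in range(year - window, year + window + 1)
            if y in counts_by_year
        ]
        smoothed[year] = sum(neighbours) / len(neighbours)
    return smoothed


# Hypothetical yearly counts, not real RNC data.
sample = {1990: 10, 1991: 40, 1992: 10, 1993: 40}
print(smooth(sample, 1))
```

With a window of 0 the curve is unchanged; larger windows flatten year-to-year spikes at the cost of blurring short-lived changes.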

25.01.2024

Syntactic relations are now available for search in the Lemmas and tags search form in the Main and Regional corpora. Both the syntactic features of a word and a syntactic relation between any two words can be specified. The Syntactic relation field is available in the Lemmas and tags search when the search form contains more than one word, starting from the second one. To see it, click “Add condition”. For example, the new functionality makes it possible to determine which educational institutions' students are mentioned most frequently in the Main corpus.

Please note that syntactic annotation is provided in the RNC in two different formats: the SynTagRus format in the Syntactic corpus and the Universal Dependencies format in the Main, Educational, and Regional corpora. When switching between corpora with different syntactic annotation, syntactic parameters are not preserved in the search query.

For more information about the syntactic annotation in the RNC, see the Syntactic annotation page.
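
To make the idea of a syntactic relation between two words concrete, here is a minimal sketch that reads one sentence in the CoNLL-U layout used by Universal Dependencies and lists the dependents attached to a given head lemma by a given relation. The sample sentence, its annotation, and the helper name `dependents` are illustrative assumptions, not RNC data or an RNC API.

```python
# One sentence in CoNLL-U: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC.
# Hand-annotated for illustration: "Студенты университета читают книги".
CONLLU = "\n".join([
    "1\tСтуденты\tстудент\tNOUN\t_\t_\t3\tnsubj\t_\t_",
    "2\tуниверситета\tуниверситет\tNOUN\t_\t_\t1\tnmod\t_\t_",
    "3\tчитают\tчитать\tVERB\t_\t_\t0\troot\t_\t_",
    "4\tкниги\tкнига\tNOUN\t_\t_\t3\tobj\t_\t_",
])


def dependents(conllu, head_lemma, relation):
    """Return lemmas of tokens linked to `head_lemma` by the dependency `relation`."""
    rows = [
        line.split("\t")
        for line in conllu.splitlines()
        if line and not line.startswith("#")  # skip blank lines and comments
    ]
    lemma_by_id = {row[0]: row[2] for row in rows}
    return [
        row[2]
        for row in rows
        if row[7] == relation and lemma_by_id.get(row[6]) == head_lemma
    ]


print(dependents(CONLLU, "студент", "nmod"))
```

Counting such dependents over many sentences is one way to approach a question like which institutions' students are mentioned most often.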

29.12.2023

Following tradition, on the last working day of the departing year, the RNC team reflects on its achievements and introduces the latest additions to the Corpus.

In 2023, we did a lot of new things, including the introduction of a new service called “Word at a glance”, new models for automatic annotation named NeuroRNC, the complete overhaul of our user interface, and the introduction of new corpora and new tools for analysis and visualization.

We hope that in this picture everyone will discover tools that enhance their experience and productivity while working with the RNC. May the New Year bring you many interesting discoveries!

Wishing you a prosperous New Year,

The RNC team

27.12.2023

In December 2023, our team completed a large-scale project to migrate the RNC website to a new user interface. The project started in 2022, and users were introduced to the newly designed main page of https://ruscorpora.ru for the first time in May of last year. Throughout the project, we progressively implemented the new search interface for all corpora and introduced various innovations and improvements to help our users complete their routine tasks faster and more efficiently.

Some of the key features of the new interface include:

1. Accessibility from mobile devices and the option to switch to the English version.

2. Features overview service for introducing the new interface to new audiences and highlighting the latest innovations.

3. Portraits of corpora, subcorpora, and words at a glance, providing users with diverse perspectives on the information.

4. Extensive visualization capabilities to visually represent complex information.

5. Quick access to standard tasks, such as shortcuts to main functionalities from the main page, information about the query and subcorpus in the page header, saving user preferences, short links to share results, and more.

For further details on these and other tasks that the new interface solves, read the article.

26.12.2023

The Syntactic corpus (SynTagRus) is now available in the new interface!

Users can use the Exact forms search and the Lemmas and tags search. In the Lemmas and tags search form, a compound field “Syntactic relation” has been added, where the user can specify which word the current word is related to and select its role (depends/controls) and the type of relation. In the compound field “Lexical function”, the user can specify which word and which lexical function the current word is related to and select its role in the relation (argument/value) and the functional word. For example, by setting the lemma вести ('to conduct, to keep') as the first word in the Lemmas and tags search and selecting the lexical function OPER1 with the role of argument for the second word, you will see what you can вести: прием ('reception'), переговоры ('negotiations'), кампанию ('campaign'), дневник ('diary').

Two types of output are available in the corpus, concordance and KWIC. By clicking on the icon “Show the structure” or “Show structure with multiword expressions represented word by word”, the user can view the syntactic structure of a sentence in the form of a dependency tree.

The morphological and syntactic annotation of the Syntactic corpus differs slightly from the basic morphological and syntactic standard of the RNC. More details on the types of annotation can be found in the Corpus portrait and the Types of Annotation section.

The Syntactic corpus has also been enriched by 28 thousand tokens.

26.12.2023

The Panchronic corpus now takes into account recent additions to its constituent corpora, namely the Old East Slavic corpus and the Birchbark letters corpus. It also includes all the inscriptions from the new corpus of East Slavic inscriptions. The lemma analysis of the Middle Russian texts within the Panchronic corpus has been corrected and updated (about 3000 new lexemes). The table of correspondences between lemmas and grammatical features of different historical periods has been corrected and supplemented with new data. These correspondences now take parts of speech into account (for example, only the modern verb напасть, but not the homonymous noun, has the historical lemma напасти). In addition, within the Panchronic corpus it is now possible to customize a subcorpus by the genre category of a text: literary, ecclesiastical, everyday, business, or educational (one text can have several categories). This is important for studying the evolution of vocabulary and grammatical parameters that strongly depend on genre.

The Regional corpus has been enlarged to 35.5 million tokens. It now includes texts from five new newspapers and a large collection of Voronezh Oblast media prepared by the staff of Voronezh State University. In the newly added texts, grammatical homonymy was resolved and syntactic annotation was added. The keywords for the texts were generated using the NeuroRNC language model.

The Poetry corpus now contains over one hundred thousand poems; the size of the corpus has grown by half a million tokens and is now close to 14 million. The works of ten poets have been added to the corpus. These include three volumes of poems by Samuil Marshak (including translations) and collections of poems by Bulat Okudzhava, Inna Lisnyanskaya, Yuri Kublanovsky, Timur Kibirov, and others.

26.12.2023

All intellectual property used in the Corpus is available only for non-commercial use for research and teaching purposes. However, some users try to collect the whole Corpus by downloading output results, rather than using it as a source of examples of linguistic phenomena.

We want to limit the possibility of misuse of the Corpus, so we have changed some of the rules. Users who are not logged in can now download no more than 1000 examples.

If you want to download more examples, you need to log in to the Corpus. For logged-in users, the limit remains the same.

Also, please remember that you can obtain an offline version of the Main and Syntactic corpora, as well as multilingual and diachronic datasets. Read more about how to do this in the article Offline versions of the RNC.

15.12.2023

On the eve of the New Year, we would like to give our users a gift and invite you to the Corpus Museum, which reconstructs the RNC interface of 2003!

At that time the Russian National Corpus included 20 million words. Simple search (form search) and advanced search (lemmas and tags search) were available in the corpus. A large group of linguists from Moscow, St. Petersburg and other Russian scientific centres took part in the development of the Russian National Corpus.

One of the inspirers and creators of the Corpus was Ilya Segalovich (1964–2013), co-founder and Chief Technology Officer at Yandex. Ilya developed the Corpus's original simple interface, which now allows searching the modern version of the Main corpus.
