About the RNC

The Russian National Corpus covers the period from the first East Slavic manuscripts of the 11th century to the first decades of the 21st century. It represents both the language of previous eras and modern Russian in its sociolinguistic varieties: standard, substandard, colloquial, dialectal. The RNC includes, in particular, fiction texts (prose, poetry, drama, recorded movie dialogues) of cultural as well as linguistic significance. But the RNC is by no means only a corpus of belles-lettres or a model of exemplary language. The collection of texts represents speech genres in all their diversity: memoirs, essays, journalism, popular science and academic literature, recordings of public speeches and private conversations, letters, messages, diaries, blogs, private documents, amateur poetry, etc.

For more details, see Structure of the RNC.

Our team

The project involves specialists from the Vinogradov Russian Language Institute [IRL RAS], HSE University [HSE], Kharkevich Institute for Information Transmission Problems [IITP RAS], Institute for Linguistic Studies [IL RAS] in St. Petersburg, and Voronezh State University. Yandex has been providing institutional and software support from the very start of the project. Representatives of many other institutions, as well as independent researchers, undergraduate and graduate students, and volunteers, have participated in the project at different times. Detailed information about the staff who took part in the development of the RNC throughout its history can be found on the Our Team page.

For more information about the early stage of the history of the RNC, see Dmitry Sichinava, The National Corpus of the Russian Language: An Outline of Prehistory, published in The Russian National Corpus: 2003-2005. Results and prospects (Moscow, 2005).

The RNC software

The platform of the Russian National Corpus includes tools used for preparing and indexing the texts and for searching the corpora.

The corpus managers regularly prepare data for updating and augmenting the corpora using metatextual and grammatical annotation software, which includes a large set of specialized tools for each of the corpora. For more information about the annotation principles and tools, see Structure of the RNC.

For each update, the texts prepared by linguists are automatically indexed with MyStem (developed by Yandex) for the Russian language, with widespread non-standard grammatical and spelling variants taken into account. Special versions of MyStem for other languages are used for the parallel corpora and for the Russian semantic dictionary.

A number of corpora are additionally processed by Rubic, which resolves grammatical homonymy (that is, identifies the preferred grammatical tags) and produces a syntactic analysis of sentences.

Language models of the NeuroRNC family are used to identify similar words and morphemic structure and to annotate keywords and genres.

Online search in the RNC relies on Elasticsearch and Yandex Search, as well as additional plugins developed for linguistic search.

Software architects and developers from the institutions on the RNC team have taken part in creating and improving the platform.

Grants

In 2020–2023, the RNC was developed with financial support from the Ministry of Science and Higher Education of the Russian Federation under Agreement No. 075-15-2020-793 “A new generation linguistic software platform for digital documentation of the Russian language: infrastructure, resources, academic research”.

In 2015–2021, the RNC received financial support from several foundations: the Russian Foundation for Basic Research (RFBR), the Russian Humanitarian Scientific Foundation (RHSF), the Russian Science Foundation (RSF) and the Department of Historical and Philological Sciences of the Russian Academy of Sciences. Listed below are the grants awarded for the RNC as a whole or for several major corpora. Grants awarded for individual smaller corpora within the RNC are listed in Structure of the RNC, in the subsections on the corresponding corpora.

  • RHSF project No. 15-04-12018 “Development of specialized modules of the RNC”    
  • RFBR project No. 17-29-09154 “Trends in the development of language system: a corpus-based study of synchronic variation and diachronic change in different text types”
  • RFBR project No. 19-07-00842 “Development of a corpus of Russian texts with morphosyntactic, lexical functional, anaphoric and temporal annotation” 
  • Fundamental Research Program of the Section of Literature and Language at the Department of Historical and Philological Sciences of the Russian Academy of Sciences "Language and Information Technologies" (2015–2017)
  • Comprehensive Program of Fundamental Research of the Section of Literature and Language at the Department of Historical and Philological Sciences of the Russian Academy of Sciences "Heritage of Eurasia and its modern meanings" (2015–2017)
  • Fundamental research program of the Presidium of the Russian Academy of Sciences "Monuments of material and intellectual culture in the modern information environment" (2018)

In 2011–2014, the RNC was supported by the research program of the Presidium of the Russian Academy of Sciences “Corpus linguistics” (No. 36-P).

Earlier, in 2003–2010, the RNC was supported by:

  • the Department of Historical and Philological Sciences of the Russian Academy of Sciences within the programs "Philology and Informatics" (2003–2006), "Russian Language, Literature and Folklore in the Information Society: Formation of Electronic Scientific Funds" (2006–2009), "Genesis and Interaction of Social, Cultural and Linguistic Communities", and "Text in Interaction with Sociocultural Environment: Levels of Historical, Literary and Linguistic interpretation"
  • the Presidium of the Russian Academy of Sciences within the program "Historical and cultural heritage and spiritual values of Russia" (2009–2012)
  • the Russian Humanitarian Scientific Foundation (grants No. 03-04-00226a, 06-04-03817v, 06-04-03818v, 08-04-12127v, 09-04-12159v, 15-04-12018v)
  • the Russian Foundation for Basic Research (grants No. 06-06-80133a, 08-06-00371a, 15-06-04334a)
  • the Federal grant program "Russian language" of the Federal Agency for Education (state contracts No. 1028, 890, 608 signed on 14.12.2006, No. 219 signed on 18.06.2007, No. 66 signed on 11.04.2008)
