RNC News

A new corpus has been added to the Russian National Corpus: the General Internet Corpus of Russian (VKontakte). It contains texts from the VKontakte social network covering the period from 2007 to early 2022. The new corpus adds 11.3 billion word tokens, increasing the total size of the RNC more than sixfold, from 2.2 to 13.5 billion word tokens.

A distinctive feature of the new corpus is the sociolinguistic annotation of texts: each text is assigned author-related parameters such as gender, age, and place of residence. This makes it possible to study regional, age-related, and gender-based features of Russian on very large volumes of data and to draw statistically significant conclusions about different varieties of the language.