Corpus portrait

637,051,318 texts
11,289,617,320 words

disambiguated

Corpora: GICR (VK) β

The Internet Corpus of social media as part of the RNC

The Russian National Corpus includes a collection from the General Internet Corpus of Russian, VKontakte section (hereafter GICR VK). This is the largest corpus within the RNC, containing more than 11 billion word tokens. It is based on a segment of another Russian corpus project with a long history: the General Internet Corpus of Russian (hereafter GICR, or the General Corpus).

The GICR VK corpus, currently in beta, became part of the RNC through the joint efforts of the developers of both projects. The conceptual foundations of the General Corpus and the National Corpus differ, and the inclusion of GICR VK therefore expands the range of possible applications of the RNC.

The two projects rest on different principles, representativeness and differentiality, and this difference is reflected, among other things, in their size. These are two distinct approaches to increasing the reliability of corpus-based research; they do not contradict each other, but rather complement one another. Representativeness, the guiding principle of the National Corpus, means balanced genre coverage of texts: fiction, journalism, scholarly writing, everyday communication, and others. The differential approach, the guiding principle of the General Corpus, focuses on identifying variation in language. It involves annotating a large corpus that is substantially less diverse in genre (in the case of GICR VK, a social media corpus) with a system of differential metatextual features, including sociolinguistic and other parameters. When sufficiently large volumes of text are available for different values of these parameters, it becomes possible to identify biases in corpus output and to determine the distinctive features of sociolects.

The General Corpus as a whole includes several social media segments. The segment incorporated into the Russian National Corpus is based on VKontakte user posts from 2007 to early 2022. The total size of this corpus in GICR exceeds 15 billion words; the beta version included in the RNC is somewhat smaller, at 11.3 billion words. In addition to the date of writing, the texts are annotated with the author’s gender, age, place of birth, and place of residence: city, region, and country. These data correspond to the information provided in the user’s social media profile. At the same time, the texts are anonymized: users’ names or pseudonyms are not included in the corpus. Morphological homonymy in the corpus has been resolved using state-of-the-art technologies as of 2022, together with a dictionary used to improve lemmatization.

One characteristic feature of social networks, and VKontakte in particular, is the substantial share of “non-authored” posts and fake profile data. This problem is becoming increasingly significant as the proportion of posts generated fully or partly automatically continues to grow. The GICR developers have made considerable efforts to filter out such posts; however, when a text is only partly non-authored, detecting this property becomes much more difficult. This means that researchers using a large internet corpus should not rely blindly on quantitative data: additional analysis of search results is always advisable. Studies show that about 10% of the output may be irrelevant for one reason or another, and this margin of error should be kept in mind when interpreting counts.
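
One way to act on this advice is to hand-check a random sample of search hits and scale the raw count accordingly. The sketch below is only an illustration: the function name, the example figures, and the choice of a normal-approximation (Wald) interval are assumptions, not part of the RNC or GICR tooling.

```python
import math


def estimate_relevant_hits(total_hits, sample_size, sample_relevant, z=1.96):
    """Scale a raw hit count by the relevant share found in a hand-checked sample.

    Returns (estimate, lower, upper), where the bounds come from a
    normal-approximation (Wald) interval for the relevant share.
    """
    if sample_size <= 0 or not 0 <= sample_relevant <= sample_size:
        raise ValueError("sample counts are inconsistent")
    share = sample_relevant / sample_size
    std_err = math.sqrt(share * (1 - share) / sample_size)
    lower = max(0.0, share - z * std_err)
    upper = min(1.0, share + z * std_err)
    return total_hits * share, total_hits * lower, total_hits * upper


# Hypothetical figures: 10,000 raw hits, 200 checked by hand, 180 judged relevant.
estimate, low, high = estimate_relevant_hits(10_000, 200, 180)
print(f"relevant hits: about {estimate:.0f} (between {low:.0f} and {high:.0f})")
```

The interval is rough for small samples or shares close to 0 or 1; a larger hand-checked sample narrows it.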

The General Corpus is a valuable tool for studying diachronic, sociolinguistic, and geographical variation in 21st-century Russian. The texts included in the corpus represent different territories where Russian is used in writing — not only countries where Russian is widely used as a native or second language, but also, in effect, the entire world. The corresponding annotation is differentiated down to the regional level, that is, to the first-level administrative unit. As a result, statistical information is available on the geographical distribution of lexical and grammatical regionalisms and dialect features. Social media posts are dated to the exact month, making it possible to trace language change across a vast body of material and within microdiachronic intervals, and to identify the date and place of origin as well as the subsequent history of new borrowings, Russian neologisms, word-formation models, productive substandard constructions, and internet memes.

Accordingly, the corpus provides the Graph tool, with a one-month time step, and the Statistics tool showing the diachronic, age-related, gender-based, and geographical distribution of linguistic phenomena. It should be kept in mind that not all regional subcorpora are large or representative enough, and some “exotic” locations may be playful or fictitious in nature. For example, a high IPM (instances per million words) value for a given administrative unit in South America is less informative than the same value for regions in East Slavic countries. Among texts for which the author’s gender is specified, 60% were written by women.
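
The caveat about small subcorpora follows from how a per-million frequency is computed. The sketch below assumes the usual definition of IPM as hits divided by subcorpus size, scaled to one million words; the function name and all figures are hypothetical and are not taken from the corpus.

```python
def ipm(hits, subcorpus_words):
    """Instances per million words: hit count normalized by subcorpus size."""
    if subcorpus_words <= 0:
        raise ValueError("subcorpus_words must be positive")
    return hits * 1_000_000 / subcorpus_words


# Hypothetical subcorpora: the IPM values match, but the evidence behind them does not.
large_region = ipm(hits=5_000, subcorpus_words=1_000_000_000)
small_region = ipm(hits=1, subcorpus_words=200_000)
print(large_region, small_region)
```

Both calls give the same IPM, yet the second rests on a single occurrence, so one playful or fictitious profile location is enough to produce it.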

Creation of the General Corpus

The GICR project was developed over many years with the participation of students from the departments of computational linguistics at the Russian State University for the Humanities and the Moscow Institute of Physics and Technology.

Authors of the idea and academic supervisors:

Vladimir Belikov
Vladimir Selegey
Serge Sharoff

Programmers:

Nikolai Kopylov, lead programmer
Ilya Raskin, filtering
Maria Ponomareva, annotation
Yury Kuratov, statistics
Sergei Gladilin, adaptation to the RNC architecture
Anton Kazennikov, RNC search engine
Dmitry Morozov, integration into the RNC
Pavel Dyachenko, integration into the RNC

Project managers, in different years:

Tatyana Shavrina
Alexandra Ivoylova
Daniil Selegey
Anastasia Kozerenko

Publications

A list of scientific publications on the General Corpus is available at https://ruscorpora.ru/en/corpus/gicr/publications. To find other types of publications related to the corpus, use the filters in the "Publications" section.

Updated on 30.06.2026