The Structure of the RNC

The Russian National Corpus is a collection of individual corpora, each assembled to address a different set of linguistic tasks. Each of these collections of texts is large and representative, which makes them valuable for both quantitative and qualitative research. The tasks a corpus is meant to serve determine its structure and the type of annotation used in it.

The main corpus is the most general reference corpus. It comprises prose texts created after 1700, whether printed, handwritten, or (later) electronic. Its annotation includes no task-specific features. The same is true of the separate media corpus, which is the largest one within the RNC. It consists of media publications that have appeared since the 1980s.

Meanwhile, the poetry corpus is normally used for more specialized research on verse, so it has a special annotation describing the key concepts of versification: meter and rhythm. The accentological corpus is a primary tool for studying Russian word stress, and stress marking is a key feature of its annotation. The spoken corpus also comes with word stress and other parameters of audible speech. The social networks corpus is, in a sense, intermediate between a written and a spoken corpus. Here, texts are less bound by the restrictions of the standard norm and make extensive use of emoticons (emoji), a specific sign subsystem. In the multimedia corpus, the texts are accompanied by synchronized video or audio recordings; for several films, even gestures are annotated. In the syntactic corpus, every sentence has a special complex annotation of its syntactic structure. The educational corpus features morphological and metatextual markup aligned with the Russian school curriculum. The corpus From 2 to 15 is dedicated to texts read by children and teenagers, automatically annotated with the presumed age of their audience.

In addition to texts in standard contemporary Russian, the RNC seeks to present the Russian language in its historical and geographical diversity. In particular, there are several historical corpora within the RNC: separate collections of texts represent the Old East Slavic language (the common ancestor of the Russian, Ukrainian, and Belarusian languages, from the 11th to the 14th century), Middle Russian (the language of the 15th to the 17th centuries), and Old Church Slavonic in its Russian version. Another historical corpus collects birch-bark letters of the 11th to 15th centuries. In addition, the main corpus includes texts of the 18th century, written before Karamzin and Pushkin. These Early Modern Russian texts are far from clear to contemporary readers. The same can even be said of the language of classical Russian literature, which is often obscure and is featured in a separate corpus of Russian classics. A unified search service covering the historical and contemporary corpora, the panchronic corpus, has been launched; it makes it possible to trace the history of a given word or grammatical construction across several centuries.

The dialect corpus includes oral texts recorded from speakers of traditional Russian dialects throughout Russia, in phonetic notation, preserving all the peculiarities of their vocabulary and grammar. The corpus of the regional press contains texts in standard Russian that differ little from those published in Moscow or St. Petersburg; nevertheless, references to local realities and local vocabulary can be traced there as well.

Most of the corpora included in the RNC are monolingual. One exception is the parallel corpus, which includes both original Russian texts accompanied by their translations into other languages and texts originally written in other languages and translated into Russian. The RNC includes several dozen language pairs and a multilingual corpus, where the same texts have translations into several languages. One of the historical corpora, the corpus of birch-bark letters, is also a parallel one: the Old East Slavic text is accompanied by its translations into modern Russian and English. Finally, the RNC includes a multimedia parallel corpus, which presents English-language films in Russian translation and versions of the same theater plays in English and Russian.
