The Structure of the RNC

The Russian National Corpus is a collection of individual corpora, each assembled to address a different set of linguistic tasks. Each of these collections of texts is large and representative, which makes them valuable for both quantitative and qualitative research. The tasks a corpus is meant to serve determine its structure and the type of annotation used in it. For example, the poetry corpus is normally used for research focused on poetry, so it has a special annotation describing the key concepts of versification: meter and rhythm. The accentological corpus is a primary tool for studying Russian word stress, so stress marking is a key feature of its annotation. The spoken corpus is also annotated for word stress and other features of spoken language. In the multimedia corpus, the texts are accompanied by synchronized video or audio recordings; for several films, even gestures are annotated. In the syntactic corpus, every sentence carries a detailed annotation of its syntactic structure. The educational corpus features morphological and metatextual markup aligned with the Russian school curriculum.

In addition to texts in standard contemporary Russian, the RNC seeks to present the Russian language in its historical and geographical diversity. In particular, there are several historical corpora within the RNC: separate collections of texts represent Old East Slavic (the common ancestor of Russian, Ukrainian and Belarusian, from the 11th to the 14th century), Middle Russian (the language of the 15th to the 17th centuries) and Old Church Slavonic in its Russian version. Another historical corpus collects birch-bark letters of the 11th to 15th centuries. In addition, the main corpus includes texts of the 18th century, some written even before Karamzin and Pushkin. These Early Modern Russian texts are far from clear to contemporary readers (the same can be said even of the language of classical Russian literature, which is often obscure). We are preparing to launch a unified search service across the historical and contemporary corpora, which will make it possible to trace the history of a given word or grammatical construction across several centuries.

The dialect corpus includes oral texts recorded from speakers of traditional Russian dialects throughout Russia, in phonetic notation that preserves all the peculiarities of their vocabulary and grammar. The corpus of the regional press contains texts in standard Russian that differ little from those published in Moscow or St. Petersburg; nevertheless, references to local realities and items of local vocabulary can be found there, too.

Most of the corpora included in the RNC are monolingual. One exception is the parallel corpus, which includes both original Russian texts accompanied by their translations into other languages and texts originally written in other languages and translated into Russian. The RNC covers several dozen language pairs and also includes a multilingual corpus, in which the same texts have translations into several languages. One of the historical corpora, the corpus of birch-bark letters, is also a parallel one: the Old East Slavic text is accompanied by translations into modern Russian and English. Finally, the RNC includes a multimedia parallel corpus, which presents English-language films in Russian translation and versions of the same theater plays in English and Russian.
