What is the Corpus?

A corpus is a reference system based on an electronic collection of texts composed in a certain language. A national corpus represents that language at a stage (or several stages) of its development in all the variety of genres, styles, territorial and social variants of usage, etc.

A national corpus is created by linguists (specialists in corpus linguistics, a fast-developing discipline) for academic research and language teaching. Most of the major world languages have their own corpora. A widely recognized example is the British National Corpus, which has served as a model for many modern corpora. Among the Slavic languages, the Czech National Corpus (compiled at Charles University in Prague) is notable.

A national corpus is distinguished by two features. First, it is a representative and well-balanced collection of texts. This means that the corpus contains, as far as possible, every type of written and spoken text found in the language: the various genres of fiction; journalistic, academic, and business texts; and dialectal and sociolectal texts. The proportion of each text type in the corpus reflects its share in real-life usage in the period concerned. A representative corpus is necessarily a large one (containing many millions of tokens). The planned size of the Russian National Corpus is 200 million words.

Second, a corpus contains additional information about the properties of the texts it includes. This is achieved by means of annotation. Annotation is the principal feature of a corpus, distinguishing it from simple collections (also known as 'libraries') of texts on the Internet, such as, for Russian, the Maksim Moshkov library or the Russian Virtual Library. Such libraries are not well suited to academic work on the nature of language, since they tend to focus on the content of texts rather than on their linguistic properties. The creators of the Corpus recognize the literary or scientific value of the texts, but treat it as a secondary feature. Unlike an electronic library, the National Corpus is not a collection of texts deemed 'interesting' or 'useful' in themselves; the texts in the Corpus are interesting and useful for the study of language. Such texts may include not only great works of literature but also works by a 'secondary' writer or a transcript of an ordinary conversation.

The academic and teaching value of a corpus depends on the variety of its annotation. The Russian National Corpus currently uses four types of annotation: metatextual (information about the text), morphological, accentual, and semantic; the introduction of syntactic annotation is planned for the near future. The system of annotation is constantly being improved.
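
To make the four annotation layers more concrete, the following is a minimal sketch of how an annotated text and its tokens might be represented and searched. The class names, field names, and tag values (`noun`, `plural`, `building`, `device`) are hypothetical and chosen only for illustration; they are not the Corpus's actual storage format or tag set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextMeta:
    """Metatextual annotation: information about the text as a whole."""
    author: str
    title: str
    year: int
    genre: str


@dataclass(frozen=True)
class Token:
    """One word occurrence together with its word-level annotation."""
    form: str                          # the word as written in the text
    lemma: str                         # morphological: dictionary form
    grammar: frozenset                 # morphological: grammatical tags
    stress: int                        # accentual: 0-based index of the stressed letter
    semantics: frozenset = frozenset() # semantic: meaning classes (hypothetical tags)


def find(tokens, lemma, grammar=(), semantics=()):
    """Return tokens with the given lemma that carry all requested tags."""
    g, s = frozenset(grammar), frozenset(semantics)
    return [t for t in tokens
            if t.lemma == lemma and g <= t.grammar and s <= t.semantics]


# Invented sample data: the written form "замки" is either "castles"
# (stress on the first syllable) or "locks" (stress on the ending).
meta = TextMeta(author="(unknown)", title="(sample)", year=1900, genre="fiction")
tokens = [
    Token("замки", "замок", frozenset({"noun", "plural"}), stress=1,
          semantics=frozenset({"building"})),
    Token("замки", "замок", frozenset({"noun", "plural"}), stress=4,
          semantics=frozenset({"device"})),
]

castles = find(tokens, "замок", grammar={"plural"}, semantics={"building"})
```

The example suggests why plain text alone is not enough: the two tokens are identical in spelling, and only the accentual and semantic layers tell them apart.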

The need for the corpus

The main purpose of the corpus is to facilitate academic research on the lexicon and grammar of a language, as well as on the subtle but constant processes of language change over a relatively short period: one to two centuries. Its other purpose is to serve as a reference source on lexical, grammatical, and accentological questions and on the history of the language. Modern information technology makes the processing of large volumes of text significantly simpler and faster, which makes large-scale statistical analysis of texts possible. As a result, language research now yields results that could previously only be guessed at. Today, truly scientific grammars and academic dictionaries must be based on corpora of their respective languages. The use of corpus data is desirable (if not always strictly necessary) in other, more specialized language research.
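
As a small illustration of the kind of statistical analysis described above, the sketch below computes how often a word occurs per million tokens in each century. Normalizing counts to instances per million is a common convention in corpus work; the function names and the tiny sample texts here are invented for the example and do not come from the Corpus.

```python
from collections import Counter


def century(year):
    """Map a year to its century number, e.g. 1800 -> 18, 1801 -> 19."""
    return (year - 1) // 100 + 1


def ipm_by_century(texts, word):
    """Relative frequency of `word` in instances per million tokens, per century.

    `texts` is an iterable of (year, list_of_tokens) pairs.
    """
    hits, totals = Counter(), Counter()
    for year, tokens in texts:
        c = century(year)
        totals[c] += len(tokens)
        hits[c] += sum(1 for t in tokens if t.lower() == word)
    # Skip centuries with no tokens to avoid dividing by zero.
    return {c: hits[c] / totals[c] * 1_000_000 for c in totals if totals[c]}


# Invented toy data; a real study would use millions of tokens.
texts = [
    (1790, ["сей", "день", "был", "ясен"]),
    (1995, ["этот", "день", "был", "ясным"]),
]
result = ipm_by_century(texts, "сей")
```

Normalizing by the size of each period's subcorpus matters because the periods are rarely represented by equal amounts of text, so raw counts are not directly comparable.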

Therefore, the main users of national corpora are linguists of various specializations. Nevertheless, the corpus is useful to non-linguists too. Reliable statistical information on language use in a certain period or by a certain author may be of interest to researchers in literature, history, and other humanities. National corpora are also useful for teachers of the language, whether as a native or a foreign language; language textbooks and teaching programs are increasingly oriented toward corpora. Foreigners, students, teachers, journalists, and writers can use a corpus to check how unfamiliar words are used. The corpus is thus aimed at anyone interested in the structure and usage of a language, whether professionally or not.

The development of the National Corpus

The Russian National Corpus primarily covers the period from the middle of the 18th century to the early 21st century. It represents the Russian language of this period, both past and present, in a wide range of sociolinguistic variants: literary, colloquial, vernacular, and, in part, dialectal. The Corpus includes original (non-translated) works of fiction (prose, drama, and poetry) that are of cultural importance and interesting from a linguistic point of view. Apart from fiction, the Corpus includes a large volume of other sources of written (and, for the later period, spoken) language: memoirs, essays, journalistic works, scientific and popular scientific literature, public speeches, letters, diaries, documents, etc.

The Russian National Corpus includes the following subcorpora:

The Deeply Annotated corpus, containing sentences with full morphological and syntactic structure markup,

The Parallel Corpora (English, German, Ukrainian, Belarusian, and multilingual), which facilitate searches for all translations of a given Russian or non-Russian word or phrase,

The Dialectal corpus, which includes recordings of dialectal speech from various regions of Russia and represents dialectal morphological variations,

The Poetry corpus, which facilitates searches not only by lexical and grammatical features but also by specifically poetic features, such as meter, rhyme types, etc.,

The Educational corpus, a corpus of texts with disambiguated grammatical homonyms, which was adapted for the Russian school teaching program,

The Corpus of Spoken Russian, which includes recordings of public and spontaneous spoken Russian and transcripts of Russian films (1930-2007).

