A corpus of texts is a set of electronically stored texts with advanced search tools offering a variety of options.
The Russian National Corpus was the first text corpus in the Russian language. Now it is the most balanced one (it contains texts of various genres approximately in the same proportion in which an ordinary native speaker encounters them) and has the greatest academic support (a large team of linguists is involved in the development of the corpus).
In an electronic library, one normally searches for books to read them online or to download them for offline reading. A text corpus is designed to search for usage examples of individual words or phrases. It is usually impossible to read or download full texts from a text corpus. On the other hand, all texts of the corpus are provided with linguistic annotation, which allows one to perform various complex search queries, impossible in an electronic library.
In the Russian National Corpus, you can:
- find the earliest occurrence of the word televidenie (1915!)
- find out which word was used more frequently in the 20th century: nado or nuzhno (nado occurs almost twice as often as nuzhno)
- find out when skuchat’ po nem was used (in the 1960s, such examples still occurred)
- find the most characteristic modifiers for the word xleb (nasushchnyj, rzhanoj, cherstvyj, pechenyj, pshenichnyj)
- see the top list of the verbs which can be repeated three times in a row (ljublju-ljublju-ljublju, spat’-spat’-spat’, shel-shel-shel)
- find out which author was the first to use the word volnitel’nyj (Leo Tolstoy)
- see what the word seledka rhymes with in Russian poetry (vodka, podmetka, podborodka, chetko, krotkij, lodka, skovorodka, seredka...)
- see which words were associated with the word sobes in the 20th century (zags, poliklinika, zhek, profkom — indeed, nobody called sobes a job interview then)
and do many other interesting things.
The total amount of texts indexed by Yandex or other search engines is much greater than that of the RNC. However, search engines are designed to help their users quickly find relevant information rather than to facilitate linguistic research. A search engine cannot provide an exact number of occurrences of a given word or phrase, does not let you search for general constructions that contain no specific words, and offers no exact information on who wrote and published each text, when, and where. A text corpus, by contrast, provides all of these options.
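The difference can be illustrated with a minimal Python sketch. It is not the RNC's software or API: the toy corpus, the author names, the years, the part-of-speech tags (PRED, V, N) and the function names are all assumptions made up for illustration. It only shows why texts that carry linguistic annotation and meta-information allow exact counts, date filters, and searches for a construction that names no specific word.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    form: str   # word as it appears in the text
    lemma: str  # dictionary form
    pos: str    # part-of-speech tag (hypothetical tag set)


@dataclass(frozen=True)
class Text:
    author: str    # meta-information: who wrote the text
    year: int      # meta-information: when it was written
    tokens: tuple  # annotated tokens of the text


# Invented toy data, not taken from the RNC.
TOY_CORPUS = [
    Text("Author A", 1925, (Token("nado", "nado", "PRED"),
                            Token("idti", "idti", "V"))),
    Text("Author B", 1970, (Token("nuzhno", "nuzhno", "PRED"),
                            Token("chitat'", "chitat'", "V"),
                            Token("knigi", "kniga", "N"))),
    Text("Author C", 1985, (Token("nado", "nado", "PRED"),
                            Token("spat'", "spat'", "V"))),
]


def count_lemma(corpus, lemma, start_year=None, end_year=None):
    """Return the exact number of tokens with the given lemma,
    optionally restricted to texts from a range of years."""
    total = 0
    for text in corpus:
        if start_year is not None and text.year < start_year:
            continue
        if end_year is not None and text.year > end_year:
            continue
        total += sum(1 for tok in text.tokens if tok.lemma == lemma)
    return total


def find_pos_sequence(corpus, pattern):
    """Find every run of tokens whose tags match `pattern`,
    without naming any specific word."""
    hits = []
    width = len(pattern)
    for text in corpus:
        for i in range(len(text.tokens) - width + 1):
            window = text.tokens[i:i + width]
            if all(tok.pos == pos for tok, pos in zip(window, pattern)):
                phrase = " ".join(tok.form for tok in window)
                hits.append((text.author, text.year, phrase))
    return hits


if __name__ == "__main__":
    print(count_lemma(TOY_CORPUS, "nado"))
    print(count_lemma(TOY_CORPUS, "nado", start_year=1950))
    print(find_pos_sequence(TOY_CORPUS, ("PRED", "V")))
```

Each hit returned by `find_pos_sequence` carries the author and year of its source text, which is the kind of information a general-purpose search engine does not reliably expose.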
The Russian National Corpus is being developed by two institutes of the Russian Academy of Sciences: Vinogradov Russian Language Institute and Kharkevich Institute for Information Transmission Problems, in cooperation with Yandex. In addition, a large team of linguists and software engineers from other academic organizations take part in the development of the RNC (see Project participants).
The RNC includes texts in Russian of various genres: fiction, poetry, newspapers, magazines, academic and technical texts, personal diaries and letters, transcripts of movies and recorded dialogues, etc.
The corpus is constantly updated and augmented. Currently it has more than 4 million texts, which in total contain almost one and a half billion words.
Texts are included in the Russian National Corpus irrespective of their compliance with the norms, as they are meant to reflect the rich diversity of the Russian language. Therefore, these texts may contain obsolete spellings, meanings and constructions, individual deviations from the norm, or even errors or misprints, if those were not noticed by editors or proofreaders at the time of the initial publication. Nowadays it is not the corpora that follow the norms but the norms that follow the corpora: linguists decide how to update the norms by studying text corpora, which document current usage.
The Russian National Corpus is developed in such a way that it stays relatively balanced in terms of genres and types of texts, and the addition of each new text is accompanied by laborious work on annotation. Therefore, you can send your suggestions to the makers of the corpus, but it cannot be guaranteed that any suggested text will be automatically added to the RNC.
The Russian National Corpus is a tremendous project, and like any other large text corpus it is not entirely error-free. It may contain typos in the original texts, OCR errors, incorrect parses (e.g. due to a word missing from the dictionary or to incorrect annotation), and inaccuracies in the meta-information about the texts. However, not everything that looks like an error is one. For example, in texts with unresolved ambiguity, all permissible morphological parses are given, some of which are wrong in the given context.
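The effect of unresolved ambiguity on search results can be shown with a small Python sketch. The data structure, tags and function name are hypothetical and do not reproduce the RNC's internal format. The example uses the word form stali, which can be either a past-tense form of the verb stat' ("to become") or a form of the noun stal' ("steel").

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Analysis:
    lemma: str  # dictionary form
    pos: str    # part-of-speech tag (hypothetical tag set)


@dataclass(frozen=True)
class AmbiguousToken:
    form: str        # word as it appears in the text
    analyses: tuple  # every permissible parse, not only the correct one


# "Sosedi stali druz'jami" ("The neighbours became friends").
# Only the verb reading of "stali" is right in this context, but with
# unresolved ambiguity both parses are stored on the token.
SENTENCE = (
    AmbiguousToken("sosedi", (Analysis("sosed", "N"),)),
    AmbiguousToken("stali", (Analysis("stat'", "V"), Analysis("stal'", "N"))),
    AmbiguousToken("druz'jami", (Analysis("drug", "N"),)),
)


def match_lemma(tokens, lemma):
    """Return the forms of all tokens for which ANY stored parse has the lemma."""
    return [tok.form for tok in tokens
            if any(a.lemma == lemma for a in tok.analyses)]


if __name__ == "__main__":
    print(match_lemma(SENTENCE, "stat'"))  # the intended verb reading
    print(match_lemma(SENTENCE, "stal'"))  # also matches, although no steel is meant
```

In this sketch a query for the noun stal' returns the token stali even though the sentence is not about steel. Such a hit is a consequence of the stored alternative parses rather than an annotation error.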
If you notice an error in the Russian National Corpus, we would be happy if you let us know. For spelling or parsing errors, left-click on the word and select Report an error, then in the window that opens below describe the error and click Send. For errors in the text meta-information, do the same thing after left-clicking on the title of the text.
Any error you report will be considered by the makers of the RNC. A correction may not appear immediately, but only at the next re-indexing of the corpus texts. Re-indexing is carried out approximately twice a year.
Some of the features of the Russian National Corpus are released in beta mode, and the team of corpora developers needs feedback from the users of the corpora to refine and improve them.
Next to such features on the RNC website you will see the Rate button. To take part in beta testing, click this button, select a rating, then, if needed, add a comment which can help improve this feature, and click Submit.
Try and evaluate different ways to use the feature: run several queries, set different parameters, evaluate the feature in several corpora.
All data published under https://www.ruscorpora.ru/ are available exclusively for non-commercial use for research and educational purposes (in accordance with Article 1274 of the Civil Code of the Russian Federation). They are not intended to be read or viewed, copied, or used in any other way as texts proper: they can only be used as sources of examples (citations) illustrating a particular linguistic phenomenon. When quoting examples obtained in the RNC, it is necessary to refer to the RNC as the source of examples and to include the name(s) of the author(s) of the text and its title.
The easiest way to contact the RNC team is to write an email to: info@ruscorpora.ru
Updated on 23.07.2024