Corpus portrait

35,960 texts
26,238,321 words

historical, disambiguated, syntactically tagged

Corpora: Russian classics β

What counts as a classic?

The corpus includes fictional, journalistic, and epistolary works, as well as business documents from the collected works of Russian classical writers.

At present, the corpus contains Russian-language texts by the following 16 authors, listed in chronological order:

Alexander Radishchev
Ivan Krylov
Vasily Zhukovsky
Alexander Griboyedov
Alexander Pushkin
Yevgeny Baratynsky
Mikhail Lermontov
Fyodor Tyutchev
Nikolai Gogol
Ivan Turgenev
Nikolai Nekrasov
Fyodor Dostoevsky
Mikhail Saltykov-Shchedrin
Leo Tolstoy
Nikolai Leskov
Anton Chekhov

The concept of a "classical author" is, of course, subjective to some extent: it is shaped by informal consensus as well as by literary scholarship, educational practice, and publishing tradition. This consensus may shift over time. For example, Krylov’s fables had already become school classics during his lifetime, whereas Baratynsky’s poetry was rediscovered only by the modernists. From the 1930s to the 1950s, Dostoevsky was not considered a "great writer" in the Soviet Union for ideological reasons, and Leskov was not widely seen as a major literary figure until the final decades of the 20th century.

One informal marker of “classical” status might be the absence of a label such as “poet” or “prose writer” in the entries of the authoritative biobibliographical dictionary Russian Writers. 1800–1917, which began publication in 1989. For most of the figures listed above, such labels are omitted, which implies that the target audience is presumed to have at least a general familiarity with the works of authors like Pushkin or Shchedrin.

What matters for our purposes is that these authors have complete or near-complete collected works published in the 20th–21st centuries, with a focus on maximally comprehensive coverage across various genres, and the exhaustive publication of both printed and manuscript versions.

Naturally, not every creative work by a “classical” author is itself a “classic.” For instance, Nekrasov’s plays and novels, or Griboyedov’s writings other than Woe from Wit, are little known to general audiences and have not influenced literature or language on a scale comparable to widely acknowledged masterpieces. However, for the study of an author’s language and style, every line matters (see more on this below).

Corpus objectives

The "Russian Classics" corpus holds a special status within the Russian National Corpus (RNC). On the one hand, it is a historical corpus, featuring works by authors from the late 18th to the 19th century. The most recent author by birth date included is Anton Chekhov (1860), and the latest texts were written in 1910, the final year of Leo Tolstoy’s life. Naturally, the Russian language has changed since then—much of what the classics wrote may seem outdated or unclear to a modern speaker without commentary, and their language cannot be regarded as “modern (Standard) Russian” without important qualifications.

Nevertheless, these texts remain highly relevant to the standard of Russian literary language and occupy a central place in its development. Both normative and descriptive grammars, as well as people’s intuitive sense of language norms, often refer to them. If we define the "literary language" as that which has been “refined by masters,” then the texts of those masters form the core of the Russian literary/standard language corpus. This corpus can (with certain caveats) be consulted as a normative, rather than merely usage-based, resource. It provides authoritative examples for academic grammars, dictionaries, and teaching materials.

Including all these texts in the Main Corpus would be a controversial decision, as it would disrupt its balance by genre and authorship. Tolstoy’s collected works alone account for about 7 million word usages, which would make up nearly 2% of the Main Corpus and introduce a significant skew toward a single author. Furthermore, a large portion of the classics’ collected works consists of literary texts, which the RNC traditionally limits to no more than 40% of the Main Corpus. In this specialized corpus, however, completeness is the priority, while balance by genre, author, or date is not. Finally, the creators of the Main Corpus tend to exclude drafts and edited versions of texts, which may nonetheless offer valuable linguistic material.

Therefore, the goal of this corpus is to present the legacy of Russian classical literature as fully as possible in the RNC, without the restrictions of the Main Corpus—gradually turning it into a corpus of 19th- to early 20th-century Russian literary language.

The purpose is to collect the complete body of works by classical Russian authors: not only literary texts, but also official, domestic, and other types of documents. The text annotation is therefore intentionally minimalist. It includes only the basic set of metadata used across all RNC corpora (general information such as author, title, date, genre, and prose/poetry), along with morphological and semantic annotation. Verse-specific annotation is not included here but can be found in a dedicated poetic corpus.

Another useful feature of this corpus is the ability to search within individual authors' works, which are more fully represented here than in the Main Corpus. Searches can also be limited to individual works, making it possible to identify idiolectal (author-specific) stylistic features and refine insights into their lexical and syntactic preferences. This goal—studying the author’s idiolect—justifies the inclusion of the author’s entire legacy, including letters and business documents, even where no artistic intent was present.

For example, the word полузавядший 'half-withered', characteristic of Turgenev, also appears in one version of Tolstoy’s Youth. The phrase она немедленно же 'she immediately then', typical of Leskov, is a distinctive marker of his individual style.

Sources and functionality of the corpus

To compile the corpus, preference was given to digitized complete collected works available through online libraries, in particular rvb.ru and feb-web.ru. Some of the most representative Soviet-era editions, used for authors such as Zhukovsky, Gogol, and Leskov, were not complete, often for ideological reasons. The corpus nevertheless includes Leskov’s novel At Daggers Drawn, which was omitted from the official 11-volume Soviet collection but is available via rvb.ru. The edition of Yevgeny Baratynsky used in the corpus includes only his poetry. The texts of Leo Tolstoy and Anton Chekhov were converted from specialized digital collections dedicated to those authors.

Editorial translations from foreign languages are not included. However, the corpus does include texts primarily written in a foreign language if they contain non-trivial Russian words and expressions or draft versions in Russian.

The corpus is currently in beta, and some conversion and OCR errors may be present. Corrections and new additions are planned. The current size of the corpus is approximately 26 million word usages.

By default, search results are sorted chronologically, from earlier to later texts. Users can also sort by author name, and within each author's results, by genre and title.
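The two sort orders described above can be illustrated with a minimal Python sketch. It is not the RNC's actual implementation: the Hit record, its field names, and the sample entries are all hypothetical placeholders invented for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One search result; the record and its field names are hypothetical."""
    author: str
    genre: str
    title: str
    year: int


def sort_chronologically(hits):
    """Default order: earlier texts first."""
    return sorted(hits, key=lambda h: h.year)


def sort_by_author(hits):
    """Alternative order: author name, then genre, then title."""
    return sorted(hits, key=lambda h: (h.author, h.genre, h.title))


# Invented placeholder records, not real corpus data.
hits = [
    Hit("Author B", "letter", "Text 3", 1850),
    Hit("Author A", "fiction", "Text 2", 1830),
    Hit("Author A", "fiction", "Text 1", 1845),
    Hit("Author B", "fiction", "Text 4", 1820),
]

for hit in sort_chronologically(hits):
    print(hit.year, hit.author, hit.title)
```

Because Python's sorted is stable, texts that share a year keep their original relative order in the chronological view.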

Key functionalities include:

Diachronic frequency charts
Comparative frequency plotting for multiple queries
Subcorpus selection by author, genre, or prose/poetry
Metadata comparisons and frequency dictionaries
Access to output formats such as “Statistics,” “Frequency,” and “N-grams”, similar to those in the Main Corpus
A “Word at a Glance” feature, with tools like “Word Sketches” and “Similar Words”
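To give a rough idea of what a diachronic frequency chart plots, the sketch below normalizes raw hit counts to instances per million words (ipm) for each decade. This is a common corpus-linguistics convention and only an assumption about how such charts might be computed; the function name, the input format, and the sample figures are invented for illustration and do not come from the RNC.

```python
from collections import defaultdict


def ipm_by_decade(records):
    """Aggregate (year, hits, total_words) tuples into ipm per decade.

    The input format is a hypothetical one chosen for this sketch.
    """
    hits = defaultdict(int)
    words = defaultdict(int)
    for year, n_hits, n_words in records:
        decade = year // 10 * 10  # e.g. 1829 -> 1820
        hits[decade] += n_hits
        words[decade] += n_words
    # Skip decades with no text to avoid division by zero.
    return {
        decade: hits[decade] * 1_000_000 / words[decade]
        for decade in sorted(words)
        if words[decade] > 0
    }


# Invented toy figures, not real corpus counts.
records = [
    (1821, 3, 100_000),
    (1829, 1, 100_000),
    (1835, 5, 250_000),
]

for decade, ipm in ipm_by_decade(records).items():
    print(decade, round(ipm, 1))
```

Normalizing by the amount of text per period matters here because the corpus is deliberately not balanced by date, so raw counts alone would mostly reflect how much text survives from each decade.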

The “Similar Words” tool is available not only for the entire corpus, but also for the works of nine authors whose collected texts are substantial enough: Chekhov, Dostoevsky, Gogol, Leskov, Nekrasov, Pushkin, Saltykov-Shchedrin, Tolstoy, and Turgenev.

This widget allows users to compare how a word is used in the individual styles of different writers. While the automatically calculated associated words may not always be informative (especially if the word in question occurs rarely or in too varied contexts), they often reveal strikingly individual patterns. For instance:

The word страсть 'passion' in Pushkin has a generally positive connotation (appearing with words like 'beauty' and 'freedom'), whereas in Tolstoy it carries a strongly negative charge (associated with 'lust' and 'malice');

The word лошадка 'little horse' in Leskov refers to domestic life, while in Chekhov, it is used as one of the nicknames for his wife, Olga Knipper.

These features make the corpus a powerful tool for stylistic analysis, idiolect research, and deep exploration of the lexical choices of Russian classical writers.

Creating the corpus

The corpus is being created by:

Boris Orekhov (general concept of the corpus; collection of texts, software processing)
Maria Satina (additional metadata markup)
Dmitri Sitchinava (manual proofreading, software processing, additional metadata markup)
Pavel Dyachenko (search implementation)
Alexey Polyakov (preparation of Gogol's texts)

Publications

A list of scholarly publications on the "Russian classics" corpus is available at https://ruscorpora.ru/s/boKPL. In the Publications section, use filters to find other types of publications about the corpus.

Updated on 25.03.2025