Corpus portrait

Corpora

141,035 texts
384,096,728 words

historical, disambiguated

Panchronic corpus

Corpus architecture

The panchronic corpus integrates texts from four historical corpora, namely the Old East Slavic, Inscriptions, Middle Russian, and Birchbark letters corpora, as well as the Main corpus, the earliest texts of which date back to the turn of the 17th and 18th centuries. In terms of size, it is dominated by texts from the Main corpus (more than 300 million tokens), but the historical corpora account for 70% of its chronological range.

Within the panchronic corpus the user may build queries that span several centuries of Russian linguistic history, such as "preposition po with locative case", "history of the noun zabava", "distribution of verbs of motion with abstract nouns as subjects", or "proper names in -slav". Such queries run against the bulk of the historical and modern texts at once, so the user does not have to enter a separate query in the interface of each corpus one by one.

The need for a panchronic corpus is dictated primarily by the different principles of representing lemmas, orthography, and grammar in the different historical and contemporary corpora. Because of this, a query can be transferred fully automatically from one corpus to another only to a very limited extent. In addition, a number of texts are assigned to one or another of these corpora without clear chronological boundaries. The dates 1400 (the boundary between the Old East Slavic and Middle Russian periods) and 1700 (the boundary between the Middle Russian and the Main/Modern corpus) can in practice be observed only with a large degree of conventionality. For the Birchbark letters and Inscriptions corpora, which end in the second half of the 15th century, such a division is inapplicable. The dates of a number of texts are not clearly defined and may be timespans that cross these boundaries.

The Panchronic corpus does not replace the existing separate historical and modern corpora. They are preserved in their entirety with their characteristic grammatical, metatextual, and other levels of annotation, as well as in their respective orthographic modes. They are regularly updated, and the Panchronic corpus is synchronized with these updates.

In the Panchronic corpus, a lemma can be annotated in normalized Early Old East Slavic (сълати), Late Old East Slavic/Middle Russian (слати), and Modern Russian (слать) forms. In texts that come from the historical corpora, all the later dictionary forms, even if they are only theoretical constructs, are also indicated for every lexeme. For example, in the Birchbark letters corpus the later lemmas продажникъ and продажник are marked, and the verb крити 'to buy' has a new conventional form крить, although this word had already disappeared in the Middle Ages. Earlier forms are marked only if the word actually occurs in an older historical corpus: for example, the word президент, attested in 17th-century texts, has an earlier lemma with final ъ, whereas the word компьютер does not. As the historical corpora are expanded, the number of lemmas that receive earlier variants in the markup will increase.

Correspondences between lemmas take into account parts of speech (for example, only the modern verb напасть, but not the homonymous noun, has the historical lemma напасти).
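The correspondences described above can be pictured as a table of lexemes keyed by part of speech. The following is a minimal sketch with toy data taken from the examples in this section. The class `LemmaRecord`, the function `find_records`, and the part-of-speech labels `V` and `S` are hypothetical names chosen for illustration and do not describe the corpus's internal format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class LemmaRecord:
    """One lexeme with its dictionary forms in the three normalizations."""
    pos: str                      # part of speech (hypothetical labels)
    modern: str                   # Modern Russian lemma
    late: Optional[str] = None    # Late Old East Slavic / Middle Russian lemma
    early: Optional[str] = None   # Early Old East Slavic lemma


# Toy table: only forms mentioned in the text are filled in.
# Placing напасти in the `late` slot is an arbitrary choice for this sketch.
RECORDS = [
    LemmaRecord(pos="V", modern="слать", late="слати", early="сълати"),
    LemmaRecord(pos="V", modern="напасть", late="напасти"),
    LemmaRecord(pos="S", modern="напасть"),    # homonymous noun: not linked to напасти
    LemmaRecord(pos="S", modern="компьютер"),  # no earlier lemma
]


def find_records(lemma: str, pos: Optional[str] = None) -> List[LemmaRecord]:
    """Return records where `lemma` matches any of the three dictionary forms."""
    return [
        r for r in RECORDS
        if lemma in (r.modern, r.late, r.early) and (pos is None or r.pos == pos)
    ]


for record in find_records("напасти"):
    print(record.pos, record.modern)
```

Because each record carries its part of speech, a lookup by the historical lemma напасти reaches only the verb, while a lookup by the modern form напасть returns both the verb and the noun.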

The set of grammatical features within the Panchronic corpus is different from the annotation of individual corpora. Characteristics that are tagged only in some corpora (for example, prepositional government, aspect, countable form) are usually excluded from it or their treatment is unified.

Grammatical homonymy is resolved in the texts of the Panchronic corpus. It is resolved manually only in the Old East Slavic, Birchbark letters, and Inscriptions corpora and in a 6-million subcorpus of the Main corpus. In the Middle Russian corpus and the majority of the texts of the Main corpus, lemmas and grammatical features are assigned automatically by neural network models, and this annotation contains some errors.

Semantic features are marked in the historical texts in accordance with the semantic classes of the corresponding words (etymological cognates) in modern Russian. Since lexical semantics is subject to historical change, and a number of words have been lost in the modern language and are thus absent from the modern semantic dictionary, the semantic labeling of historical texts may be inaccurate and incomplete and should be treated with caution. Nevertheless, the high coverage of historical texts with semantic markup obtained by this approach and the stability of the semantic classes of most lexemes compensate for the inevitable drawbacks of such annotation.

At a later stage, the Poetry corpus, which begins in 1700, is also to be included in the Panchronic corpus.

Searching the corpus

In the panchronic search, a lemma can be specified in normalized Early Old East Slavic (сълати), Late Old East Slavic/Middle Russian (слати), or Modern Russian (слать) form. All three are entered in the same query field and searched on an equal footing. Historical lemmas are spelled using only the letters of the modern alphabet plus ѣ. Cyrillic numerals in the historical texts (e.g. ·е҃·) are searched for with Arabic numerals (e.g. 5).
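To illustrate the relation between a Cyrillic numeral and the Arabic numeral typed in the search field, here is a minimal sketch of the conversion. It assumes the standard additive values of Cyrillic alphabetic numerals, the thousands sign U+0482, the combining titlo U+0483, and the middle dot U+00B7 as a delimiter. The function name `cyrillic_numeral_to_int` is hypothetical, and the corpus's own conversion rules are not described here and may differ.

```python
def _values(letters: str, scale: int) -> dict:
    """Map the n-th letter of `letters` to n * scale (n starts at 1)."""
    return {ch: (i + 1) * scale for i, ch in enumerate(letters)}


NUMERAL_VALUES = {
    **_values("авгдеѕзиѳ", 1),     # 1..9
    **_values("іклмнѯопч", 10),    # 10..90
    **_values("рстуфхѱѡц", 100),   # 100..900
}
THOUSANDS_SIGN = "\u0482"                  # multiplies the next letter by 1000
IGNORED = {"\u00b7", ".", " ", "\u0483"}   # delimiters and the titlo mark


def cyrillic_numeral_to_int(text: str) -> int:
    """Sum the letter values of a Cyrillic numeral (additive notation)."""
    total = 0
    multiplier = 1
    for ch in text.lower():
        if ch in IGNORED:
            continue
        if ch == THOUSANDS_SIGN:
            multiplier = 1000
            continue
        if ch not in NUMERAL_VALUES:
            raise ValueError(f"not a numeral letter: {ch!r}")
        total += NUMERAL_VALUES[ch] * multiplier
        multiplier = 1
    return total


# The example from the text: middle dot, letter е with titlo, middle dot.
print(cyrillic_numeral_to_int("\u00b7\u0435\u0483\u00b7"))
```

Because the notation is additive, the order of tens and units (as in the numerals for 11 to 19, where the unit is written first) does not affect the result.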

Searching by word forms finds word forms and their sequences in the texts of the Panchronic corpus, e.g. сам еси. Word forms can be typed in their original spelling. However, to find more examples from different periods, enter the word forms in modernized spelling, without historical letters, final ъ, titlos, or brackets. For the query сам еси, the user will find the spellings самъ ѥси, са(м҃) еси, and самъ ес[и]. An asterisk (*) stands for any word or any part of a word. To exclude a word in a certain position, place a minus sign (-) before it. For example, -в городе finds the form городе or городѣ when it is not preceded by the word в or въ.
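The effect of modernized spelling can be sketched as a normalization step that maps different historical spellings onto one search key. The code below is an illustration only: the function `modernize` and the small letter table are assumptions made for this example, and the corpus's actual normalization rules are certainly richer.

```python
import re

# Illustrative subset of historical letters and their modern counterparts.
LETTER_MAP = str.maketrans({
    "ѥ": "е",
    "ѣ": "е",
    "і": "и",
    "ѡ": "о",
})
BRACKETS = re.compile(r"[\[\]()]")          # editorial brackets
TITLO_MARKS = re.compile("[\u0483-\u0487]")  # combining titlo and related marks
FINAL_HARD_SIGN = re.compile(r"ъ\b")         # ъ at the end of a word


def modernize(form: str) -> str:
    """Reduce a historical spelling to a modernized search key."""
    form = form.lower()
    form = BRACKETS.sub("", form)
    form = TITLO_MARKS.sub("", form)
    form = form.translate(LETTER_MAP)
    return FINAL_HARD_SIGN.sub("", form)


# The three spellings from the text all reduce to the same key.
for spelling in ["самъ ѥси", "са(м\u0483) еси", "самъ ес[и]"]:
    print(modernize(spelling))
```

With such a key, a query typed as сам еси can be compared against the normalized form of every spelling rather than against the original strings.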

In the Panchronic corpus, results can be sorted by either of two dates: the date of creation of the original, or the date of the (surviving) copy or, where applicable, of publication. Sorting by the date of the copy is relevant for those linguistic features (orthography, phonetics, some morphological elements) that could have been introduced by copyists or editors and do not belong to the period when the text was created.

A subcorpus can be defined not only by date but also by the genre category of the text. The following categories are available: ecclesiastical, everyday life and letters, learning and academic, literary texts, official and business, varia. They are based on the genre annotation of the component corpora. Given how much the sociocultural situation has changed over a millennium, assigning a text to one of these generalized categories may be tentative. A single text may be assigned to several categories.
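As a rough illustration of subcorpus selection, the sketch below filters texts by date range and genre category. It assumes that a text's date is stored as a span of years and that a text matches when its span overlaps the requested range. Both are assumptions made for this example, as are the names `TextMeta` and `in_subcorpus` and the sample texts.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple


@dataclass(frozen=True)
class TextMeta:
    title: str
    created: Tuple[int, int]   # earliest and latest possible year of creation
    genres: FrozenSet[str]     # a text may belong to several categories


def in_subcorpus(text: TextMeta, year_from: int, year_to: int,
                 genre: Optional[str] = None) -> bool:
    """True if the text's date span overlaps the range and the genre matches."""
    start, end = text.created
    overlaps = start <= year_to and end >= year_from
    return overlaps and (genre is None or genre in text.genres)


# Hypothetical sample texts, not taken from the corpus.
TEXTS = [
    TextMeta("letter", (1380, 1420), frozenset({"everyday life and letters"})),
    TextMeta("sermon", (1150, 1150), frozenset({"ecclesiastical", "literary texts"})),
]

selected = [t.title for t in TEXTS if in_subcorpus(t, 1400, 1700)]
print(selected)
```

Treating dates as spans lets a text whose dating crosses a period boundary, such as 1400, appear in the subcorpora on both sides of that boundary.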

Normalized frequency graphs (the number of instances per million) can be plotted across the whole chronological range of the query. Even if the output contains examples older than the 12th century, the default graph starts only from 1100: the volume of 11th-century texts may be insufficient for statistically representative data.
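The normalization behind such a graph is simple arithmetic: the number of hits in a period is divided by the number of tokens in that period and scaled to one million. The sketch below shows this under stated assumptions: the bucketing by the starting year of a period, the function names, and the sample numbers are all hypothetical.

```python
from typing import Dict, List, Tuple


def instances_per_million(hits: int, tokens: int) -> float:
    """Normalized frequency: hits per one million tokens."""
    if tokens <= 0:
        raise ValueError("token count must be positive")
    return hits * 1_000_000 / tokens


def frequency_series(hits_by_year: Dict[int, int],
                     tokens_by_year: Dict[int, int],
                     start_year: int = 1100) -> List[Tuple[int, float]]:
    """Points of the graph, skipping periods that begin before `start_year`."""
    return [
        (year, instances_per_million(hits_by_year.get(year, 0), tokens))
        for year, tokens in sorted(tokens_by_year.items())
        if year >= start_year and tokens > 0
    ]


# Hypothetical counts per century, keyed by the first year of the century.
hits = {1000: 1, 1100: 4, 1200: 10}
tokens = {1000: 50_000, 1100: 2_000_000, 1200: 5_000_000}
print(frequency_series(hits, tokens))
```

In the sample data a single hit in a 50,000-token period would yield 20 instances per million, far above the better-attested periods, which shows why sparsely attested periods are left out of the default graph.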

Building the corpus

Developers of the historical corpus:

Sergei Gladilin, Dmitri Morozov, Viktor Sizov, and Irina Vinogradova (software architecture of the corpus, search implementation)
Olga Lyashevskaya (neural network model of homonymy removal with postcorrection within the main and Old East Slavic corpus)
Dmitri Sitchinava (general corpus concept; algorithms and checking of Early and Late Old East Slavic lemmas and grammatical features correspondence, historical corpus orthographies; testing and further development of the concept and annotation)
Timofei Arkhangelsky (constructing correspondences between Late Old East Slavic / Middle Russian and Modern Russian lemmas)
Anton Dyshkant (algorithmization and construction of lemma correspondence tables for each corpus)

Updated on 22.07.2024