Corpus architecture
The Panchronic corpus integrates texts from four historical corpora (Old East Slavic, Inscriptions, Middle Russian, and Birchbark letters) and from the Main corpus, whose earliest texts date to the turn of the 17th and 18th centuries. In size it is dominated by the Main corpus (more than 300 million tokens), but 70% of its chronological range is covered by the historical corpora.
Within the Panchronic corpus the user can build queries that span several centuries of Russian linguistic history, such as "preposition po with locative case", "history of the noun zabava", "distribution of verbs of motion with abstract nouns as subjects", or "proper names in -slav". Such a query runs over the bulk of the historical and modern texts at once, so the user does not have to enter it separately in the interface of each corpus.
The need for a Panchronic corpus is dictated primarily by the different principles of representing lemmas, orthography, and grammar in the different historical and contemporary corpora. Because of this, fully automatic transfer of a query from one corpus to another can be implemented only to a very limited extent. In addition, a number of texts are assigned to one or another of these corpora without clear chronological boundaries. The dates 1400, as the boundary between the Old East Slavic and Middle Russian periods, and 1700, as the boundary between the Middle Russian and the Main/Modern corpus, are in practice largely conventional. For the Birchbark letters and Inscriptions corpora, which end in the second half of the 15th century, such a division is inapplicable. The dates of a number of texts are not clearly defined and may be timespans that cross these boundaries.
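The effect of imprecise dating on the conventional boundaries can be illustrated with a minimal sketch. It assumes a text's date is stored as a span of years. The function name, the period labels, and the placeholder outer bounds are hypothetical and are not taken from the corpus itself. Only the boundary years 1400 and 1700 come from the description above.

```python
from typing import List, Tuple

# Conventional boundaries from the text. The outer bounds (0 and 9999) are
# placeholders for "open-ended", not claims about the corpus.
PERIODS: List[Tuple[str, int, int]] = [
    ("Old East Slavic", 0, 1400),
    ("Middle Russian", 1400, 1700),
    ("Modern", 1700, 9999),
]


def periods_for_span(start: int, end: int) -> List[str]:
    """Return every period that the dating span [start, end] overlaps.

    Each period is treated as the half-open interval [lo, hi), so a text
    dated exactly 1400 counts as Middle Russian only.
    """
    if start > end:
        raise ValueError("start must not exceed end")
    return [name for name, lo, hi in PERIODS if start < hi and end >= lo]
```

A text dated to a span such as 1380-1420 overlaps two periods under this scheme, which is the situation where a single-period assignment becomes a matter of convention.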
The Panchronic corpus does not cancel the existing separate historical and modern corpora. They are preserved in their entirety with their characteristic grammatical, metatextual, and other levels of annotation, as well as in their respective orthographic modes. They are regularly updated, and the Panchronic corpus is synchronized with these updates.
In the Panchronic corpus, a lemma can be annotated in its normalized Early Old East Slavic (сълати), Late OES/Middle Russian (слати), and Modern Russian (слать) forms. In the texts stemming from the historical corpora, all the later dictionary forms, even if only theoretically constructed, are also indicated for every lexeme. For example, in the Birchbark letters corpus the later lemmas продажникъ and продажник are marked, and the verb крити 'to buy' has a new conventional form крить, although this word disappeared already in the Middle Ages. Earlier forms are marked only if the word in fact occurs in an older historical corpus: for example, the word президент, attested in 17th-century texts, has an earlier lemma with final ъ, while the word компьютер does not. As historical corpora are expanded, the number of lemmas that receive earlier variants in the markup will increase.
Correspondences between lemmas take into account parts of speech (for example, only the modern verb напасть, but not the homonymous noun, has the historical lemma напасти).
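The part-of-speech-sensitive correspondence between lemmas can be pictured as a lookup keyed by both the modern lemma and its part of speech. The following is only an illustrative sketch: the table layout, the function name, and the tags "V" and "S" are hypothetical, and the entries are limited to examples mentioned in the description above.

```python
from typing import Dict, Tuple

# Keys are (modern lemma, part-of-speech tag). The tags "V" (verb) and
# "S" (noun) are hypothetical labels chosen for this sketch.
EARLIER_LEMMAS: Dict[Tuple[str, str], Tuple[str, ...]] = {
    ("слать", "V"): ("слати", "сълати"),
    ("напасть", "V"): ("напасти",),
}


def earlier_lemmas(modern: str, pos: str) -> Tuple[str, ...]:
    """Return the earlier lemma variants recorded for a modern lemma.

    An empty tuple means no earlier variant is recorded, either because
    the word is unattested in older corpora or because the homonym with
    this part of speech has no historical counterpart.
    """
    return EARLIER_LEMMAS.get((modern, pos), ())
```

Keying on the pair rather than on the lemma string alone keeps the verb напасть and the homonymous noun apart, and a word such as компьютер simply has no entry.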
The set of grammatical features within the Panchronic corpus is different from the annotation of individual corpora. Characteristics that are tagged only in some corpora (for example, prepositional government, aspect, countable form) are usually excluded from it or their treatment is unified.
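One way to picture the exclusion of corpus-specific characteristics is as a projection of each token's annotation onto a shared feature set. The sketch below covers only the "excluded" case, not the unification of treatment. The feature keys and the function name are hypothetical and do not reproduce the actual tag names of any corpus.

```python
from typing import Dict

# Characteristics the text names as tagged only in some corpora.
# The key spellings are hypothetical.
CORPUS_SPECIFIC = {"prepositional_government", "aspect", "countable_form"}


def to_shared_features(features: Dict[str, str]) -> Dict[str, str]:
    """Drop features that are assumed not to be tagged in every corpus."""
    return {key: value for key, value in features.items()
            if key not in CORPUS_SPECIFIC}
```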
In the texts of the Panchronic corpus grammatical homonymy is resolved. This is done manually only in the Old East Slavic, Birchbark, and Inscriptions corpora and in a 6-million subcorpus of the Main corpus. In the Middle Russian corpus and the majority of the texts of the Main corpus, lemmas and grammatical features are assigned automatically by neural network models, and this annotation contains some errors.
Semantic features are marked in the historical texts in accordance with the semantic classes of the corresponding words (etymological cognates) in modern Russian. Since lexical semantics is subject to historical change, and a number of words have been lost in the modern language and are thus absent from a modern semantic dictionary, the semantic labeling of historical texts may be inaccurate and incomplete and should be treated with caution. Nevertheless, the high coverage of historical texts with semantic markup obtained by this approach and the stability of the semantic classes of the majority of lexemes compensate for the inevitable drawbacks of such annotation.
At a later stage, the Poetry corpus, which begins from 1700, is also to be included in the Panchronic corpus.