RNC News

The Panchronic corpus now takes into account recent additions to its constituent corpora, namely the Old East Slavic corpus and the Birchbark letters corpus. It also includes all the inscriptions from the new corpus of East Slavic inscriptions. The analysis of lemmas of the Middle Russian texts within the Panchronic corpus has been corrected and updated (about 3000 new lexemes). The table of correspondences between lemmas and grammatical features of different historical periods has been corrected and supplemented with new data. These correspondences now take part of speech into account (for example, only the modern verb напасть, but not the homonymous noun, has the historical lemma напасти). In addition, within the Panchronic corpus it is now possible to customize a subcorpus by the genre category of a text, depending on whether it belongs to the literary, ecclesiastical, everyday, business or educational domain (one text can have several categories). This is important for studying the evolution of vocabulary and of grammatical parameters that strongly depend on genre.

The Regional corpus has been enlarged to 35.5 million tokens. It now includes texts from five new newspapers and a large collection of Voronezh Oblast media prepared by the staff of Voronezh State University. In the newly added texts, grammatical homonymy has been resolved and syntactic annotation has been added. Keywords for the texts were generated using the NeuroRNC language model.

The Poetry corpus now contains over one hundred thousand poems; the size of the corpus has grown by half a million tokens and is now close to 14 million. The works of ten poets have been added to the corpus, among them three volumes of poems by Samuil Marshak (including translations) and collections of poems by Bulat Okudzhava, Inna Lisnyanskaya, Yuri Kublanovsky, Timur Kibirov and others.
