RNC News

The corpus of regional media can now be searched for collocations. This search mode uses a statistical approach. Collocations are combinations of words that occur together more often than would be expected by chance. Candidate collocations are scored with the following statistical measures: Dice, log-likelihood, t-score, MI3, and an aggregated measure (the geometric mean of the t-score and MI3 measures).
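As an illustration of how such measures are commonly computed, here is a minimal Python sketch for a single word pair. It uses the textbook definitions of each measure and is not taken from the corpus's own implementation, which may differ in details such as windowing, smoothing, or logarithm base. The function name `collocation_scores` and its parameters (`f12` for the co-occurrence frequency, `f1` and `f2` for the individual word frequencies, `n` for the corpus size) are hypothetical. Returning 0 for the aggregated measure when either component is not positive is likewise an assumption, made because a geometric mean is undefined in that case.

```python
import math


def collocation_scores(f12, f1, f2, n):
    """Score one word pair with common association measures.

    f12: co-occurrence frequency of the pair (must be > 0)
    f1, f2: frequencies of the first and second word
    n: corpus size in tokens
    All names here are illustrative, not part of any corpus API.
    """
    if f12 <= 0:
        raise ValueError("the pair must co-occur at least once")

    # Co-occurrence frequency expected if the two words were independent.
    expected = f1 * f2 / n

    dice = 2 * f12 / (f1 + f2)
    t_score = (f12 - expected) / math.sqrt(f12)
    mi3 = math.log2(f12 ** 3 / expected)

    # Log-likelihood ratio (G^2) over the 2x2 contingency table.
    observed = [f12, f1 - f12, f2 - f12, n - f1 - f2 + f12]
    expected_cells = [
        f1 * f2 / n,
        f1 * (n - f2) / n,
        (n - f1) * f2 / n,
        (n - f1) * (n - f2) / n,
    ]
    log_likelihood = 2 * sum(
        o * math.log(o / e)
        for o, e in zip(observed, expected_cells)
        if o > 0  # a zero cell contributes nothing to the sum
    )

    # Assumption: geometric mean of t-score and MI3, set to 0 when
    # either value is not positive (the mean is undefined there).
    if t_score > 0 and mi3 > 0:
        aggregated = math.sqrt(t_score * mi3)
    else:
        aggregated = 0.0

    return {
        "dice": dice,
        "log_likelihood": log_likelihood,
        "t_score": t_score,
        "mi3": mi3,
        "aggregated": aggregated,
    }
```

For a pair that co-occurs exactly as often as chance predicts, the t-score and log-likelihood are both zero under these definitions, so the pair would rank low on those measures.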

For more information about the new functionality, see here.

The parallel corpus now contains 168.8 million tokens. The Czech, German, English, French, and Spanish language pairs have been expanded with new texts.

The Church Slavonic corpus has been updated and now comprises 5.2 million tokens. It partially includes the "Green Menaion" edition of 2002, which uses the civil (modernized) orthography. The Church Slavonic corpus now features more detailed metatextual annotation. All texts are annotated with their date of publication. Texts of the Modern era are provided with dates and authorship, and the more recent liturgical texts (from the 18th century onward) also carry information about their drafting and approval.

The Middle Russian corpus has been expanded to 8.8 million tokens. The added texts include the volume of the "Library of Literature of Old Rus" dedicated to the 17th century (prose stories and songs), the earliest texts from the "Letters and Papers of Peter the Great", and the 16th-century "Embassy book on relations with the Crimean Khanate". The morphological annotation of the texts previously included in the corpus has been corrected and updated.
