News

02.02.2023

The functionality of the main corpus has been significantly upgraded. It now features lexical and grammatical annotation with automatic homonymy resolution, as well as automatic syntactic annotation. Grammatical homonyms in the main corpus are disambiguated, and the corpus is searchable by syntactic parameters such as types of compound sentences, predicative phrases (clauses), complements, copulas, and many others. With this new annotation, all the functions that appeared earlier in the corpus of regional media are now available in the main corpus: Searching for collocations, Frequency dictionary, and the Frequency type of search results.

In addition, the main and newspaper corpora are now searchable for lemmas and word forms using regular expressions (beta version). They also feature corpus and subcorpus statistics: overall size in texts and words, a geographical map (for the Regional media corpus only), and charts of metatextual attributes. These functions allow users to compare a given subcorpus with the corpus as a whole, including visually.

The interface of the Church Slavonic corpus has been substantially updated, and the corpus is connected to the Overview feature.

The multimedia corpus now contains 5.7 million tokens.
The parallel corpus now contains 168 million tokens. It features four new language pairs with Russian: two larger South Slavic subcorpora, Serbian and Slovene, and two smaller pilot corpora, Korean and Hindi, both of which come with transliteration and dictionary support. The Korean and Hindi tiers include aligned poetic texts, a new feature within the parallel corpus. The Czech and Spanish language pairs have also been updated.

10.01.2023

The interface of the Middle Russian corpus has been significantly updated; the corpus is connected to the Overview feature.

The Regional corpus has a new type of output: Frequency. It shows the statistical distribution of search results by lemma, word form, and a set of grammatical features. The frequency is calculated over a random subsample of 1 million search results, based on texts with automatically resolved homonymy. Users can set the confidence level when comparing frequency confidence intervals.
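
To illustrate how a confidence level affects a frequency confidence interval, here is a minimal Python sketch. It assumes a Wilson score interval for a proportion in a random sample; the RNC does not state which method it uses, so the function names and the sample figures below are hypothetical and serve only as an illustration.

```python
import math
from statistics import NormalDist


def wilson_interval(hits, sample_size, confidence=0.95):
    """Wilson score interval for the share of `hits` in a random sample.

    This is an assumed method, chosen for illustration; it is not
    necessarily the calculation performed by the corpus itself.
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be between 0 and 1")
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = hits / sample_size
    denom = 1 + z * z / sample_size
    centre = (p + z * z / (2 * sample_size)) / denom
    half = z * math.sqrt(
        p * (1 - p) / sample_size + z * z / (4 * sample_size ** 2)
    ) / denom
    return centre - half, centre + half


def intervals_overlap(a, b):
    """True if two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical counts: two lemmas in a subsample of 1 million results.
first = wilson_interval(1200, 1_000_000, confidence=0.95)
second = wilson_interval(1500, 1_000_000, confidence=0.95)
print(first, second, intervals_overlap(first, second))
```

Raising the confidence level widens both intervals, so two frequencies that look distinct at 95% may no longer be distinguishable at 99%.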

The Dialect corpus has been updated and now contains 604 thousand tokens.
SynTagRus has grown by 30 thousand tokens.

The corpus and subcorpus frequency dictionaries now feature the top 500 lemmas rather than the top 100.

30.12.2022

The RNC sums up the results of 2022. There have been many changes this year: the size of the whole Corpus reached 1.5 billion tokens, two new subcorpora were inaugurated (Panchronic and "From 2 to 15"), the corpus of birchbark letters became parallel, and the regional corpus is now disambiguated and features collocation and frequency tools. The old version of the RNC is now closed, and the RNC is migrating to a new interface. All the changes are shown in more detail in the picture.

16.12.2022

Each corpus within the RNC has its own Corpus Portrait. The Corpus Portrait functionality is designed as a tool that allows an RNC user to analyze the characteristics of a given corpus and assess whether it is suitable for their research or teaching needs. At this stage, the Corpus Portrait includes:

* a description of the corpus

* a frequency dictionary (only in the Regional Media Corpus)

All the RNC corpora have tags on a meta-corpus level, which make it possible to categorize them by historical period, text type, presence of specific annotation, etc.

If a subcorpus has been customized, users also have access to the respective "Subcorpus Portrait" function. With this tool, by clicking the (i) link in the header of the corpus, one can see the list of selected texts and compare the statistical characteristics of the corpus to the respective parameters of the customized subcorpus. For example, one can compare the frequency dictionary of the regional corpus to those of subcorpora selected within it.
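
A comparison of this kind can also be sketched offline. The Python example below assumes that the two frequency dictionaries are available as plain lemma counts and normalizes them to items per million tokens before comparing them. The counts, sizes, and function names are hypothetical and do not reflect any RNC export format.

```python
def per_million(counts, total_tokens):
    """Convert raw lemma counts to items per million tokens (ipm)."""
    return {lemma: n * 1_000_000 / total_tokens for lemma, n in counts.items()}


def compare_dictionaries(corpus_counts, corpus_size, sub_counts, sub_size):
    """Rank lemmas by how over-represented they are in the subcorpus.

    Returns (lemma, subcorpus ipm, corpus ipm, ratio) tuples, with the
    highest ratio first. Lemmas absent from the corpus dictionary are
    skipped because no ratio can be computed for them.
    """
    corpus_ipm = per_million(corpus_counts, corpus_size)
    sub_ipm = per_million(sub_counts, sub_size)
    rows = []
    for lemma, ipm in sub_ipm.items():
        base = corpus_ipm.get(lemma)
        if base:
            rows.append((lemma, ipm, base, ipm / base))
    rows.sort(key=lambda row: row[3], reverse=True)
    return rows


# Hypothetical toy data: a 1,000,000-token corpus and a 100,000-token subcorpus.
corpus_counts = {"город": 5000, "река": 1000, "завод": 500}
sub_counts = {"город": 400, "река": 300, "завод": 20}

rows = compare_dictionaries(corpus_counts, 1_000_000, sub_counts, 100_000)
for lemma, sub_ipm, base_ipm, ratio in rows:
    print(f"{lemma}: {sub_ipm:.0f} ipm vs {base_ipm:.0f} ipm (x{ratio:.2f})")
```

Normalizing to items per million matters here because a subcorpus is usually much smaller than the corpus it was selected from, so raw counts are not directly comparable.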

In 2023, more statistics will appear in the Corpus and Subcorpus Portraits.