The Regional and International Media corpus was opened for general access in 2015. It covers media at different territorial levels: regional issues of central newspapers, regional newspapers, and local publications. The texts date from 1996 to 2020. The publications cover a broad geography: all federal districts of Russia, as well as some other countries (Belarus, Moldova, Kyrgyzstan, the Baltic states).
The current version of the regional media corpus comprises four relatively independent collections: texts of Russian-language newspapers published in the Brest and Hrodna regions of the Republic of Belarus («Linguistic illustrative corpus of Hrodna regional media»), two collections of Russian regional newspapers (newspapers of the 1990s-2000s and media of the 2010s), and a collection of regional issues of Komsomolskaya Pravda. Users can work with all of them as a single dataset or with each collection individually. The corpus search provides these and many other features.
Since 2022, a number of parameters have been available in the regional corpus in test mode. Similar markup will be extended to all RNC texts written in modern Russian.
First, in addition to searching the non-disambiguated version, users can search texts in which homonymy has been resolved automatically. Every token in the regional and international media corpus is assigned its most likely lemma and grammatical features. The annotation was produced by a neural network model trained on a manually disambiguated corpus of 6 million words. Errors are possible both in the choice of grammatical labels and in the choice (and form) of lemmas.
Second, syntactic groups are marked within the regional corpus, including clause types (predicate groups), subject and predicate groups, and other parameters. This annotation was likewise produced by a trained neural network.
Since October 2023, the texts of the corpus have been annotated with keywords. The keywords are assigned by the NeuroRNC network using a fine-tuned rutermextract model. A keyword may be either a single noun in the singular or plural (праздник, переломы) or a two-token nominal phrase (таяние снега, обычные дни, Иван Петров). In a query, a space is interpreted as the space within a nominal phrase. Multiple keywords may be separated by a comma (logical AND) or by a | sign (logical OR). A single-token query such as община returns both exact matches and nominal phrases that contain it, such as католическая община.
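The keyword query rules described above can be illustrated with a minimal Python sketch. It is not the corpus's actual implementation: the function names (parse_query, term_matches, document_matches) are hypothetical, matching is done on plain lowercased strings without any lemmatization, and the sketch assumes that the comma (AND) binds more tightly than the | sign (OR), which the description above does not specify.

```python
from __future__ import annotations


def normalize(text: str) -> str:
    """Lowercase and collapse runs of whitespace to single spaces."""
    return " ".join(text.lower().split())


def parse_query(query: str) -> list[list[str]]:
    """Parse a query into a list of OR-alternatives, each a list of AND-terms.

    Assumption (not stated in the corpus description): ',' binds tighter
    than '|', so 'a, b | c' is read as '(a AND b) OR c'.
    """
    alternatives = []
    for alternative in query.split("|"):
        terms = [normalize(term) for term in alternative.split(",")]
        terms = [term for term in terms if term]  # drop empty fragments
        if terms:
            alternatives.append(terms)
    return alternatives


def term_matches(term: str, keyword: str) -> bool:
    """Check one normalized query term against one document keyword.

    A two-token term must equal the keyword phrase; a single-token term
    matches the keyword itself or any phrase that contains it as a token.
    """
    keyword = normalize(keyword)
    if " " in term:
        return term == keyword
    return term in keyword.split()


def document_matches(query: str, doc_keywords: list[str]) -> bool:
    """Return True if the document's keywords satisfy the query."""
    return any(
        all(any(term_matches(term, kw) for kw in doc_keywords) for term in terms)
        for terms in parse_query(query)
    )


# Hypothetical keyword annotation of a single document.
doc = ["католическая община", "праздник"]
print(document_matches("община", doc))               # True: token inside a phrase
print(document_matches("община, переломы", doc))     # False: AND, second term absent
print(document_matches("переломы | праздник", doc))  # True: OR, second term present
```

In this sketch a space inside a term is kept as part of a two-token phrase, so таяние снега is treated as one keyword rather than two separate terms.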