Corpora
  • 110,265 texts
  • 35,489,528 words
syntactically tagged, disambiguated
Regional media corpus

The corpus of Regional and international media was opened for general access in 2015. It features media of different territorial levels, namely regional issues of central newspapers, regional newspapers, and local publications. The timespan of the texts is 1996-2020. The geography of publications is broad and covers all federal districts of Russia, as well as some other countries (Belarus, Moldova, Kyrgyzstan, the Baltic states).

The current version of the regional media corpus features four relatively independent collections: texts of Russian-language newspapers published in the Brest and Hrodna regions of the Republic of Belarus («Linguistic illustrative corpus of Hrodna regional media»), two collections of Russian regional newspapers (newspapers of the 1990s-2000s and media of the 2010s), and a collection of regional issues of Komsomolskaya Pravda. The user can work with them either as a single dataset or collection by collection. These and many other features are provided by the corpus search.

Since 2022, a number of parameters have been available in the regional corpus in test mode. Similar markup will be extended to all RNC texts written in modern Russian.

First, search is available not only over the non-disambiguated version but also over automatically resolved homonymy: throughout the regional and international media corpus, each word is assigned its most likely lemma and grammatical features. The annotation was produced by a neural network model trained on a manually disambiguated 6-million-word corpus. Errors are possible in the choice of grammatical labels as well as in the choice (and form) of lemmas.

Second, syntactic groups are marked within the regional corpus, such as clause types (predicate groups), subject and predicate groups, and other parameters. This annotation was likewise produced by a trained neural network.

Since October 2023, the texts of the corpus have been annotated with keywords. The keywords are assigned by the NeuroRNC network using the fine-tuned rutermextract model. A single keyword may consist either of a single noun in the singular or plural (праздник, переломы) or of a two-token nominal phrase (таяние снега, обычные дни, Иван Петров). In a query, a space is interpreted as the space within a nominal phrase. Multiple keywords may be separated by a comma (logical AND) or by a | sign (logical OR). A single-token query such as община yields both exact matches and nominal phrases that contain it, such as католическая община.
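The keyword query rules described above can be illustrated with a small Python sketch. It is not the corpus's actual search code: the function names are hypothetical, and several details are assumptions made for the example, namely that a comma binds more loosely than the | sign, that matching is case-insensitive, and that keywords are compared as plain strings without lemmatization.

```python
def parse_keyword_query(query: str) -> list[list[str]]:
    """Split a query into AND-groups (comma) of OR-alternatives (|).

    Assumption: "a, b | c" is read as "a AND (b OR c)"; the source
    does not state the precedence of the two separators.
    """
    groups = []
    for part in query.split(","):
        # Collapse repeated whitespace; a single space stays inside a phrase.
        alternatives = [" ".join(alt.split()) for alt in part.split("|")]
        alternatives = [alt for alt in alternatives if alt]
        if alternatives:
            groups.append(alternatives)
    return groups


def term_matches(term: str, keyword: str) -> bool:
    """Check one query term against one keyword assigned to a text."""
    term_tokens = term.casefold().split()
    keyword_tokens = keyword.casefold().split()
    if len(term_tokens) == 1:
        # A single-token term matches the exact keyword and any
        # nominal phrase that contains it as a token.
        return term_tokens[0] in keyword_tokens
    # A multi-token term must match the whole nominal phrase.
    return term_tokens == keyword_tokens


def text_matches(query: str, keywords: list[str]) -> bool:
    """Return True if a text's keywords satisfy the whole query."""
    groups = parse_keyword_query(query)
    return bool(groups) and all(
        any(term_matches(term, kw) for term in group for kw in keywords)
        for group in groups
    )
```

For instance, under these assumptions `text_matches("община", ["католическая община"])` holds, while `text_matches("община, праздник", ["католическая община"])` does not, because the second AND-group has no matching keyword.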

Publications

Check out the list of scientific publications on the corpus of regional media via the link: https://ruscorpora.ru/s/e1qnG. To find other types of publications related to the corpus, use the filters in the "Publications" section.

Corpus creation

The corpus of Russian regional and international media was built with the support of the Russian Science Foundation (grant 13-24-01004). It includes the illustrative linguistic corpus of mass media from the Hrodna region, built with the support of the Belarusian Foundation for Fundamental Research (project G13R-050) by the staff of the Department of General and Slavic Linguistics of Yanka Kupala State University of Hrodna. The supervisor of the project was Liudmila Rychkova. Other participants include Alesia Stankevich, Iryna Chepikava, and Aliona Mokhan'. Links to publications can be found in the «Publications» section and at the Studiorum.

Further development of the corpus was carried out by the IRL RAS group headed by Svetlana Savchuk with the support of grant no. 17-29-09154 from the Russian Foundation for Fundamental Research (project leader: Galina Kustova). Ilya Makarchuk, Elena Morozova, Ivan Mukhin, Boris Orekhov, and Evgenia Slepak participated in the project.

Updated on 22.07.2024