RNC News

The parallel corpora have reached a combined size of 210 million tokens. New parallel corpora have been added for four languages of Russia: Chuvash (24 million words), Karelian (1.2 million), Veps (340 thousand) and the North Russian dialect of Romani (170 thousand). The bilingual pairs were prepared in cooperation with the developers of standalone large-scale corpus projects for these languages. For some language pairs, extended metatextual information is available, including the source, genre, type and topic of each text. Existing parallel corpora have also been expanded: English (by 5 million), Spanish (by 700 thousand) and Czech (by 15 thousand).

The Social networks corpus has been expanded to 3.5 million words. It includes a collection of texts prepared by the staff of Voronezh State University. The collection consists of posts by well-known bloggers and of discussions in local networks and in local groups on the popular platforms VK, Telegram, LiveJournal, Zen and others. The materials were collected in the Arkhangelsk, Astrakhan, Kursk, Rostov, Ryazan and Tambov Regions and span the years 2005 to 2023.