Corpus portrait

2,738,158 texts
819,818,156 words

syntactically tagged, disambiguated

National media

The national media corpus was launched in 2010 and covers media articles from 1983 (the newspaper Argumenty i Fakty) to 2021. Large amounts of digitized media text are available and are of great interest for tracking linguistic change in real time (for example, how the word smartfon appears and becomes habitual in Russian, or how the preposition po increases in usage). These texts cannot all be included in the main corpus, because that would distort its representativeness with regard to both genre and chronology. The separate media corpus has no such limitation. It is the largest subcorpus of the RNC: it exceeds the main corpus in size and is approaching 1 billion word uses.

The national media corpus includes texts from several media outlets, both printed newspapers and digital editions, in roughly equal amounts. The corpus is updated annually, with several tens of millions of words added every year.

Creating the corpus

The national media corpus is being created by the IRL RAS group led by Svetlana Savchuk. Lev Alekseevsky, Mikhail Kudinov, Boris Orekhov, and Dmitri Sitchinava have also been involved. We thank Dmitri Levonyan and Sergei Rubakov (Corpus Technologies) for the texts provided at the initial stage of the project.

Updated on 22.07.2024