RNC News

05.03.2024

In February, we significantly improved the National media corpus.

We added new texts totaling 49.6 million tokens. These are print media from the 1990s ("Nezavisimaya Gazeta", including its weekly supplements, "Moskovsky Komsomolets", and "St. Petersburg Vedomosti").

In all texts of the corpus, grammatical homonymy has been resolved automatically and syntactic relations have been annotated (in the search form, they can be specified starting from the second token, after clicking "add condition"). As a result, all the latest features already available in the Main and Regional media corpora, such as search by syntactic relations and properties, collocation search, the frequency dictionary, and frequency of query results, are now also available in the National media corpus, the largest of the three.

The RNC Media corpus is now the world's largest Russian online corpus with the ability to search by syntactic relations!

In the subcorpus form, it is now possible to select texts by topic and type. These fields are annotated with a RuRoBERTa model fine-tuned on Regional corpus data. Fields in the subcorpus form and in the text information whose values were generated by NeuroRNC are marked with a special icon. Automatic annotation may contain errors. The text information pop-up window has a "Report an error" button. Please let us know about any inaccuracies or errors in the assigned topics and types.

