RNC News

17.10.2023

Metatextual annotation is what lets you select your own subcorpora and use our statistical services. The RNC already contains more than 6 million texts and is constantly growing. Marking up this much data manually is becoming less and less feasible, so we are developing neural annotation services (NeuroRNC). Today we present new results in this area.

Keywords in the texts of the Regional Media corpus are annotated automatically using an adapted rutermextract model. A keyword can consist of a single token (праздник "holiday", переломы "fractures") or a two-word combination (таяние снега "snowmelt"). A single-token query (сообщество "community") returns both exact matches and two-word combinations that contain this word (католическое сообщество "Catholic community").
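The matching behaviour described above can be illustrated with a minimal sketch. This is not the RNC's search code: the function name `matches_keyword` and the sample keyword list are hypothetical, and the sketch assumes plain case-insensitive token comparison with no lemmatization, which the actual service may handle differently.

```python
def matches_keyword(query: str, keyword: str) -> bool:
    """Return True if the single-token query equals one of the keyword's tokens.

    Hypothetical helper: it covers both an exact one-word match and
    a two-word keyword that contains the queried word.
    """
    token = query.strip().lower()
    return token in keyword.lower().split()


# Illustrative keyword annotations (assumed sample data, not real corpus output).
keywords = ["сообщество", "католическое сообщество", "таяние снега", "праздник"]

# A single-token query returns the exact match and the two-word combination.
hits = [k for k in keywords if matches_keyword("сообщество", k)]
print(hits)
```

Because the sketch compares surface forms only, a query such as снег would not match таяние снега; a real search would need morphological normalization to do that.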

In the Social Networks corpus, genres are labeled automatically for the main corpus texts. The annotation uses a RuRoBERTa model fine-tuned on the corpus texts. One or more genres can be selected from a list, e.g., recommendations and advice.

In the text information, fields filled in by NeuroRNC are marked with a special icon. The same pop-up window has a "Report a bug" button. Please let us know about any inaccuracies or errors in the assigned keywords and genres.

