The language of social networks is the most dynamic variety of the language and the least bound by normative restrictions. Linguists have long needed a tool that, on the one hand, carries RNC markup and, on the other, supports search across a large volume of digital communication texts.
Here we interpret "social networks" as broadly as possible, so that the term also covers blog posts and messenger texts.
For some of the texts, the dialogic structure is preserved: the corpus allows separate searches over entries and over the comments on them. When searching comments, you can also see the original post that started the thread.
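The relationship between entries and comments described above can be illustrated with a minimal sketch. It is not the corpus's actual implementation: the `Text` record, the `parent_id` field, and the `search` function are hypothetical names introduced only for illustration, and simple substring matching stands in for real corpus search.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Text:
    """Hypothetical record: an entry, or a comment pointing at its entry."""
    text_id: int
    body: str
    parent_id: Optional[int] = None  # None for an entry; the entry's id for a comment


def search(texts: List[Text], query: str, kind: str) -> List[Tuple[Text, Optional[Text]]]:
    """Search only entries (kind="entry") or only comments (kind="comment").

    Each hit is returned together with the original post it belongs to,
    or with None when the hit is itself an entry.
    """
    by_id = {t.text_id: t for t in texts}
    results = []
    for t in texts:
        is_comment = t.parent_id is not None
        if is_comment != (kind == "comment"):
            continue  # skip texts of the other kind
        if query.lower() in t.body.lower():
            original = by_id.get(t.parent_id) if is_comment else None
            results.append((t, original))
    return results


texts = [
    Text(1, "Great weather today"),
    Text(2, "The weather is awful here", parent_id=1),
    Text(3, "Nothing to add", parent_id=1),
]
for hit, original in search(texts, "weather", "comment"):
    print(hit.body, "| in reply to:", original.body)
```

Keeping a parent reference on each comment is one simple way to let a comment hit be displayed alongside the post that opened the thread.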
Genres are annotated automatically for all corpus texts. The annotation was produced with a RuRoBERTa model fine-tuned on the texts of the corpus.
Social media texts offer a constantly expanding range of non-lexical means of expressing emotion (emoticons, emoji, occasionally used symbols). The corpus markup necessarily simplifies this aspect of electronic communication. Emoticon search is not available in the current version of the corpus but is planned for the future.
The corpus currently includes more than 160 million word tokens from texts dating from 2001 onward, and it will be expanded with more deeply annotated texts.