RNC News

The Main corpus of the RNC has been expanded by 15 million tokens drawn from several thematic collections: plays of different periods, official and legal texts, academic journals, 18th-century natural science reference books, popular fiction (for example, both pre-revolutionary and post-Soviet romance novels) and much more.

In all texts of the Corpus, grammatical ambiguity has been resolved automatically and syntactic relation annotation has been added. The annotation was produced with an updated version of the RuBic neural network model, which significantly improves lemmatization. On the test dataset, the percentage of erroneous lemmas decreased from 4.24% to 1.39%. Please let us know if you encounter errors in the automatic word annotation. To do so, click a word and select “Report a bug” in the pop-up window.

Next to some examples in the Corpus, blue fields now show the name of the speaker of the direct speech (a character in a play or a speaker in a spoken text). If you click on this field, you can mark the gender, age, year of birth, job and/or role of the character or speaker.

The morpheme annotation in the word search and in Word at a glance has been synchronized. For words absent from the RNC morphemic dictionary, morpheme parses are generated by a neural network model. The dictionary has been expanded and made more consistent. Words whose parses are generated by the neural network are now also included in the word structure search, and the morphemic structure of a word is also available in the pop-up window.