News

03.12.2024

We are pleased to announce an important update to the search form on the Russian National Corpus website! Users can now add words before Word 1, which makes it much easier to compose and edit complex queries.

Previously, words could only be added to the right of Word 1 and subsequent words. For example, if you had built a query for a construction like “adjective + pronoun + дорога (‘road’)”, specifying the syntactic relationship between these words, and then decided to search for “conjunction + adjective + pronoun + дорога (‘road’)”, you had to rebuild the query from scratch. Now it is simpler: just click the “+” button to the left of Word 1 and specify any attribute, such as “conjunction”.

Please note: the principle of calculating the distance between words remains unchanged. The distance is always set from left to right: from the new Word 1 to the original Word 1, and then to the subsequent words.
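The left-to-right principle can be illustrated with a small toy model. This is only a sketch: the `Query` class, its `prepend` method, and the representation of a distance as a `(min, max)` pair are hypothetical names chosen for illustration, not the actual query format of the RNC search form.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Query:
    # Hypothetical toy model, not the RNC's real query representation.
    words: List[str]
    # gaps[i] is the distance from words[i] to words[i + 1], read left to right.
    # Representing a distance as a (min, max) range is an assumption.
    gaps: List[Tuple[int, int]] = field(default_factory=list)

    def prepend(self, word: str, gap: Tuple[int, int]) -> "Query":
        """Add a word before the current Word 1.

        The new gap runs from the new Word 1 to the original Word 1;
        all existing gaps are kept unchanged.
        """
        return Query([word] + self.words, [gap] + self.gaps)


query = Query(["adjective", "pronoun", "дорога"], [(1, 1), (1, 1)])
extended = query.prepend("conjunction", (1, 1))

for number, word in enumerate(extended.words, start=1):
    print(f"Word {number}: {word}")
```

In this model the original words simply shift one position to the right, and the distances already set between them are untouched.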

03.12.2024

New features in the Parallel Corpus make it easier to work with.

For the Japanese texts, a Semantics search field has been added to the bilingual search form. Japanese has become the first language other than Russian with semantic annotation.

For the Karelian, Veps, Chuvash, and Khakas language pairs, the subcorpus customization options have been extended. Texts can be selected by genre and type (for all four languages), as well as by topic (in the Chuvash corpus). For all the parallel corpora, a subcorpus can also be selected by number of tokens, so that texts of different sizes can be analyzed.

Search results can now be sorted using six new sorting keys: by the date of the text in a given language (Russian or foreign), from older texts to newer ones or vice versa, with originals and translations considered either together or separately. The new sorting keys will help you find the information you need faster and structure your data better.

03.12.2024

Texts by four poets (Vadim Shefner, Robert Rozhdestvensky, Lev Loseff and Maria Stepanova) have been added to the Poetry corpus. The size of the update is 200,000 words, 2,000 texts, and 44,000 verse lines. The whole corpus now features almost 3 million lines.

It is now possible to search for a word at the beginning or at the end of a poetic line. For example, one can easily see that the characteristic poetisms ужель ('is it so?') and вотще ('in vain') are found at the beginning of a line more often than in any other position in the verse.

03.12.2024

The Main corpus of the RNC has been expanded by 15 million tokens, representing several thematic collections: plays of different periods, official texts and legalese, academic journals, natural science reference books of the 18th century, mass literature (for example, both pre-revolutionary and post-Soviet love novels) and much more.

In all the texts in the Corpus, grammatical ambiguity has been automatically resolved and syntactic relationship annotation has been added. An updated version of the RuBic neural network model was used for the annotation, significantly improving the lemmatization of words. On the test dataset, the percentage of erroneous lemmas in the corpus decreased from 4.24% to 1.39%. Please let us know if you encounter errors in the automatic word annotation. To do so, click a word and select “Report a bug” in the pop-up window.
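To put the lemmatization figures above in perspective, the short calculation below derives the absolute and relative reduction from the two reported percentages. It is an illustrative sketch only: the function name `error_reduction` is made up for this example, and the only inputs are the 4.24% and 1.39% values quoted in the announcement.

```python
def error_reduction(before_pct: float, after_pct: float):
    """Return the absolute drop (percentage points) and the relative drop (%)."""
    absolute = before_pct - after_pct
    relative = absolute / before_pct * 100
    return absolute, relative


# Share of erroneous lemmas on the test dataset, as reported above.
absolute_drop, relative_drop = error_reduction(4.24, 1.39)

print(f"Absolute reduction: {absolute_drop:.2f} percentage points")
print(f"Relative reduction: {relative_drop:.1f}%")
```

In other words, the share of erroneous lemmas fell by 2.85 percentage points, which is roughly two thirds of the earlier error rate.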

Next to some examples in the Corpus, blue fields have appeared with the name of the direct speech subject (a character in a play or a speaker in a spoken text). If you click on this field, you can mark the gender, age, year of birth, job and/or role of the character or, respectively, of the speaker.

The morpheme annotation in the word search and in Word at a glance has been synchronized. For words absent from the RNC morphemic dictionary, morpheme parses are generated using a neural network model. The dictionary has been expanded and its consistency has been improved. Words whose parses are generated by the neural network are now also included in the word structure search, and the morphemic structure of a word is also available in the pop-up window.