The RNC neural network models
The Russian National Corpus (RNC) uses modern technologies to mark up texts, which allows large volumes of text to be annotated in a short time.
This page contains a description of the neural network models of the RNC, instructions for their use, as well as links for downloading the models. Previous versions of the models are available on the page "RNC archived neural network models". Before downloading a model, users should read and accept the software license agreement.
Tokenizer
To tokenize the Main, Newspaper, and several other RNC corpora, we use a tokenizer from the Stanza library trained on a prepared sample of texts: the open-source Taiga and SynTagRus corpora, as well as internal data from the corpora of 20th–21st-century prose, poetry, 18th-century texts in old orthography, and 21st-century media texts. On test data from the same sample, the quality of sentence segmentation is 95.6% (F1) and the quality of tokenization is 99.6%.
To run the model, you need to install the stanza library:
pip install stanza~=1.8.1
import stanza

if __name__ == "__main__":
    # Path to the downloaded RNC tokenizer model
    STANZA_PATH = "YOUR_PATH_HERE.pt"

    # Build a pipeline that runs only the tokenizer,
    # substituting the RNC model for the default one
    ppln = stanza.Pipeline(
        lang='ru',
        processors='tokenize',
        tokenize_model_path=STANZA_PATH,
        use_gpu=True
    )

    # Sample text: the opening of Bulgakov's "The Master and Margarita"
    SOME_TEXT = "Однажды весною, в час небывало жаркого заката, в Москве, на Патриарших прудах, появились два гражданина. Первый из них, одетый в летнюю серенькую пару, был маленького роста, упитан, лыс, свою приличную шляпу пирожком нес в руке, а на хорошо выбритом лице его помещались сверхъестественных размеров очки в черной роговой оправе. Второй – плечистый, рыжеватый, вихрастый молодой человек в заломленной на затылок клетчатой кепке – был в ковбойке, жеваных белых брюках и в черных тапочках."

    doc = ppln(SOME_TEXT)
    print(doc)
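The call returns a stanza Document object. A short sketch of reading the segmentation out of it with the standard stanza API, continuing the example above:

# Print each detected sentence as a numbered list of its tokens
for i, sent in enumerate(doc.sentences, start=1):
    print(i, [token.text for token in sent.tokens])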
Vector space models
We use word2vec models trained on the texts of a specific corpus to search for word associates in the RNC. Currently, models have been trained for seven corpora: Main, Regional Media, Educational, Middle Russian, "Russian Classics", "From 2 to 15", and Old East Slavic. The Continuous Bag-of-Words algorithm (the implementation from the gensim library) was used for training. All models use a vector dimension of 300 and a context window of 5 words. The frequency threshold (the minimum number of occurrences of a word) depends on the corpus:
- 5 occurrences for the Main, Middle Russian, "Russian Classics", and "From 2 to 15" corpora and the Central Media corpus;
- 7 occurrences for the Regional Media corpus;
- 10 occurrences for the Educational corpus.
If a corpus has no manual annotation of sentences and words, its texts are split into sentences and tokens with the tokenizer before training. The texts are then lemmatized and tagged with parts of speech according to the RNC morphological standard using the Rubic model.
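For illustration, a minimal sketch of this training setup with gensim (the toy sentences and the output path are hypothetical; in reality the input is the preprocessed corpus described above):

from gensim.models import Word2Vec

# Toy preprocessed corpus: each sentence is a list of lemma_POS tokens,
# as produced by lemmatization and POS tagging (RNC tagset, e.g. S = noun, V = verb)
sentences = [
    ["лингвистика_S", "изучать_V", "язык_S"],
    ["корпус_S", "содержать_V", "текст_S"],
]

# CBOW (sg=0), 300-dimensional vectors, window of 5 words
model = Word2Vec(
    sentences=sentences,
    sg=0,
    vector_size=300,
    window=5,
    min_count=1,  # toy corpus; real training uses the per-corpus threshold (5, 7, or 10)
)
model.save("my_corpus_w2v.model")  # hypothetical output path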
To run the model, you need to install the gensim library:
pip install gensim~=4.3.1
from gensim.models import Word2Vec

if __name__ == "__main__":
    # Path to the downloaded word2vec model
    MODEL_PATH = "YOUR_PATH_HERE.model"
    model = Word2Vec.load(MODEL_PATH)

    # The query format is lemma_POS; here "S" is the noun tag
    # in the RNC morphological standard
    print(model.wv.most_similar('лингвистика_S', topn=10))
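The call returns the ten nearest neighbours of the query word together with their cosine similarities.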
- Model for the Main corpus:
- Model for the Regional Media corpus:
- Model for the "Russian Classics" corpus:
- Model for the Middle Russian corpus:
- Model for the Old East Slavic corpus:
- Model for the corpus "From 2 to 15":
- Model for the Educational corpus:
Models for morphemic annotation
To generate morphemic annotation for words that are not included in dictionaries, the RNC uses a model whose architecture is based on an ensemble of convolutional neural networks, proposed by A. Sorokin and A. Kravtsova. There are two such models in the RNC, trained on different data:
- for the Main corpus, we use a model trained on the Morphodict-K morphemic dictionary, developed specifically for the Corpus following the principles of the "Dictionary of Morphemes of the Russian Language" by A. I. Kuznetsova and T. F. Efremova (Moscow, 1986);
- for the Educational corpus, we use a model trained on the Morphodict-T morphemic dictionary, based on the "Morphemic-Spelling Dictionary" by A. N. Tikhonov (2002).
These dictionaries differ in how they divide words into morphemes; you can read more about the principles of word-formation markup used in the Corpus here. We assessed the quality of the resulting models with five-fold cross-validation, using the five metrics from the work of A. Sorokin and A. Kravtsova (an illustrative sketch of the underlying segmentation task follows the table below).
| Metric | Morphodict-T, % | Morphodict-K, % |
| --- | --- | --- |
| Precision | 97.79 | 98.58 |
| Recall | 98.38 | 98.74 |
| F1 | 98.09 | 98.66 |
| Accuracy | 96.61 | 97.40 |
| WordAccuracy | 88.49 | 90.82 |
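For illustration, this line of work typically treats morpheme segmentation as character-level sequence labeling. A minimal sketch of that framing, assuming a simplified BMES label scheme (the helper and labels below are illustrative, not the RNC implementation):

def to_char_labels(morphemes):
    """Convert a segmented word into per-character BMES-style labels."""
    labels = []
    for m in morphemes:
        if len(m) == 1:
            labels.append("S")  # single-character morpheme
        else:
            labels.append("B")  # first character of a morpheme
            labels.extend("M" * (len(m) - 2))  # inner characters
            labels.append("E")  # last character of a morpheme
    return labels

# "пере|дел|к|а" -> one label per character of "переделка"
print(to_char_labels(["пере", "дел", "к", "а"]))
# ['B', 'M', 'M', 'E', 'B', 'M', 'E', 'S', 'S']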
To run the model, you should download the library. You can install the necessary resources and run the algorithm using the mpe_morphemes.sh script located in the archive with the model.
- Model trained on the Morphodict-K morphemic dictionary:
- Model trained on the Morphodict-T morphemic dictionary:
When using the morphemic models in scientific work, please cite the following article:
T. Garipov, D. Morozov, and A. Glazkova, "Generalization Ability of CNN-Based Morpheme Segmentation," 2023 Ivannikov Ispras Open Conference (ISPRAS), Moscow, Russian Federation, 2023, pp. 58-62, doi: 10.1109/ISPRAS60948.2023.10508171.
Models for meta tagging of texts
Currently, the RNC uses three models for marking up text features: one for genres in the Social Networks corpus, and two more for topics and types in the Regional Media corpus. For each of these tasks, a training sample consisting of texts from the corresponding corpus was prepared; the RuRoBERTa model was then trained on these data.
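The launch scripts shipped in the archives cover inference; purely for illustration, here is a minimal sketch of running such a fine-tuned classifier with the Hugging Face transformers library (the path is a placeholder, and the label set and expected preprocessing come from the downloaded archive):

from transformers import pipeline

# Placeholder path to an unpacked model archive
MODEL_PATH = "YOUR_PATH_HERE"

# A fine-tuned RuRoBERTa checkpoint used as a text classifier
clf = pipeline("text-classification", model=MODEL_PATH)

print(clf("Выставка местных художников открылась в городском музее."))
# e.g. [{'label': '...', 'score': 0.97}] -- the labels depend on the model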
Genre meta tagging
The genre tagging model assigns one of 19 categories, for example gratitude, horoscope, or anecdote. The full list of classes is specified in the launch script included in the archive with the model.
The launch script for the model and requirements.txt are located in the downloadable archive with the model.
Topic meta tagging
The topic tagging model assigns one of 24 categories of texts, such as nature, industry, or art and culture. The full list of classes is specified in the launch script included in the archive with the model.
The launch script for the model and requirements.txt are located in the downloadable archive with the model.
Type meta tagging
The type tagging model assigns a text to one of 21 categories, such as appeals, notes, or interviews. The full list of classes is specified in the launch script included in the archive with the model.
The launch script for the model and requirements.txt are located in the downloadable archive with the model.