RNC News

04.12.2023

The speech collections within the Accentological and Spoken corpora have been expanded. Transcripts of academic and political talks, TV and radio broadcasts, personal oral histories, and everyday dialogic speech have been added. The Spoken corpus now contains 14 million tokens, and the Accentological corpus as a whole, including the naive poetry collection, contains 134.8 million tokens.

The parallel corpus has been expanded by 3 million tokens. New texts have been added to the Czech-Russian, English-Russian, French-Russian, German-Russian, Portuguese-Russian, and Spanish-Russian language pairs. In particular, the English-Russian tier now includes a collection of transcripts of public TED Talks, while the Portuguese-Russian subcorpus has almost doubled in size and now also includes texts created in Portuguese-speaking Africa.

In the Social Networks corpus, genres are now marked automatically for all texts. Users can select one or more genres from the list. Several new genres, such as picture captions, have been identified.
Properties generated by NeuroRNC are marked with a special icon. If you notice an inaccuracy or error, feel free to report it using the “Report an Error” button in the same window.

