
Corpus Statistics

In January 2008, the Russian National Corpus contained 52 392 texts comprising 149 357 020 tokens.

I. Texts by subcorpora

Subcorpus                               Number of texts  Number of tokens
The main corpus                                  42 387       147 577 522
  disambiguated within the main corpus            2 215         5 884 661
The dialectal corpus                                122           144 099
The poetry corpus                                 9 675         2 586 710
The educational corpus                              230           649 684

II. Texts within the main corpus by type and other meta features

Text type          Number of texts  Number of tokens  Percentage of tokens
Fiction                      3 893        58 547 176                  39,7
Non-fiction                 37 249        83 218 964                  56,4
Oral presentation            1 245         5 810 482                   3,9
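
The percentage column is consistent with each row's token count divided by the sum of the three rows. The page does not state the formula, so that reading is an assumption. A minimal Python sketch of the arithmetic (the function name token_shares is illustrative and not part of any RNC tool):

```python
# Token counts copied from the text-type table above.
token_counts = {
    "Fiction": 58_547_176,
    "Non-fiction": 83_218_964,
    "Oral presentation": 5_810_482,
}


def token_shares(counts):
    """Return each category's share of the summed counts, in percent, to one decimal."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


shares = token_shares(token_counts)
for name, share in shares.items():
    print(f"{name}: {share}")
```

The same function can be applied to the genre, domain, theme and speech-type tables by swapping in their counts.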

Fiction

Genre                  Number of texts  Number of tokens  Percentage of tokens
Autobiography                        3           323 808                   0,6
Crime                               65         4 796 573                   8,2
Children's literature              380         2 479 165                   4,2
Documentary                        109         3 177 007                   5,4
Drama                               92         1 236 635                   2,1
Historical prose                    82         2 968 863                   5,1
Love story                          29           622 838                   1,1
No genre                         2 008        35 199 610                  60,1
Adventure                           11         1 246 928                   2,1
Sci-fi                             174         3 902 519                   6,7
Humour and satire                  788         1 566 038                   2,7
Miscellaneous                      152         1 027 541                   1,8
Overall                          3 893        58 547 525                 100,0

Non-fiction

Domain                    Number of texts  Number of tokens  Percentage of tokens
Day-to-day life                       590         1 986 286                   2,4
Official and business               1 244         1 814 812                   2,2
Technical                             131           233 569                   0,3
Journalism                         28 659        56 549 987                  68,0
Advertising                         1 221           513 955                   0,6
Academic                            4 578        18 269 877                  22,0
Theological                           735         2 540 674                   3,1
Electronic communication               91         1 310 804                   1,6
Overall                            37 249        83 219 964                 100,0

Theme                                   Number of texts  Number of tokens  Percentage of tokens
Administration and management                       344           233 211                   0,3
Army and armed conflict                             705         1 668 166                   2,0
Astrology, parapsychology, esoterica                 54            49 607                   0,1
Business, commerce, economics, finance            3 447         2 721 353                   3,3
Home and home economy                               824           671 190                   0,8
Leisure and entertainment                           807           516 326                   0,6
Health and medicine                                 856         2 227 828                   1,5
IT                                                   16            14 583                   0,0
Art and culture                                   3 289         5 236 518                   6,3
Crime                                               658           496 855                   0,6
Memoirs and diaries                                 419        20 604 802                  24,8
Science and technology                            5 446        17 334 780                  20,8
Education                                           163           172 806                   0,2
Politics and society                             11 217        15 526 635                  18,7
Law                                                 506         1 359 877                   1,6
Nature                                              327           550 070                   0,7
Industry                                          1 007           912 649                   1,1
Religion                                          1 036         3 502 679                   4,2
Agriculture                                         211           129 089                   0,2
Sport                                             1 377         1 779 891                   2,1
Machinery                                           869           820 206                   1,0
Transport                                           169           162 310                   0,2
Philosophy                                           90         1 411 709                   1,7
Private life                                      2 927         5 730 035                   6,9
Overall                                          37 249        83 219 964                 100,0

Oral presentation

Type                Number of texts  Number of tokens  Percentage of tokens
Public speech                   617         3 738 790                  64,3
Spontaneous speech              445           470 597                   8,1
Movie                           183         1 601 095                  27,6
Overall                       1 245         5 810 482                 100,0

III. Tokens by part of speech
(Disambiguated corpus only; as of November 28, 2007, the disambiguated corpus comprised 5,5 million tokens.)

Part of speech                        Number of tokens  Percentage of tokens
Noun                                         1 554 272                 28,50
Adjective                                      465 743                  8,54
Numeral                                         82 809                  1,52
  of these, written as words                    39 827                  0,73
  of these, written as digits                   42 982                  0,79
Numeral adjective                               21 081                  0,39
Verb                                           931 687                 17,08
Adverb                                         222 502                  4,08
Predicative                                     38 260                  0,70
Parenthesis                                     24 954                  0,46
Pronoun                                        443 205                  8,13
Adjectival pronoun                             255 772                  4,69
Adverbial pronoun                              120 568                  2,21
Predicative pronoun (некого, нечего)               602                  0,01
Preposition                                    568 295                 10,42
Conjunction                                    433 815                  7,95
Particle                                       258 085                  4,73
Interjection                                     7 192                  0,13
Initial                                          9 726                  0,18
Other (foreign words, onomatopoeia)             15 781                  0,29
Overall                                      5 454 349                100,00
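
The figures on this page group thousands with spaces and use a decimal comma, so they need light normalisation before any arithmetic. The Python sketch below shows one way to read them. The helper names parse_count and parse_percent are hypothetical, and the check assumes that a row's percentage is its token count divided by the Overall count.

```python
def parse_count(text):
    """Parse a space-grouped integer such as '1 554 272'."""
    # Drop ordinary and non-breaking spaces used as thousands separators.
    return int(text.replace("\u00a0", "").replace(" ", ""))


def parse_percent(text):
    """Parse a decimal-comma percentage such as '28,50'."""
    return float(text.replace(",", "."))


noun_tokens = parse_count("1 554 272")   # Noun row
all_tokens = parse_count("5 454 349")    # Overall row
listed = parse_percent("28,50")          # percentage as printed in the table

# Recompute the share from the counts and compare it with the printed value.
computed = round(100 * noun_tokens / all_tokens, 2)
print(noun_tokens, all_tokens, listed, computed)
```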

In the Russian version of the RNC, the frequency lists of the main corpora (tokens, bigrams, trigrams, 4-word and 5-word collocations) are available at this address.
Russian National Corpus
© 2003–2017
info@ruscorpora.ru