Corpus Statistics
As of January 2008, the Russian National Corpus contained 52 392 texts comprising 149 357 020 tokens.
I. Texts by subcorpora
Subcorpus | Number of texts | Number of tokens
---|---|---
The main corpus | 42 387 | 147 577 522
of these, disambiguated within the main corpus | 2 215 | 5 884 661
The dialectal corpus | 122 | 144 099
The poetry corpus | 9 675 | 2 586 710
The educational corpus | 230 | 649 684

II. Texts within the main corpus by type and other meta features
Text type | Number of texts | Number of tokens | Percentage of tokens
---|---|---|---
Fiction | 3 893 | 58 547 176 | 39,7
Non-fiction | 37 249 | 83 218 964 | 56,4
Oral presentation | 1 245 | 5 810 482 | 3,9

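The percentage column can be reproduced from the token counts. The following is a minimal sketch, assuming the figures are copied as strings in this page's notation (spaces as thousands separators, a comma as the decimal mark). The helper names `parse_count` and `share` are hypothetical and are not part of the RNC.

```python
def parse_count(text: str) -> int:
    """Convert a space-grouped count such as '58 547 176' to an int."""
    # Also strip non-breaking spaces, which web pages often use as separators.
    return int(text.replace(" ", "").replace("\u00a0", ""))


def share(part: int, total: int) -> str:
    """Return part/total as a percentage with one decimal and a decimal comma."""
    return f"{100 * part / total:.1f}".replace(".", ",")


# Token counts by text type, copied from the table above.
tokens_by_type = {
    "Fiction": parse_count("58 547 176"),
    "Non-fiction": parse_count("83 218 964"),
    "Oral presentation": parse_count("5 810 482"),
}

# Assumption: each percentage is taken relative to the sum of the three rows.
total = sum(tokens_by_type.values())
shares = {name: share(count, total) for name, count in tokens_by_type.items()}
```

Under that assumption, `shares` matches the published column: 39,7, 56,4 and 3,9.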
Fiction
Genre | Number of texts | Number of tokens | Percentage of tokens
---|---|---|---
Autobiography | 3 | 323 808 | 0,6
Crime | 65 | 4 796 573 | 8,2
Children's literature | 380 | 2 479 165 | 4,2
Documentary | 109 | 3 177 007 | 5,4
Drama | 92 | 1 236 635 | 2,1
Historical prose | 82 | 2 968 863 | 5,1
Love story | 29 | 622 838 | 1,1
No genre | 2 008 | 35 199 610 | 60,1
Adventure | 11 | 1 246 928 | 2,1
Sci-fi | 174 | 3 902 519 | 6,7
Humour and satire | 788 | 1 566 038 | 2,7
Miscellaneous | 152 | 1 027 541 | 1,8
Overall | 3 893 | 58 547 525 | 100,0

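When reusing these figures, it can help to check that the Overall row agrees with the individual rows. The following is a rough sketch, assuming the Overall row is meant to be the plain column sum of the genre rows. The variable names are illustrative only.

```python
# (genre, number of texts, number of tokens), copied from the Fiction table.
fiction_rows = [
    ("Autobiography", 3, 323_808),
    ("Crime", 65, 4_796_573),
    ("Children's literature", 380, 2_479_165),
    ("Documentary", 109, 3_177_007),
    ("Drama", 92, 1_236_635),
    ("Historical prose", 82, 2_968_863),
    ("Love story", 29, 622_838),
    ("No genre", 2_008, 35_199_610),
    ("Adventure", 11, 1_246_928),
    ("Sci-fi", 174, 3_902_519),
    ("Humour and satire", 788, 1_566_038),
    ("Miscellaneous", 152, 1_027_541),
]

# Values from the Overall row of the same table.
overall_texts, overall_tokens = 3_893, 58_547_525

texts_sum = sum(texts for _, texts, _ in fiction_rows)
tokens_sum = sum(tokens for _, _, tokens in fiction_rows)

texts_match = texts_sum == overall_texts
tokens_match = tokens_sum == overall_tokens
```

Both checks pass for the Fiction table. The Fiction row in section II gives 58 547 176 tokens, so that figure differs from this Overall row by 349 tokens.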
Non-fiction
Domain | Number of texts | Number of tokens | Percentage of tokens
---|---|---|---
Day-to-day life | 590 | 1 986 286 | 2,4
Official and business | 1 244 | 1 814 812 | 2,2
Technical | 131 | 233 569 | 0,3
Journalism | 28 659 | 56 549 987 | 68,0
Advertising | 1 221 | 513 955 | 0,6
Academic | 4 578 | 18 269 877 | 22,0
Theological | 735 | 2 540 674 | 3,1
Electronic communication | 91 | 1 310 804 | 1,6
Overall | 37 249 | 83 219 964 | 100,0

Theme | Number of texts | Number of tokens | Percentage of tokens
---|---|---|---
Administration and management | 344 | 233 211 | 0,3
Army and armed conflict | 705 | 1 668 166 | 2,0
Astrology, parapsychology, esoterica | 54 | 49 607 | 0,1
Business, commerce, economics, finance | 3 447 | 2 721 353 | 3,3
Home and home economy | 824 | 671 190 | 0,8
Leisure and entertainment | 807 | 516 326 | 0,6
Health and medicine | 856 | 2 227 828 | 1,5
IT | 16 | 14 583 | 0,0
Art and culture | 3 289 | 5 236 518 | 6,3
Crime | 658 | 496 855 | 0,6
Memoirs and diaries | 419 | 20 604 802 | 24,8
Science and technology | 5 446 | 17 334 780 | 20,8
Education | 163 | 172 806 | 0,2
Politics and society | 11 217 | 15 526 635 | 18,7
Law | 506 | 1 359 877 | 1,6
Nature | 327 | 550 070 | 0,7
Industry | 1 007 | 912 649 | 1,1
Religion | 1 036 | 3 502 679 | 4,2
Agriculture | 211 | 129 089 | 0,2
Sport | 1 377 | 1 779 891 | 2,1
Machinery | 869 | 820 206 | 1,0
Transport | 169 | 162 310 | 0,2
Philosophy | 90 | 1 411 709 | 1,7
Private life | 2 927 | 5 730 035 | 6,9
Overall | 37 249 | 83 219 964 | 100,0

Oral presentation
Type | Number of texts | Number of tokens | Percentage of tokens
---|---|---|---
Public speech | 617 | 3 738 790 | 64,3
Spontaneous speech | 445 | 470 597 | 8,1
Movie | 183 | 1 601 095 | 27,6
Overall | 1 245 | 5 810 482 | 100,0

III. Tokens by part of speech
(Disambiguated corpus only; as of November 28, 2007, the disambiguated corpus contained 5,5 million tokens.)
Part of speech | Number of tokens | Percentage of tokens
---|---|---
Noun | 1 554 272 | 28,50
Adjective | 465 743 | 8,54
Numeral | 82 809 | 1,52
of these, recorded in writing | 39 827 | 0,73
of these, recorded in numbers | 42 982 | 0,79
Numeral adjective | 21 081 | 0,39
Verb | 931 687 | 17,08
Adverb | 222 502 | 4,08
Predicative | 38 260 | 0,70
Parenthesis | 24 954 | 0,46
Pronoun | 443 205 | 8,13
Adjectival pronoun | 255 772 | 4,69
Adverbial pronoun | 120 568 | 2,21
Predicative pronoun (некого, нечего) | 602 | 0,01
Preposition | 568 295 | 10,42
Conjunction | 433 815 | 7,95
Particle | 258 085 | 4,73
Interjection | 7 192 | 0,13
Initial | 9 726 | 0,18
Other (foreign words, onomatopoeia) | 15 781 | 0,29
Overall | 5 454 349 | 100,00

In the Russian version of the RNC, the frequency lists of the main corpora (tokens, bigrams, trigrams, 4-word and 5-word collocations) are available at this address.
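For readers unfamiliar with such lists, the sketch below shows one common way to count tokens, bigrams and longer word sequences. It is an illustration only and not the RNC's own procedure: splitting on whitespace is an assumed stand-in for real tokenization, and the function names `ngrams` and `frequency_list` are hypothetical.

```python
from collections import Counter
from itertools import islice


def ngrams(tokens, n):
    """Return an iterator over consecutive n-token tuples of a token list."""
    # Zip the list with itself shifted by 0, 1, ..., n-1 positions.
    return zip(*(islice(tokens, i, None) for i in range(n)))


def frequency_list(tokens, n):
    """Return (n-gram, count) pairs sorted from most to least frequent."""
    return Counter(ngrams(tokens, n)).most_common()


# Toy input; whitespace splitting is an assumption made for this sketch.
tokens = "я не знаю я не помню".split()

unigram_freq = frequency_list(tokens, 1)
bigram_freq = frequency_list(tokens, 2)
trigram_freq = frequency_list(tokens, 3)
```

Here the most frequent bigram is `("я", "не")`, which occurs twice. Passing `n=4` or `n=5` gives the longer sequences in the same way.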