Corpus Statistics
In January 2008, the Russian National Corpus contained 52 392 texts totaling 149 357 020 tokens.
I. Texts by subcorpora

| Subcorpus | Number of texts | Number of tokens |
|---|---|---|
| The main corpus | 42 387 | 147 577 522 |
| disambiguated within the main corpus | 2 215 | 5 884 661 |
| The dialectal corpus | 122 | 144 099 |
| The poetry corpus | 9 675 | 2 586 710 |
| The educational corpus | 230 | 649 684 |

II. Texts within the main corpus by type and other meta features

| Text type | Number of texts | Number of tokens | Percentage of tokens |
|---|---|---|---|
| Fiction | 3 893 | 58 547 176 | 39,7 |
| Non-fiction | 37 249 | 83 218 964 | 56,4 |
| Oral presentation | 1 245 | 5 810 482 | 3,9 |

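The percentage column can be reproduced from the token counts. The following is a minimal sketch, assuming that each percentage is the row's token count divided by the sum of the three rows (the page does not state which denominator it uses). `parse_count` and `share` are hypothetical helper names introduced here for illustration only.

```python
def parse_count(text: str) -> int:
    """Parse a count written with spaces as thousands separators."""
    return int(text.replace("\u00a0", "").replace(" ", ""))


def share(part: int, whole: int, digits: int = 1) -> str:
    """Return part/whole as a percentage string with a decimal comma."""
    return f"{100 * part / whole:.{digits}f}".replace(".", ",")


# Token counts copied from the text type table.
tokens = {
    "Fiction": parse_count("58 547 176"),
    "Non-fiction": parse_count("83 218 964"),
    "Oral presentation": parse_count("5 810 482"),
}

# Assumption: the denominator is the sum of the three rows.
total = sum(tokens.values())
shares = {name: share(count, total) for name, count in tokens.items()}

for name, value in shares.items():
    print(f"{name}: {value}")
```

Under that assumption the sketch yields 39,7, 56,4 and 3,9, the same values as the table.
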
Fiction

| Genre | Number of texts | Number of tokens | Percentage of tokens |
|---|---|---|---|
| Autobiography | 3 | 323 808 | 0,6 |
| Crime | 65 | 4 796 573 | 8,2 |
| Children's literature | 380 | 2 479 165 | 4,2 |
| Documentary | 109 | 3 177 007 | 5,4 |
| Drama | 92 | 1 236 635 | 2,1 |
| Historical prose | 82 | 2 968 863 | 5,1 |
| Love story | 29 | 622 838 | 1,1 |
| No genre | 2 008 | 35 199 610 | 60,1 |
| Adventure | 11 | 1 246 928 | 2,1 |
| Sci-fi | 174 | 3 902 519 | 6,7 |
| Humour and satire | 788 | 1 566 038 | 2,7 |
| Miscellaneous | 152 | 1 027 541 | 1,8 |
| Overall | 3 893 | 58 547 525 | 100,0 |

Non-fiction

| Domain | Number of texts | Number of tokens | Percentage of tokens |
|---|---|---|---|
| Day-to-day life | 590 | 1 986 286 | 2,4 |
| Official and business | 1 244 | 1 814 812 | 2,2 |
| Technical | 131 | 233 569 | 0,3 |
| Journalism | 28 659 | 56 549 987 | 68,0 |
| Advertising | 1 221 | 513 955 | 0,6 |
| Academic | 4 578 | 18 269 877 | 22,0 |
| Theological | 735 | 2 540 674 | 3,1 |
| Electronic communication | 91 | 1 310 804 | 1,6 |
| Overall | 37 249 | 83 219 964 | 100,0 |


| Theme | Number of texts | Number of tokens | Percentage of tokens |
|---|---|---|---|
| Administration and management | 344 | 233 211 | 0,3 |
| Army and armed conflict | 705 | 1 668 166 | 2,0 |
| Astrology, parapsychology, esoterica | 54 | 49 607 | 0,1 |
| Business, commerce, economics, finance | 3 447 | 2 721 353 | 3,3 |
| Home and home economy | 824 | 671 190 | 0,8 |
| Leisure and entertainment | 807 | 516 326 | 0,6 |
| Health and medicine | 856 | 2 227 828 | 1,5 |
| IT | 16 | 14 583 | 0,0 |
| Art and culture | 3 289 | 5 236 518 | 6,3 |
| Crime | 658 | 496 855 | 0,6 |
| Memoirs and diaries | 419 | 20 604 802 | 24,8 |
| Science and technology | 5 446 | 17 334 780 | 20,8 |
| Education | 163 | 172 806 | 0,2 |
| Politics and society | 11 217 | 15 526 635 | 18,7 |
| Law | 506 | 1 359 877 | 1,6 |
| Nature | 327 | 550 070 | 0,7 |
| Industry | 1 007 | 912 649 | 1,1 |
| Religion | 1 036 | 3 502 679 | 4,2 |
| Agriculture | 211 | 129 089 | 0,2 |
| Sport | 1 377 | 1 779 891 | 2,1 |
| Machinery | 869 | 820 206 | 1,0 |
| Transport | 169 | 162 310 | 0,2 |
| Philosophy | 90 | 1 411 709 | 1,7 |
| Private life | 2 927 | 5 730 035 | 6,9 |
| Overall | 37 249 | 83 219 964 | 100,0 |

Oral presentation

| Type | Number of texts | Number of tokens | Percentage of tokens |
|---|---|---|---|
| Public speech | 617 | 3 738 790 | 64,3 |
| Spontaneous speech | 445 | 470 597 | 8,1 |
| Movie | 183 | 1 601 095 | 27,6 |
| Overall | 1 245 | 5 810 482 | 100,0 |

III. Tokens by part of speech
(Disambiguated corpus only; as of November 28, 2007, the disambiguated corpus contained 5.5 million tokens.)

| Part of speech | Number of tokens | Percentage of tokens |
|---|---|---|
| Noun | 1 554 272 | 28,50 |
| Adjective | 465 743 | 8,54 |
| Numeral | 82 809 | 1,52 |
| of these, written as words | 39 827 | 0,73 |
| of these, written as digits | 42 982 | 0,79 |
| Numeral adjective | 21 081 | 0,39 |
| Verb | 931 687 | 17,08 |
| Adverb | 222 502 | 4,08 |
| Predicative | 38 260 | 0,70 |
| Parenthesis | 24 954 | 0,46 |
| Pronoun | 443 205 | 8,13 |
| Adjectival pronoun | 255 772 | 4,69 |
| Adverbial pronoun | 120 568 | 2,21 |
| Predicative pronoun (некого, нечего) | 602 | 0,01 |
| Preposition | 568 295 | 10,42 |
| Conjunction | 433 815 | 7,95 |
| Particle | 258 085 | 4,73 |
| Interjection | 7 192 | 0,13 |
| Initial | 9 726 | 0,18 |
| Other (foreign words, onomatopoeia) | 15 781 | 0,29 |
| Overall | 5 454 349 | 100,00 |

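The part-of-speech figures can be checked for internal consistency. The sketch below sums the top-level rows and compares the result with the reported overall count, then does the same for the two numeral sub-rows. The names `POS_TOKENS`, `NUMERAL_BREAKDOWN` and `check_total` are hypothetical, chosen for illustration only. It also assumes the two "of these" rows are a breakdown of the Numeral row and are therefore left out of the main sum.

```python
# Token counts per part of speech, copied from the table above.
POS_TOKENS = {
    "Noun": 1_554_272,
    "Adjective": 465_743,
    "Numeral": 82_809,
    "Numeral adjective": 21_081,
    "Verb": 931_687,
    "Adverb": 222_502,
    "Predicative": 38_260,
    "Parenthesis": 24_954,
    "Pronoun": 443_205,
    "Adjectival pronoun": 255_772,
    "Adverbial pronoun": 120_568,
    "Predicative pronoun": 602,
    "Preposition": 568_295,
    "Conjunction": 433_815,
    "Particle": 258_085,
    "Interjection": 7_192,
    "Initial": 9_726,
    "Other": 15_781,
}

# Assumption: the two "of these" rows split the Numeral row.
NUMERAL_BREAKDOWN = {"as_words": 39_827, "as_digits": 42_982}

REPORTED_OVERALL = 5_454_349


def check_total(rows: dict[str, int], reported: int) -> tuple[int, bool]:
    """Return the computed sum and whether it equals the reported total."""
    computed = sum(rows.values())
    return computed, computed == reported


computed_overall, overall_ok = check_total(POS_TOKENS, REPORTED_OVERALL)
computed_numeral, numeral_ok = check_total(NUMERAL_BREAKDOWN, POS_TOKENS["Numeral"])

# Share of nouns among all disambiguated tokens, in percent.
noun_share = round(100 * POS_TOKENS["Noun"] / REPORTED_OVERALL, 2)

print(computed_overall, overall_ok)
print(computed_numeral, numeral_ok)
print(noun_share)
```

Both checks pass for the figures in the table, and the noun share comes out at 28.5, which matches the 28,50 shown in the percentage column.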