Corpora statistics

This page presents general statistical information about the corpora included in the Russian National Corpus (RNC).

For some RNC corpora, extended statistics are available. In addition to the number of texts and words, they include charts showing the distribution of metadata attributes, geographic maps, and graphs of corpus volume by country and region (for corpora with regional annotation). To open the statistics for a corpus, click the statistics icon in the corpus header or click the corpus name on this page.

In corpora with extended statistics, you can also compare a user-defined subcorpus with the entire corpus. To view the comparison, click the corresponding icon in the subcorpus header.

Number of texts

Texts by subcorpora

Corpora Number of texts Number of sentences Number of tokens Percentage of tokens
Main 133,554 31,626,104 389,471,513 17.1%
including manually disambiguated 2,164 514,169 5,988,177 0.3%
Media 3,005,754 60,339,054 887,371,125 39.0%
National media 2,732,793 55,304,984 817,589,431 35.9%
Regional & international 272,961 5,034,070 69,781,694 3.1%
SynTagRus 1,392 111,213 1,595,659 0.1%
Social networks 1,769,770 15,576,882 166,423,951 7.3%
Spoken 4,675 2,072,584 15,034,870 0.7%
Accentological 1,344,006 13,599,857 136,104,117 6.0%
Multimedia 1,460 1,040,523 5,957,296 0.3%
MultiPARC 83 103,560 628,025 0.0%
Russian 52 56,711 375,154 0.0%
English-Russian 31 46,849 252,871 0.0%
Parallel 13,791 16,619,044 214,046,070 9.4%
English 1,556 3,476,864 51,982,439 2.3%
Armenian 28 126,636 1,570,735 0.1%
Bashkir 124 124,270 550,387 0.0%
Belarusian 312 1,162,868 10,916,697 0.5%
Bulgarian 59 418,986 5,159,901 0.2%
Buryat 7 30,750 401,516 0.0%
Veps 989 40,780 343,133 0.0%
Spanish 168 492,464 7,359,538 0.3%
Italian 126 302,264 4,930,970 0.2%
Karelian 2,355 125,702 1,223,760 0.1%
Chinese 1,075 253,500 4,422,747 0.2%
Korean 185 12,300 73,752 0.0%
Latvian 245 410,438 4,398,564 0.2%
Lithuanian 65 72,244 702,471 0.0%
German 299 2,234,024 32,276,755 1.4%
Polish 54 501,800 6,355,629 0.3%
Portuguese 38 88,572 1,602,412 0.1%
Romanian 31 60,140 903,375 0.0%
Serbian 37 144,027 1,903,176 0.1%
Slovene 53 173,172 1,989,641 0.1%
Ukrainian 865 919,426 9,383,774 0.4%
Finnish 320 299,184 3,741,431 0.2%
French 67 498,180 7,631,430 0.3%
Khakas 331 126,710 1,194,971 0.1%
Hindi β 9 9,292 122,347 0.0%
Romani 20 16,240 185,142 0.0%
Czech 556 334,562 4,387,470 0.2%
Chuvash 2,820 2,375,948 24,168,622 1.1%
Swedish 787 1,344,054 16,520,152 0.7%
Estonian 95 192,493 2,154,889 0.1%
Japanese 103 31,512 453,279 0.0%
Multilingual 12 219,642 5,034,965 0.2%
Dialect 3,067 80,102 790,310 0.0%
Educational 1,965 1,268,207 14,760,487 0.6%
From 2 to 15 75 413,494 4,413,372 0.2%
Poetry 104,714 1,382,236 14,257,912 0.6%
Russian classics β 35,960 2,158,672 26,238,321 1.2%
Historical 12,171 859,426 16,292,589 0.7%
Old East Slavic 425 - 913,827 0.0%
Inscriptions 749 - 6,039 0.0%
Birchbark letters 1,249 1,249 23,932 0.0%
Middle Russian 8,301 442,260 9,983,886 0.4%
Church Slavonic 1,447 415,917 5,364,905 0.2%
Panchronic 141,035 30,890,027 384,096,728 16.9%
Total 6,573,472 178,140,985 2,277,482,345 100%

Text types

Texts in the Main corpus by type and other metadata attributes

Text type Number of texts Number of sentences Number of tokens Percentage of tokens
Non-fiction 122,251 16,721,496 231,498,631 59.4%
Fiction 11,303 14,904,608 157,972,882 40.6%
Total 133,554 31,626,104 389,471,513 100%

Fiction

Genre Number of texts Number of sentences Number of tokens Percentage of tokens
Crime 138 817,058 7,656,452 4.8%
Children's literature 860 759,540 7,006,759 4.4%
Nonfiction 464 1,062,528 12,620,561 7.9%
Drama 306 636,194 3,440,098 2.1%
Historical prose 295 1,287,535 14,918,140 9.3%
Love story 69 190,000 1,805,976 1.1%
Medical prose 3 17,773 170,643 0.1%
No genre 6,366 8,222,858 90,763,926 56.5%
Translation 30 43,291 696,938 0.4%
Adventure 273 519,559 5,506,668 3.4%
Sentimental fiction 30 10,463 167,255 0.1%
Thriller 1 6,950 60,653 0.0%
Sci-fi 774 1,007,091 10,150,032 6.3%
Folklore 77 8,715 180,657 0.1%
Humour and satire 1,560 569,944 5,614,440 3.5%
Total 11,246 15,159,499 160,759,198 100%

Non-fiction

Domain Number of texts Number of sentences Number of tokens Percentage of tokens
Day-to-day life 6,802 3,214,744 33,710,023 14.3%
Official and business 3,660 353,924 5,375,466 2.3%
Technical 1,211 116,853 1,639,468 0.7%
Journalism 98,350 9,936,754 140,263,052 59.4%
Advertising 2,153 76,326 844,064 0.4%
Academic 8,369 2,565,262 44,186,654 18.7%
Fiction 57 124,643 1,257,854 0.5%
Theological 1,218 332,874 5,290,038 2.2%
Electronic communication 877 336,484 3,382,171 1.4%
Total 122,697 17,057,864 235,948,790 100%

Text topic Number of texts Number of sentences Number of tokens Percentage of tokens
Administration and management 17,487 1,430,212 17,800,937 4.5%
Anthropology 10 15,284 313,667 0.1%
Army and armed conflict 12,778 1,244,133 15,577,913 4.0%
Archaeology 21 2,021 29,228 0.0%
Astrology, parapsychology, esoterica 432 99,808 1,035,462 0.3%
Astronomy 449 41,100 648,036 0.2%
Business, commerce, economics, finance 12,348 741,269 10,337,210 2.6%
Biology 1,257 297,158 4,732,307 1.2%
Military affairs 13 11,495 244,685 0.1%
Geography 470 216,106 3,711,174 0.9%
Geodesy 1 746 15,342 0.0%
Geology 631 128,256 1,872,562 0.5%
Mining industry 393 25,102 419,263 0.1%
Home and home economy 1,342 130,787 1,925,568 0.5%
Leisure and entertainment 5,878 457,482 4,844,745 1.2%
Natural science 679 192,826 2,293,044 0.6%
Natural history 30 13,084 210,379 0.1%
Health and medicine 6,098 498,185 6,683,894 1.7%
IT 691 82,659 1,318,016 0.3%
Art and culture 18,724 3,473,674 41,890,867 10.6%
Art history 122 36,553 570,920 0.1%
History 5,373 1,758,270 27,607,760 7.0%
Crime 10,701 367,033 3,927,182 1.0%
Culturology 732 193,298 3,297,462 0.8%
Light industry, food industry 329 23,991 371,260 0.1%
Forestry 94 9,430 146,011 0.0%
Logic 1 3,478 51,815 0.0%
Mathematics 222 41,315 608,028 0.2%
Machinery 25 1,987 30,883 0.0%
Metallurgy 21 2,078 32,288 0.0%
Science and technology 11,860 2,427,720 40,645,938 10.3%
Education 4,146 656,607 7,610,142 1.9%
Politics and society 34,650 4,016,535 54,470,879 13.8%
Political science 18 7,009 117,321 0.0%
Law 3,704 301,081 4,697,751 1.2%
Nature 4,621 538,074 6,278,423 1.6%
Industry 5,093 348,168 4,415,326 1.1%
Accidents 237 9,552 97,367 0.0%
Psychology 712 176,527 2,811,205 0.7%
Travel 2,337 935,298 12,757,067 3.2%
Religion 7,016 1,055,934 14,963,391 3.8%
Agriculture 2,186 238,562 3,204,090 0.8%
Sociology 513 148,612 2,440,354 0.6%
Sport 4,206 288,331 3,697,728 0.9%
Statistics 374 18,439 286,021 0.1%
Construction, architecture 2,258 161,675 2,097,445 0.5%
Technology 8,279 585,399 7,514,823 1.9%
Transport 4,994 220,232 2,441,053 0.6%
Physics 1,359 126,230 1,930,683 0.5%
Philology 1,097 385,822 6,524,230 1.7%
Philosophy 893 502,349 8,862,166 2.3%
Chemical industry 108 8,028 114,948 0.0%
Chemistry 1,168 139,780 2,067,993 0.5%
Private life 21,568 4,569,054 50,039,236 12.7%
Electronics 748 45,599 670,144 0.2%
Energy industry 177 18,067 277,405 0.1%
Ethnography 7 8,098 164,164 0.0%
Total 221,681 29,475,602 393,745,201 100%

Dates

Texts in the Main corpus by date of creation

Date Number of texts Number of sentences Number of tokens Percentage of tokens
1651 - 1700 5 16,633 332,648 0.1%
1701 - 1750 382 64,591 1,255,872 0.3%
1751 - 1800 1,931 338,302 6,631,307 1.7%
1801 - 1850 3,323 1,195,349 18,670,582 4.7%
1851 - 1900 4,976 4,587,242 64,977,272 16.2%
1901 - 1950 57,915 8,883,269 103,840,824 26.0%
1951 - 2000 21,998 9,984,063 111,278,555 27.8%
2001 - 2050 43,485 7,395,050 93,108,395 23.3%
Total 134,015 32,464,499 400,095,455 100%

Parts of speech

Tokens by part of speech (manually disambiguated subcorpus only)

Part of speech Number of tokens Percentage of tokens
Noun 1,718,410 28.7%
Adjective 510,957 8.5%
Numeral 96,851 1.6%
of which, written as words 43,034 0.7%
of which, written as digits 53,817 0.9%
Numeral adjective 24,589 0.4%
Verb 1,013,248 16.9%
Adverb 253,573 4.2%
Predicative 42,762 0.7%
Parenthetical word 26,721 0.4%
Pronoun 471,700 7.9%
Adjectival pronoun 280,716 4.7%
Adverbial pronoun 130,434 2.2%
Predicative pronoun (некого, нечего) 678 0.0%
Preposition 626,906 10.5%
Conjunction 475,769 7.9%
Particle 266,675 4.5%
Interjection 8,628 0.1%
Initial 10,002 0.2%
Other (foreign words, onomatopoeia) 29,536 0.5%
Total 5,988,155 100%

Updated on