Corpora statistics

This page presents general statistical information about the corpora included in the Russian National Corpus (RNC).

For some RNC corpora, extended statistics are available: in addition to the number of texts and words, they include charts showing the distribution of metadata attributes, geographic maps, and graphs of text volume by country and region (for corpora with regional annotation). To open a corpus's statistics, click the icon in the corpus header or click the corpus name on this page.

In corpora with extended statistics, users can also compare their own subcorpora with the entire corpus. To view the comparison, click the icon in the subcorpus header.

Number of texts

Texts by subcorpora

Corpora Number of texts Number of sentences Number of tokens
Main 133,554 31,626,104 389,471,513
including manually disambiguated 2,164 514,169 5,988,177
Media 3,011,119 60,480,411 889,599,850
National media 2,738,158 55,446,341 819,818,156
Regional & international 272,961 5,034,070 69,781,694
SynTagRus 1,392 111,213 1,595,659
Social networks 1,769,770 14,860,139 166,040,568
Spoken 4,675 2,072,584 15,034,870
Accentological 1,347,103 13,622,822 136,323,786
Multimedia 1,460 1,040,523 5,957,296
MultiPARC 83 103,560 628,025
Russian 52 56,711 375,154
English-Russian 31 46,849 252,871
Parallel 13,811 17,863,300 215,021,631
English 1,556 3,805,823 52,036,252
Armenian 28 137,181 1,571,826
Bashkir 124 124,625 550,275
Belarusian 312 1,189,984 10,963,841
Bulgarian 59 464,302 5,161,112
Buryat 7 35,978 351,186
Veps 989 40,783 315,712
Spanish 177 615,310 8,407,311
Italian 126 337,042 4,931,443
Karelian 2,355 126,370 1,099,296
Chinese 1,075 274,814 4,423,176
Korean 185 12,469 73,752
Latvian 245 419,499 4,401,672
Lithuanian 65 72,986 702,553
German 299 2,620,957 32,288,903
Polish 54 529,938 6,356,547
Portuguese 38 97,327 1,603,404
Romanian 31 63,388 903,638
Serbian 37 146,686 1,903,244
Slovene 53 177,211 1,989,760
Ukrainian 865 954,664 9,395,203
Finnish 320 312,182 3,747,753
French 67 554,364 7,636,874
Khakas 331 132,170 1,191,344
Hindi β 9 9,490 122,486
Romani 20 16,743 183,979
Czech 556 375,024 4,389,859
Chuvash 2,831 2,390,589 24,202,403
Swedish 787 1,375,132 16,503,062
Estonian 95 196,183 2,124,609
Japanese 103 34,444 454,191
Multilingual 12 219,642 5,034,965
Dialect 3,067 80,102 790,310
Educational 1,965 1,268,207 14,760,487
From 2 to 15 75 413,494 4,413,372
Poetry 107,811 1,405,201 14,477,581
Russian classics β 35,960 2,158,672 26,238,321
Historical 12,171 859,426 16,292,589
Old East Slavic 425 - 913,827
Inscriptions 749 - 6,039
Birchbark letters 1,249 1,249 23,932
Middle Russian 8,301 442,260 9,983,886
Church Slavonic 1,447 415,917 5,364,905
Panchronic 141,035 30,890,027 384,096,728
Total 643,636,369 2,313,689,371 13,570,359,906

Text types

Texts within the main corpus by type and other metadata attributes

Text type Number of texts Number of sentences Number of tokens Percentage of tokens
Non-fiction 122,251 16,721,496 231,498,631 59.4%
Fiction 11,303 14,904,608 157,972,882 40.6%
Total 133,554 31,626,104 389,471,513 100%
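
The "Percentage of tokens" column can be reproduced from the token counts. The following is a minimal sketch, assuming the percentage is simply each category's token count divided by the sum of the listed counts, rounded to one decimal place. The function name `token_shares` is illustrative and not part of any RNC tool.

```python
# Token counts copied from the "Text type" table above.
token_counts = {
    "Non-fiction": 231_498_631,
    "Fiction": 157_972_882,
}


def token_shares(counts):
    """Return each category's share of the summed counts, in percent (one decimal)."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


shares = token_shares(token_counts)
for name, share in shares.items():
    print(f"{name}: {share}%")
```

The same calculation applies to the genre, domain, date, and part-of-speech tables, using the Total row of each table as the denominator.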

Fiction

Genre Number of texts Number of sentences Number of tokens Percentage of tokens
Crime 138 817,058 7,656,452 4.8%
Children's literature 860 759,540 7,006,759 4.4%
Nonfiction 464 1,062,528 12,620,561 7.9%
Drama 306 636,194 3,440,098 2.1%
Historical prose 295 1,287,535 14,918,140 9.3%
Love story 69 190,000 1,805,976 1.1%
Medical prose 3 17,773 170,643 0.1%
No genre 6,366 8,222,858 90,763,926 56.5%
Translation 30 43,291 696,938 0.4%
Adventure 273 519,559 5,506,668 3.4%
Sentimental fiction 30 10,463 167,255 0.1%
Thriller 1 6,950 60,653 0.0%
Sci-fi 774 1,007,091 10,150,032 6.3%
Folklore 77 8,715 180,657 0.1%
Humour and satire 1,560 569,944 5,614,440 3.5%
Total 11,246 15,159,499 160,759,198 100%

Non-fiction

Domain Number of texts Number of sentences Number of tokens Percentage of tokens
Day-to-day life 6,802 3,214,744 33,710,023 14.3%
Official and business 3,660 353,924 5,375,466 2.3%
Technical 1,211 116,853 1,639,468 0.7%
Journalism 98,350 9,936,754 140,263,052 59.4%
Advertising 2,153 76,326 844,064 0.4%
Academic 8,369 2,565,262 44,186,654 18.7%
Fiction 57 124,643 1,257,854 0.5%
Theological 1,218 332,874 5,290,038 2.2%
Electronic communication 877 336,484 3,382,171 1.4%
Total 122,697 17,057,864 235,948,790 100%

Text topic Number of texts Number of sentences Number of tokens Percentage of tokens
Administration and management 17,487 1,430,212 17,800,937 4.5%
Anthropology 10 15,284 313,667 0.1%
Army and armed conflict 12,778 1,244,133 15,577,913 4.0%
Archaeology 21 2,021 29,228 0.0%
Astrology, parapsychology, esoterica 432 99,808 1,035,462 0.3%
Astronomy 449 41,100 648,036 0.2%
Business, commerce, economics, finance 12,348 741,269 10,337,210 2.6%
Biology 1,257 297,158 4,732,307 1.2%
Military affairs 13 11,495 244,685 0.1%
Geography 470 216,106 3,711,174 0.9%
Geodesy 1 746 15,342 0.0%
Geology 631 128,256 1,872,562 0.5%
Mining industry 393 25,102 419,263 0.1%
Home and home economy 1,342 130,787 1,925,568 0.5%
Leisure and entertainment 5,878 457,482 4,844,745 1.2%
Natural science 679 192,826 2,293,044 0.6%
Natural history 30 13,084 210,379 0.1%
Health and medicine 6,098 498,185 6,683,894 1.7%
IT 691 82,659 1,318,016 0.3%
Art and culture 18,724 3,473,674 41,890,867 10.6%
Art history 122 36,553 570,920 0.1%
History 5,373 1,758,270 27,607,760 7.0%
Crime 10,701 367,033 3,927,182 1.0%
Culturology 732 193,298 3,297,462 0.8%
Light industry, food industry 329 23,991 371,260 0.1%
Forestry 94 9,430 146,011 0.0%
Logic 1 3,478 51,815 0.0%
Mathematics 222 41,315 608,028 0.2%
Machinery 25 1,987 30,883 0.0%
Metallurgy 21 2,078 32,288 0.0%
Science and technology 11,860 2,427,720 40,645,938 10.3%
Education 4,146 656,607 7,610,142 1.9%
Politics and society 34,650 4,016,535 54,470,879 13.8%
Political science 18 7,009 117,321 0.0%
Law 3,704 301,081 4,697,751 1.2%
Nature 4,621 538,074 6,278,423 1.6%
Industry 5,093 348,168 4,415,326 1.1%
Accidents 237 9,552 97,367 0.0%
Psychology 712 176,527 2,811,205 0.7%
Travel 2,337 935,298 12,757,067 3.2%
Religion 7,016 1,055,934 14,963,391 3.8%
Agriculture 2,186 238,562 3,204,090 0.8%
Sociology 513 148,612 2,440,354 0.6%
Sport 4,206 288,331 3,697,728 0.9%
Statistics 374 18,439 286,021 0.1%
Construction, architecture 2,258 161,675 2,097,445 0.5%
Technology 8,279 585,399 7,514,823 1.9%
Transport 4,994 220,232 2,441,053 0.6%
Physics 1,359 126,230 1,930,683 0.5%
Philology 1,097 385,822 6,524,230 1.7%
Philosophy 893 502,349 8,862,166 2.3%
Chemical industry 108 8,028 114,948 0.0%
Chemistry 1,168 139,780 2,067,993 0.5%
Private life 21,568 4,569,054 50,039,236 12.7%
Electronics 748 45,599 670,144 0.2%
Energy industry 177 18,067 277,405 0.1%
Ethnography 7 8,098 164,164 0.0%
Total 221,681 29,475,602 393,745,201 100%

Dates

Texts within the main corpus by date of creation

Date Number of texts Number of sentences Number of tokens Percentage of tokens
1651 - 1700 5 16,633 332,648 0.1%
1701 - 1750 382 64,591 1,255,872 0.3%
1751 - 1800 1,931 338,302 6,631,307 1.7%
1801 - 1850 3,323 1,195,349 18,670,582 4.7%
1851 - 1900 4,976 4,587,242 64,977,272 16.2%
1901 - 1950 57,915 8,883,269 103,840,824 26.0%
1951 - 2000 21,998 9,984,063 111,278,555 27.8%
2001 - 2050 43,485 7,395,050 93,108,395 23.3%
Total 134,015 32,464,499 400,095,455 100%

Parts of speech

Tokens by part of speech (disambiguated corpus only)

Part of speech Number of tokens Percentage of tokens
Noun 1,718,410 28.7%
Adjective 510,957 8.5%
Numeral 96,851 1.6%
of which written as words 43,034 0.7%
of which written as digits 53,817 0.9%
Numeral adjective 24,589 0.4%
Verb 1,013,248 16.9%
Adverb 253,573 4.2%
Predicative 42,762 0.7%
Parenthesis 26,721 0.4%
Pronoun 471,700 7.9%
Adjectival pronoun 280,716 4.7%
Adverbial pronoun 130,434 2.2%
Predicative pronoun (некого, нечего) 678 0.0%
Preposition 626,906 10.5%
Conjunction 475,769 7.9%
Particle 266,675 4.5%
Interjection 8,628 0.1%
Initial 10,002 0.2%
Other (foreign words, onomatopoeia) 29,536 0.5%
Total 5,988,155 100%
