Number of texts
Texts by subcorpora
Corpus | Number of texts | Number of sentences | Number of tokens | Percentage of tokens |
---|---|---|---|---|
Main | 131,488 | 30,499,014 | 374,449,975 | 17.0% |
including manually disambiguated | 2,170 | 519,726 | 6,106,069 | 0.3% |
Media | 2,838,953 | 57,862,754 | 850,630,557 | 38.6% |
National media | 2,728,688 | 55,215,073 | 815,141,029 | 37.0% |
Regional & international | 110,265 | 2,647,681 | 35,489,528 | 1.6% |
SynTagRus | 1,304 | 109,886 | 1,568,027 | 0.1% |
Social networks | 1,768,134 | 14,443,641 | 161,432,452 | 7.3% |
Spoken | 4,514 | 2,018,401 | 14,554,052 | 0.7% |
Accentological | 1,340,746 | 13,534,835 | 135,550,981 | 6.1% |
Multimedia | 1,383 | 1,019,768 | 5,763,881 | 0.3% |
MultiPARC | 51 | 79,672 | 458,531 | 0.0% |
Russian | 21 | 36,567 | 229,200 | 0.0% |
English-Russian | 30 | 43,105 | 229,331 | 0.0% |
Parallel | 13,649 | 16,412,732 | 210,729,332 | 9.6% |
English | 1,438 | 3,386,548 | 50,422,889 | 2.3% |
Armenian | 28 | 126,636 | 1,570,735 | 0.1% |
Bashkir | 124 | 124,270 | 550,387 | 0.0% |
Belarusian | 312 | 1,162,868 | 10,916,697 | 0.5% |
Bulgarian | 59 | 418,986 | 5,159,901 | 0.2% |
Buryat | 7 | 30,750 | 401,516 | 0.0% |
Veps | 989 | 40,780 | 343,133 | 0.0% |
Spanish | 150 | 416,474 | 6,148,157 | 0.3% |
Italian | 126 | 302,264 | 4,930,970 | 0.2% |
Karelian | 2,355 | 125,702 | 1,223,760 | 0.1% |
Chinese | 1,075 | 253,500 | 4,422,747 | 0.2% |
Korean | 185 | 12,300 | 73,752 | 0.0% |
Latvian | 245 | 410,438 | 4,398,564 | 0.2% |
Lithuanian | 65 | 72,244 | 702,471 | 0.0% |
German | 294 | 2,194,004 | 31,742,105 | 1.4% |
Polish | 54 | 501,800 | 6,355,629 | 0.3% |
Portuguese | 38 | 88,572 | 1,602,412 | 0.1% |
Romanian | 31 | 60,140 | 903,375 | 0.0% |
Serbian | 37 | 144,027 | 1,903,176 | 0.1% |
Slovene | 53 | 173,172 | 1,989,641 | 0.1% |
Ukrainian | 865 | 919,426 | 9,383,774 | 0.4% |
Finnish | 320 | 299,184 | 3,741,431 | 0.2% |
French | 67 | 498,180 | 7,631,430 | 0.3% |
Khakas | 331 | 126,710 | 1,194,971 | 0.1% |
Hindi β | 9 | 9,292 | 122,347 | 0.0% |
Romani β | 19 | 16,254 | 170,559 | 0.0% |
Czech | 556 | 334,562 | 4,387,470 | 0.2% |
Chuvash | 2,820 | 2,375,948 | 24,168,622 | 1.1% |
Swedish | 787 | 1,344,054 | 16,520,152 | 0.7% |
Estonian | 95 | 192,493 | 2,158,315 | 0.1% |
Japanese | 103 | 31,512 | 453,279 | 0.0% |
Multilingual | 12 | 219,642 | 5,034,965 | 0.2% |
Dialect | 2,014 | 125,156 | 599,258 | 0.0% |
Educational | 1,247 | 1,184,926 | 13,761,608 | 0.6% |
From 2 to 15 | 75 | 413,781 | 4,408,536 | 0.2% |
Poetry | 101,521 | 1,336,822 | 13,879,514 | 0.6% |
Russian classics β | 27,289 | 1,544,467 | 18,556,005 | 0.8% |
Historical | 11,910 | 833,227 | 15,427,082 | 0.7% |
Old East Slavic | 337 | — | 881,706 | 0.0% |
Inscriptions | 663 | — | 5,228 | 0.0% |
Birchbark letters | 1,249 | 1,249 | 23,932 | 0.0% |
Middle Russian | 8,242 | 399,642 | 9,251,633 | 0.4% |
Church Slavonic | 1,419 | 432,336 | 5,264,583 | 0.2% |
Panchronic | 141,035 | 30,890,027 | 384,096,728 | 17.4% |
Total | 6,385,313 | 172,309,109 | 2,205,866,519 | 100% |
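The "Percentage of tokens" column appears to be each row's token count divided by the token count in the "Total" row, rounded to one decimal place. A minimal sketch of that arithmetic, using figures from the table above (the function name `token_share` is illustrative and not part of any corpus tooling):

```python
def token_share(tokens: int, total: int) -> str:
    """Return `tokens` as a percentage of `total`, rounded to one decimal."""
    return f"{100 * tokens / total:.1f}%"


# Token count from the "Total" row of the subcorpora table.
TOTAL_TOKENS = 2_205_866_519

print(token_share(374_449_975, TOTAL_TOKENS))  # Main  -> 17.0%
print(token_share(850_630_557, TOTAL_TOKENS))  # Media -> 38.6%
```

Very small subcorpora round down to 0.0% under this scheme, which matches rows such as MultiPARC.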
Text types
Texts within the main corpus by type and other metadata features
Text type | Number of texts | Number of sentences | Number of tokens | Percentage of tokens |
---|---|---|---|---|
Non-fiction | 120,563 | 16,724,139 | 223,140,501 | 59.5% |
Fiction | 10,967 | 14,998,103 | 151,878,171 | 40.5% |
Total | 131,530 | 31,722,242 | 375,018,672 | 100% |
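A quick way to sanity-check a table like this one is to parse the pipe-delimited rows and compare the column sums with the "Total" row. The sketch below assumes the row layout shown above (label, three comma-grouped integers, then a percentage); `parse_row` is a hypothetical helper written for this illustration.

```python
def parse_row(line: str) -> tuple[str, list[int]]:
    """Split a pipe-delimited row into its label and three integer counts."""
    cells = [c.strip() for c in line.strip().strip("|").split("|")]
    label, numbers = cells[0], cells[1:4]  # skip the trailing percentage cell
    return label, [int(n.replace(",", "")) for n in numbers]


rows = [
    "Non-fiction | 120,563 | 16,724,139 | 223,140,501 | 59.5% |",
    "Fiction | 10,967 | 14,998,103 | 151,878,171 | 40.5% |",
]
total_row = "Total | 131,530 | 31,722,242 | 375,018,672 | 100% |"

# Sum texts, sentences, and tokens column by column.
sums = [sum(col) for col in zip(*(parse_row(r)[1] for r in rows))]
_, totals = parse_row(total_row)

print(sums == totals)  # True for the text-type table above
```

The same check can be pointed at the other tables, though a mismatch there need not be an error: a text may carry several labels, or none.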
Fiction
Genre | Number of texts | Number of sentences | Number of tokens | Percentage of tokens |
---|---|---|---|---|
Crime | 139 | 860,979 | 7,722,512 | 5.0% |
Children's literature | 848 | 764,249 | 6,635,985 | 4.3% |
Nonfiction | 462 | 1,083,362 | 12,453,680 | 8.1% |
Drama | 307 | 617,011 | 3,155,297 | 2.0% |
Historical prose | 282 | 1,319,374 | 14,437,611 | 9.4% |
Love story | 55 | 169,273 | 1,542,336 | 1.0% |
No genre | 6,094 | 8,191,709 | 86,466,150 | 56.1% |
Transliteration | 16 | 13,415 | 185,172 | 0.1% |
Adventure | 280 | 570,595 | 5,828,821 | 3.8% |
Miscellaneous | 80 | 27,709 | 351,910 | 0.2% |
Sentimental fiction | 30 | 10,867 | 167,334 | 0.1% |
Sci-fi | 733 | 999,272 | 9,724,283 | 6.3% |
Humour and satire | 1,569 | 604,686 | 5,585,078 | 3.6% |
Total | 10,895 | 15,232,501 | 154,256,169 | 100% |
Non-fiction
Domain | Number of texts | Number of sentences | Number of tokens | Percentage of tokens |
---|---|---|---|---|
Day-to-day life | 6,308 | 3,270,147 | 33,360,180 | 14.7% |
Official and business | 3,534 | 332,593 | 5,102,771 | 2.3% |
Technical | 1,210 | 120,232 | 1,621,124 | 0.7% |
Journalism | 97,896 | 9,983,031 | 136,953,541 | 60.5% |
Advertising | 2,153 | 84,875 | 853,096 | 0.4% |
Academic | 7,759 | 2,442,550 | 39,858,190 | 17.6% |
Theological | 1,219 | 373,689 | 5,298,688 | 2.3% |
Electronic communication | 888 | 352,547 | 3,474,946 | 1.5% |
Total | 120,967 | 16,959,664 | 226,522,536 | 100% |
Text topic | Number of texts | Number of sentences | Number of tokens | Percentage of tokens |
---|---|---|---|---|
Administration and management | 17,352 | 1,432,831 | 17,313,749 | 4.6% |
Army and armed conflict | 12,702 | 1,271,972 | 15,623,927 | 4.2% |
Archaeology | 21 | 2,034 | 29,368 | 0.0% |
Astrology, parapsychology, esoterica | 432 | 101,154 | 1,035,185 | 0.3% |
Astronomy | 449 | 41,821 | 649,728 | 0.2% |
Business, commerce, economics, finance | 12,339 | 772,790 | 10,336,228 | 2.8% |
Biology | 1,202 | 225,567 | 3,512,071 | 0.9% |
Military affairs | 12 | 12,459 | 235,429 | 0.1% |
Geography | 461 | 219,107 | 3,661,940 | 1.0% |
Geodesy | 1 | 613 | 15,250 | 0.0% |
Geology | 631 | 132,920 | 1,876,853 | 0.5% |
Mining industry | 393 | 27,414 | 422,038 | 0.1% |
Home and home economy | 1,325 | 92,597 | 1,122,868 | 0.3% |
Leisure and entertainment | 5,878 | 479,007 | 4,835,165 | 1.3% |
Natural science | 685 | 203,688 | 2,357,321 | 0.6% |
Natural history | 30 | 13,663 | 209,619 | 0.1% |
Health and medicine | 6,114 | 532,349 | 6,607,054 | 1.8% |
IT | 665 | 85,394 | 1,295,556 | 0.3% |
Art and culture | 18,094 | 3,370,208 | 39,702,197 | 10.6% |
Art history | 122 | 37,886 | 572,173 | 0.2% |
History | 5,236 | 1,792,984 | 27,041,126 | 7.2% |
Crime | 10,700 | 376,771 | 3,899,899 | 1.0% |
Culturology | 355 | 128,026 | 2,054,761 | 0.5% |
Light industry, food industry | 329 | 24,575 | 372,466 | 0.1% |
Forestry | 94 | 9,848 | 146,354 | 0.0% |
Logic | 1 | 3,464 | 51,840 | 0.0% |
Mathematics | 218 | 43,041 | 610,062 | 0.2% |
Machinery | 25 | 2,026 | 30,965 | 0.0% |
Metallurgy | 21 | 2,098 | 32,409 | 0.0% |
Science and technology | 11,190 | 2,278,086 | 36,021,595 | 9.6% |
Education | 4,126 | 671,779 | 7,440,823 | 2.0% |
Politics and society | 34,380 | 4,056,351 | 53,209,214 | 14.2% |
Political science | 18 | 7,301 | 117,753 | 0.0% |
Law | 3,701 | 311,517 | 4,689,965 | 1.2% |
Nature | 4,582 | 490,472 | 5,609,959 | 1.5% |
Industry | 5,093 | 354,633 | 4,366,983 | 1.2% |
Accidents | 230 | 9,838 | 93,509 | 0.0% |
Psychology | 706 | 170,791 | 2,635,726 | 0.7% |
Travel | 2,330 | 967,866 | 12,788,062 | 3.4% |
Religion | 6,972 | 1,114,064 | 14,750,077 | 3.9% |
Agriculture | 2,150 | 196,473 | 2,310,054 | 0.6% |
Sociology | 485 | 125,290 | 1,976,834 | 0.5% |
Sport | 4,200 | 292,914 | 3,564,947 | 0.9% |
Statistics | 368 | 15,999 | 230,886 | 0.1% |
Construction, architecture | 2,237 | 175,700 | 2,078,298 | 0.6% |
Technology | 8,254 | 598,351 | 7,527,780 | 2.0% |
Transport | 4,984 | 218,267 | 2,348,824 | 0.6% |
Physics | 1,338 | 125,083 | 1,871,882 | 0.5% |
Philology | 976 | 361,234 | 5,710,387 | 1.5% |
Philosophy | 880 | 541,937 | 9,239,636 | 2.5% |
Chemical industry | 108 | 8,070 | 115,041 | 0.0% |
Chemistry | 1,162 | 126,142 | 1,703,288 | 0.5% |
Private life | 21,142 | 4,582,749 | 48,619,037 | 12.9% |
Electronics | 748 | 47,504 | 671,773 | 0.2% |
Energy industry | 177 | 18,409 | 278,095 | 0.1% |
Total | 218,424 | 29,305,127 | 375,624,029 | 100% |
Dates
Texts within the main corpus by date of creation
Date | Number of texts | Number of sentences | Number of tokens | Percentage of tokens |
---|---|---|---|---|
1701 - 1750 | 379 | 63,020 | 1,209,204 | 0.3% |
1751 - 1800 | 2,256 | 340,784 | 6,302,572 | 1.5% |
1801 - 1850 | 3,409 | 1,416,154 | 21,234,884 | 4.9% |
1851 - 1900 | 5,202 | 5,341,224 | 71,332,917 | 16.4% |
1901 - 1950 | 58,156 | 10,301,226 | 116,617,408 | 26.9% |
1951 - 2000 | 23,004 | 11,615,619 | 125,667,914 | 28.9% |
2001 - 2022 | 42,991 | 7,603,444 | 91,884,926 | 21.2% |
Total | 135,397 | 36,681,471 | 434,249,825 | 100% |
Parts of speech
Tokens by part of speech (disambiguated corpus only)
Part of speech | Number of tokens | Percentage of tokens |
---|---|---|
Noun | 1,722,425 | 28.7% |
Adjective | 511,009 | 8.5% |
Numeral | 102,793 | 1.7% |
of these, recorded in writing | 43,001 | 0.7% |
of these, recorded in numbers | 59,792 | 1.0% |
Numeral adjective | 24,628 | 0.4% |
Verb | 1,014,087 | 16.9% |
Adverb | 254,085 | 4.2% |
Predicative | 42,806 | 0.7% |
Parenthesis | 26,766 | 0.4% |
Pronoun | 471,979 | 7.9% |
Adjectival pronoun | 280,989 | 4.7% |
Adverbial pronoun | 130,447 | 2.2% |
Predicative pronoun (некого, нечего) | 678 | 0.0% |
Preposition | 627,529 | 10.5% |
Conjunction | 476,107 | 7.9% |
Particle | 266,854 | 4.4% |
Interjection | 8,665 | 0.1% |
Initial | 10,128 | 0.2% |
Other (foreign words, onomatopoeia) | 31,409 | 0.5% |
Total | 6,003,384 | 100% |