Corpora statistics

Number of texts

Texts by subcorpora

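Rows whose label starts with a hyphen are subdivisions of the nearest unmarked row above; their counts are already included in that row.

| Corpus | Number of texts | Number of sentences | Number of tokens | Percentage of tokens |
|---|---|---|---|---|
| Main | 131,488 | 30,499,014 | 374,449,975 | 17.9% |
| - including manually disambiguated | 2,170 | 519,726 | 6,106,069 | 0.3% |
| Media | 2,720,222 | 54,400,873 | 790,058,820 | 37.9% |
| - National media | 2,660,026 | 52,482,878 | 765,546,444 | 36.7% |
| - Regional & international | 60,196 | 1,917,995 | 24,512,376 | 1.2% |
| SynTagRus | 1,304 | 107,129 | 1,529,501 | 0.1% |
| Social networks | 1,753,274 | 14,197,458 | 157,921,573 | 7.6% |
| Spoken | 4,330 | 1,962,889 | 13,963,131 | 0.7% |
| Accentological | 1,335,740 | 13,393,954 | 134,239,098 | 6.4% |
| Multimedia | 1,364 | 1,009,463 | 5,696,026 | 0.3% |
| MultiPARC | 51 | 79,672 | 458,531 | 0.0% |
| - Russian | 21 | 36,567 | 229,200 | 0.0% |
| - English-Russian | 30 | 43,105 | 229,331 | 0.0% |
| Parallel | 6,722 | 13,100,414 | 173,288,365 | 8.3% |
| - English | 1,189 | 3,056,510 | 44,477,958 | 2.1% |
| - Armenian | 28 | 126,636 | 1,570,738 | 0.1% |
| - Bashkir | 124 | 124,270 | 550,387 | 0.0% |
| - Belarusian | 312 | 1,162,868 | 10,916,697 | 0.5% |
| - Bulgarian | 59 | 418,988 | 5,159,914 | 0.2% |
| - Buryat | 7 | 30,750 | 401,516 | 0.0% |
| - Spanish | 139 | 359,218 | 5,385,735 | 0.3% |
| - Italian | 126 | 302,264 | 4,930,991 | 0.2% |
| - Chinese | 1,075 | 253,500 | 4,422,747 | 0.2% |
| - Korean | 185 | 12,300 | 73,752 | 0.0% |
| - Latvian | 245 | 410,442 | 4,400,017 | 0.2% |
| - Lithuanian | 65 | 72,244 | 702,448 | 0.0% |
| - German | 287 | 2,118,530 | 30,544,451 | 1.5% |
| - Polish | 54 | 501,800 | 6,355,630 | 0.3% |
| - Portuguese | 28 | 25,136 | 566,675 | 0.0% |
| - Romanian | 31 | 60,140 | 903,377 | 0.0% |
| - Serbian | 37 | 144,027 | 1,903,178 | 0.1% |
| - Slovene | 53 | 173,172 | 1,989,749 | 0.1% |
| - Ukrainian | 865 | 919,426 | 9,383,774 | 0.4% |
| - Finnish | 320 | 299,184 | 3,741,431 | 0.2% |
| - French | 61 | 462,184 | 7,123,534 | 0.3% |
| - Hindi | 9 | 9,292 | 123,176 | 0.0% |
| - Czech | 529 | 301,344 | 3,947,051 | 0.2% |
| - Swedish | 787 | 1,344,054 | 16,520,159 | 0.8% |
| - Estonian | 95 | 192,493 | 2,158,315 | 0.1% |
| - Multilingual | 12 | 219,642 | 5,034,965 | 0.2% |
| Dialect | 2,014 | | 599,258 | 0.0% |
| Educational | 229 | 66,312 | 664,808 | 0.0% |
| From 2 to 15 | 75 | 431,193 | 4,419,420 | 0.2% |
| Poetry | 96,702 | 1,208,251 | 13,404,836 | 0.6% |
| Russian classics β | 25,882 | 1,513,674 | 17,549,885 | 0.8% |
| Historical | 10,406 | 798,920 | 14,899,622 | 0.7% |
| - Old East Slavic | 215 | | 781,322 | 0.0% |
| - Birchbark letters | 1,212 | | 23,323 | 0.0% |
| - Middle Russian | 7,560 | 383,769 | 8,843,355 | 0.4% |
| - Church Slavonic | 1,419 | 415,151 | 5,251,622 | 0.3% |
| Panchronic | 140,326 | 30,944,770 | 383,815,697 | 18.4% |
| Total | 6,230,129 | 163,713,986 | 2,086,958,546 | 100% |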
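
The "Percentage of tokens" column appears to be each row's token count divided by the token count in the Total row, rounded to one decimal place. The following is a minimal sketch of that arithmetic for three rows, not the corpus's own tooling; the names `TOTAL_TOKENS`, `tokens`, `media_parts` and `token_share` are hypothetical, and the figures are copied from the table above.

```python
# Token counts copied from the table above (three top-level rows only).
TOTAL_TOKENS = 2_086_958_546

tokens = {
    "Main": 374_449_975,
    "Media": 790_058_820,
    "Parallel": 173_288_365,
}


def token_share(count: int, total: int = TOTAL_TOKENS) -> float:
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


for name, count in tokens.items():
    print(f"{name}: {token_share(count)}%")

# Subdivisions roll up into their parent row: the two Media subdivisions
# (National media, Regional & international) add up to the Media row.
media_parts = {
    "National media": 765_546_444,
    "Regional & international": 24_512_376,
}
assert sum(media_parts.values()) == tokens["Media"]
```

Because subdivisions are already counted in their parent row, summing every row of the table would double-count; only the top-level rows add up to the Total row.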

Text types

Texts within the main corpus by type and other meta features

| Text type | Number of texts | Number of sentences | Number of tokens | Percentage of tokens |
|---|---|---|---|---|
| Non-fiction | 120,563 | 16,724,139 | 223,140,501 | 59.5% |
| Fiction | 10,967 | 14,998,103 | 151,878,171 | 40.5% |
| Total | 131,530 | 31,722,242 | 375,018,672 | 100% |

Fiction

| Genre | Number of texts | Number of sentences | Number of tokens | Percentage of tokens |
|---|---|---|---|---|
| Crime | 139 | 860,979 | 7,722,512 | 5.0% |
| Children's literature | 848 | 764,249 | 6,635,985 | 4.3% |
| Nonfiction | 462 | 1,083,362 | 12,453,680 | 8.1% |
| Drama | 307 | 617,011 | 3,155,297 | 2.0% |
| Historical prose | 282 | 1,319,374 | 14,437,611 | 9.4% |
| Love story | 55 | 169,273 | 1,542,336 | 1.0% |
| No genre | 6,094 | 8,191,709 | 86,466,150 | 56.1% |
| Translation | 16 | 13,415 | 185,172 | 0.1% |
| Adventure | 280 | 570,595 | 5,828,821 | 3.8% |
| Miscellaneous | 80 | 27,709 | 351,910 | 0.2% |
| Sentimental fiction | 30 | 10,867 | 167,334 | 0.1% |
| Sci-fi | 733 | 999,272 | 9,724,283 | 6.3% |
| Humour and satire | 1,569 | 604,686 | 5,585,078 | 3.6% |
| Total | 10,895 | 15,232,501 | 154,256,169 | 100% |

Non-fiction

| Domain | Number of texts | Number of sentences | Number of tokens | Percentage of tokens |
|---|---|---|---|---|
| Day-to-day life | 6,308 | 3,270,147 | 33,360,180 | 14.7% |
| Official and business | 3,534 | 332,593 | 5,102,771 | 2.3% |
| Technical | 1,210 | 120,232 | 1,621,124 | 0.7% |
| Journalism | 97,896 | 9,983,031 | 136,953,541 | 60.5% |
| Advertising | 2,153 | 84,875 | 853,096 | 0.4% |
| Academic | 7,759 | 2,442,550 | 39,858,190 | 17.6% |
| Theological | 1,219 | 373,689 | 5,298,688 | 2.3% |
| Electronic communication | 888 | 352,547 | 3,474,946 | 1.5% |
| Total | 120,967 | 16,959,664 | 226,522,536 | 100% |

| Text topic | Number of texts | Number of sentences | Number of tokens | Percentage of tokens |
|---|---|---|---|---|
| Administration and management | 17,352 | 1,432,831 | 17,313,749 | 4.6% |
| Army and armed conflict | 12,702 | 1,271,972 | 15,623,927 | 4.2% |
| Archaeology | 21 | 2,034 | 29,368 | 0.0% |
| Astrology, parapsychology, esoterica | 432 | 101,154 | 1,035,185 | 0.3% |
| Astronomy | 449 | 41,821 | 649,728 | 0.2% |
| Business, commerce, economics, finance | 12,339 | 772,790 | 10,336,228 | 2.8% |
| Biology | 1,202 | 225,567 | 3,512,071 | 0.9% |
| Military affairs | 12 | 12,459 | 235,429 | 0.1% |
| Geography | 461 | 219,107 | 3,661,940 | 1.0% |
| Geodesy | 1 | 613 | 15,250 | 0.0% |
| Geology | 631 | 132,920 | 1,876,853 | 0.5% |
| Mining industry | 393 | 27,414 | 422,038 | 0.1% |
| Home and home economy | 1,325 | 92,597 | 1,122,868 | 0.3% |
| Leisure and entertainment | 5,878 | 479,007 | 4,835,165 | 1.3% |
| Natural science | 685 | 203,688 | 2,357,321 | 0.6% |
| Natural history | 30 | 13,663 | 209,619 | 0.1% |
| Health and medicine | 6,114 | 532,349 | 6,607,054 | 1.8% |
| IT | 665 | 85,394 | 1,295,556 | 0.3% |
| Art and culture | 18,094 | 3,370,208 | 39,702,197 | 10.6% |
| Art history | 122 | 37,886 | 572,173 | 0.2% |
| History | 5,236 | 1,792,984 | 27,041,126 | 7.2% |
| Crime | 10,700 | 376,771 | 3,899,899 | 1.0% |
| Culturology | 355 | 128,026 | 2,054,761 | 0.5% |
| Light industry, food industry | 329 | 24,575 | 372,466 | 0.1% |
| Forestry | 94 | 9,848 | 146,354 | 0.0% |
| Logic | 1 | 3,464 | 51,840 | 0.0% |
| Mathematics | 218 | 43,041 | 610,062 | 0.2% |
| Machinery | 25 | 2,026 | 30,965 | 0.0% |
| Metallurgy | 21 | 2,098 | 32,409 | 0.0% |
| Science and technology | 11,190 | 2,278,086 | 36,021,595 | 9.6% |
| Education | 4,126 | 671,779 | 7,440,823 | 2.0% |
| Politics and society | 34,380 | 4,056,351 | 53,209,214 | 14.2% |
| Political science | 18 | 7,301 | 117,753 | 0.0% |
| Law | 3,701 | 311,517 | 4,689,965 | 1.2% |
| Nature | 4,582 | 490,472 | 5,609,959 | 1.5% |
| Industry | 5,093 | 354,633 | 4,366,983 | 1.2% |
| Accidents | 230 | 9,838 | 93,509 | 0.0% |
| Psychology | 706 | 170,791 | 2,635,726 | 0.7% |
| Travel | 2,330 | 967,866 | 12,788,062 | 3.4% |
| Religion | 6,972 | 1,114,064 | 14,750,077 | 3.9% |
| Agriculture | 2,150 | 196,473 | 2,310,054 | 0.6% |
| Sociology | 485 | 125,290 | 1,976,834 | 0.5% |
| Sport | 4,200 | 292,914 | 3,564,947 | 0.9% |
| Statistics | 368 | 15,999 | 230,886 | 0.1% |
| Construction, architecture | 2,237 | 175,700 | 2,078,298 | 0.6% |
| Technology | 8,254 | 598,351 | 7,527,780 | 2.0% |
| Transport | 4,984 | 218,267 | 2,348,824 | 0.6% |
| Physics | 1,338 | 125,083 | 1,871,882 | 0.5% |
| Philology | 976 | 361,234 | 5,710,387 | 1.5% |
| Philosophy | 880 | 541,937 | 9,239,636 | 2.5% |
| Chemical industry | 108 | 8,070 | 115,041 | 0.0% |
| Chemistry | 1,162 | 126,142 | 1,703,288 | 0.5% |
| Private life | 21,142 | 4,582,749 | 48,619,037 | 12.9% |
| Electronics | 748 | 47,504 | 671,773 | 0.2% |
| Energy industry | 177 | 18,409 | 278,095 | 0.1% |
| Total | 218,424 | 29,305,127 | 375,624,029 | 100% |

Dates

Texts within the main corpus by creation date

| Date | Number of texts | Number of sentences | Number of tokens | Percentage of tokens |
|---|---|---|---|---|
| 1701 - 1750 | 379 | 63,020 | 1,209,204 | 0.3% |
| 1751 - 1800 | 2,256 | 340,784 | 6,302,572 | 1.5% |
| 1801 - 1850 | 3,409 | 1,416,154 | 21,234,884 | 4.9% |
| 1851 - 1900 | 5,202 | 5,341,224 | 71,332,917 | 16.4% |
| 1901 - 1950 | 58,156 | 10,301,226 | 116,617,408 | 26.9% |
| 1951 - 2000 | 23,004 | 11,615,619 | 125,667,914 | 28.9% |
| 2001 - 2022 | 42,991 | 7,603,444 | 91,884,926 | 21.2% |
| Total | 135,397 | 36,681,471 | 434,249,825 | 100% |

Parts of speech

Tokens by part of speech (disambiguated corpus only)

| Part of speech | Number of tokens | Percentage of tokens |
|---|---|---|
| Noun | 1,722,425 | 28.7% |
| Adjective | 511,009 | 8.5% |
| Numeral | 102,793 | 1.7% |
| of these, written out in words | 43,001 | 0.7% |
| of these, written in digits | 59,792 | 1.0% |
| Numeral adjective | 24,628 | 0.4% |
| Verb | 1,014,087 | 16.9% |
| Adverb | 254,085 | 4.2% |
| Predicative | 42,806 | 0.7% |
| Parenthesis | 26,766 | 0.4% |
| Pronoun | 471,979 | 7.9% |
| Adjectival pronoun | 280,989 | 4.7% |
| Adverbial pronoun | 130,447 | 2.2% |
| Predicative pronoun (некого, нечего) | 678 | 0.0% |
| Preposition | 627,529 | 10.5% |
| Conjunction | 476,107 | 7.9% |
| Particle | 266,854 | 4.4% |
| Interjection | 8,665 | 0.1% |
| Initial | 10,128 | 0.2% |
| Other (foreign words, onomatopoeia) | 31,409 | 0.5% |
| Total | 6,003,384 | 100% |
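
The part-of-speech counts can be cross-checked against the Total row. The sketch below is only an illustration with hypothetical names (`pos_tokens`, `numeral_breakdown`, `largest_classes`); the figures are copied from the table above. It assumes that the two "of these" rows are a breakdown of Numeral rather than separate classes, which the arithmetic supports.

```python
# Token counts per part of speech, copied from the table above.
# The two "of these" rows are kept separately because they subdivide Numeral.
pos_tokens = {
    "Noun": 1_722_425,
    "Adjective": 511_009,
    "Numeral": 102_793,
    "Numeral adjective": 24_628,
    "Verb": 1_014_087,
    "Adverb": 254_085,
    "Predicative": 42_806,
    "Parenthesis": 26_766,
    "Pronoun": 471_979,
    "Adjectival pronoun": 280_989,
    "Adverbial pronoun": 130_447,
    "Predicative pronoun": 678,
    "Preposition": 627_529,
    "Conjunction": 476_107,
    "Particle": 266_854,
    "Interjection": 8_665,
    "Initial": 10_128,
    "Other": 31_409,
}
numeral_breakdown = {"in words": 43_001, "in digits": 59_792}
TOTAL = 6_003_384


def largest_classes(counts: dict[str, int], n: int) -> list[str]:
    """Return the n labels with the highest token counts, largest first."""
    return sorted(counts, key=counts.get, reverse=True)[:n]


# The top-level classes account for every token in the Total row,
# and the numeral breakdown adds up to the Numeral row.
assert sum(pos_tokens.values()) == TOTAL
assert sum(numeral_breakdown.values()) == pos_tokens["Numeral"]

print(largest_classes(pos_tokens, 3))
```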
