Corpora statistics

Number of texts

Texts by subcorpora

Corpora | Number of texts | Number of sentences | Number of tokens | Percentage of tokens
Main | 131,488 | 30,499,014 | 374,449,975 | 17.0%
including manually disambiguated | 2,170 | 519,726 | 6,106,069 | 0.3%
Media | 2,838,953 | 57,862,754 | 850,630,557 | 38.6%
National media | 2,728,688 | 55,215,073 | 815,141,029 | 37.0%
Regional & international | 110,265 | 2,647,681 | 35,489,528 | 1.6%
SynTagRus | 1,304 | 109,886 | 1,568,027 | 0.1%
Social networks | 1,768,134 | 14,443,641 | 161,432,452 | 7.3%
Spoken | 4,514 | 2,018,401 | 14,554,052 | 0.7%
Accentological | 1,340,746 | 13,534,835 | 135,550,981 | 6.1%
Multimedia | 1,383 | 1,019,768 | 5,763,881 | 0.3%
MultiPARC | 51 | 79,672 | 458,531 | 0.0%
Russian | 21 | 36,567 | 229,200 | 0.0%
English-Russian | 30 | 43,105 | 229,331 | 0.0%
Parallel | 13,649 | 16,412,732 | 210,729,332 | 9.6%
English | 1,438 | 3,386,548 | 50,422,889 | 2.3%
Armenian | 28 | 126,636 | 1,570,735 | 0.1%
Bashkir | 124 | 124,270 | 550,387 | 0.0%
Belarusian | 312 | 1,162,868 | 10,916,697 | 0.5%
Bulgarian | 59 | 418,986 | 5,159,901 | 0.2%
Buryat | 7 | 30,750 | 401,516 | 0.0%
Veps | 989 | 40,780 | 343,133 | 0.0%
Spanish | 150 | 416,474 | 6,148,157 | 0.3%
Italian | 126 | 302,264 | 4,930,970 | 0.2%
Karelian | 2,355 | 125,702 | 1,223,760 | 0.1%
Chinese | 1,075 | 253,500 | 4,422,747 | 0.2%
Korean | 185 | 12,300 | 73,752 | 0.0%
Latvian | 245 | 410,438 | 4,398,564 | 0.2%
Lithuanian | 65 | 72,244 | 702,471 | 0.0%
German | 294 | 2,194,004 | 31,742,105 | 1.4%
Polish | 54 | 501,800 | 6,355,629 | 0.3%
Portuguese | 38 | 88,572 | 1,602,412 | 0.1%
Romanian | 31 | 60,140 | 903,375 | 0.0%
Serbian | 37 | 144,027 | 1,903,176 | 0.1%
Slovene | 53 | 173,172 | 1,989,641 | 0.1%
Ukrainian | 865 | 919,426 | 9,383,774 | 0.4%
Finnish | 320 | 299,184 | 3,741,431 | 0.2%
French | 67 | 498,180 | 7,631,430 | 0.3%
Khakas | 331 | 126,710 | 1,194,971 | 0.1%
Hindi β | 9 | 9,292 | 122,347 | 0.0%
Romani β | 19 | 16,254 | 170,559 | 0.0%
Czech | 556 | 334,562 | 4,387,470 | 0.2%
Chuvash | 2,820 | 2,375,948 | 24,168,622 | 1.1%
Swedish | 787 | 1,344,054 | 16,520,152 | 0.7%
Estonian | 95 | 192,493 | 2,158,315 | 0.1%
Japanese | 103 | 31,512 | 453,279 | 0.0%
Multilingual | 12 | 219,642 | 5,034,965 | 0.2%
Dialect | 2,014 | 125,156 | 599,258 | 0.0%
Educational | 1,247 | 1,184,926 | 13,761,608 | 0.6%
From 2 to 15 | 75 | 413,781 | 4,408,536 | 0.2%
Poetry | 101,521 | 1,336,822 | 13,879,514 | 0.6%
Russian classics β | 27,289 | 1,544,467 | 18,556,005 | 0.8%
Historical | 11,910 | 833,227 | 15,427,082 | 0.7%
Old East Slavic | 337 |  | 881,706 | 0.0%
Inscriptions | 663 |  | 5,228 | 0.0%
Birchbark letters | 1,249 | 1,249 | 23,932 | 0.0%
Middle Russian | 8,242 | 399,642 | 9,251,633 | 0.4%
Church Slavonic | 1,419 | 432,336 | 5,264,583 | 0.2%
Panchronic | 141,035 | 30,890,027 | 384,096,728 | 17.4%
Total | 6,385,313 | 172,309,109 | 2,205,866,519 | 100%
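
The "Percentage of tokens" column appears to be each row's token count taken as a share of the Total row and rounded to one decimal place. A minimal Python sketch of that arithmetic, using three token counts copied from the table above; the helper name `token_share` is illustrative and not part of any corpus tooling:

```python
# Token counts copied from the table above.
TOTAL_TOKENS = 2_205_866_519

token_counts = {
    "Main": 374_449_975,
    "Media": 850_630_557,
    "Parallel": 210_729_332,
}


def token_share(tokens: int, total: int = TOTAL_TOKENS) -> float:
    """Return tokens as a percentage of total, rounded to one decimal place."""
    return round(100 * tokens / total, 1)


# Recompute the percentage column for the sampled rows.
shares = {name: token_share(count) for name, count in token_counts.items()}
for name, share in shares.items():
    print(f"{name}: {share}%")
```

Some rows are subsets of others (for example, National media falls within Media), so the percentage column is not expected to sum to 100% across all rows.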

Text types

Texts within the main corpus by type and other meta features

Text type | Number of texts | Number of sentences | Number of tokens | Percentage of tokens
Non-fiction | 120,563 | 16,724,139 | 223,140,501 | 59.5%
Fiction | 10,967 | 14,998,103 | 151,878,171 | 40.5%
Total | 131,530 | 31,722,242 | 375,018,672 | 100%

Fiction

Genre | Number of texts | Number of sentences | Number of tokens | Percentage of tokens
Crime | 139 | 860,979 | 7,722,512 | 5.0%
Children's literature | 848 | 764,249 | 6,635,985 | 4.3%
Nonfiction | 462 | 1,083,362 | 12,453,680 | 8.1%
Drama | 307 | 617,011 | 3,155,297 | 2.0%
Historical prose | 282 | 1,319,374 | 14,437,611 | 9.4%
Love story | 55 | 169,273 | 1,542,336 | 1.0%
No genre | 6,094 | 8,191,709 | 86,466,150 | 56.1%
Transliteration | 16 | 13,415 | 185,172 | 0.1%
Adventure | 280 | 570,595 | 5,828,821 | 3.8%
Miscellaneous | 80 | 27,709 | 351,910 | 0.2%
Sentimental fiction | 30 | 10,867 | 167,334 | 0.1%
Sci-fi | 733 | 999,272 | 9,724,283 | 6.3%
Humour and satire | 1,569 | 604,686 | 5,585,078 | 3.6%
Total | 10,895 | 15,232,501 | 154,256,169 | 100%

Non-fiction

Domain | Number of texts | Number of sentences | Number of tokens | Percentage of tokens
Day-to-day life | 6,308 | 3,270,147 | 33,360,180 | 14.7%
Official and business | 3,534 | 332,593 | 5,102,771 | 2.3%
Technical | 1,210 | 120,232 | 1,621,124 | 0.7%
Journalism | 97,896 | 9,983,031 | 136,953,541 | 60.5%
Advertising | 2,153 | 84,875 | 853,096 | 0.4%
Academic | 7,759 | 2,442,550 | 39,858,190 | 17.6%
Theological | 1,219 | 373,689 | 5,298,688 | 2.3%
Electronic communication | 888 | 352,547 | 3,474,946 | 1.5%
Total | 120,967 | 16,959,664 | 226,522,536 | 100%

Text topic | Number of texts | Number of sentences | Number of tokens | Percentage of tokens
Administration and management | 17,352 | 1,432,831 | 17,313,749 | 4.6%
Army and armed conflict | 12,702 | 1,271,972 | 15,623,927 | 4.2%
Archaeology | 21 | 2,034 | 29,368 | 0.0%
Astrology, parapsychology, esoterica | 432 | 101,154 | 1,035,185 | 0.3%
Astronomy | 449 | 41,821 | 649,728 | 0.2%
Business, commerce, economics, finance | 12,339 | 772,790 | 10,336,228 | 2.8%
Biology | 1,202 | 225,567 | 3,512,071 | 0.9%
Military affairs | 12 | 12,459 | 235,429 | 0.1%
Geography | 461 | 219,107 | 3,661,940 | 1.0%
Geodesy | 1 | 613 | 15,250 | 0.0%
Geology | 631 | 132,920 | 1,876,853 | 0.5%
Mining industry | 393 | 27,414 | 422,038 | 0.1%
Home and home economy | 1,325 | 92,597 | 1,122,868 | 0.3%
Leisure and entertainment | 5,878 | 479,007 | 4,835,165 | 1.3%
Natural science | 685 | 203,688 | 2,357,321 | 0.6%
Natural history | 30 | 13,663 | 209,619 | 0.1%
Health and medicine | 6,114 | 532,349 | 6,607,054 | 1.8%
IT | 665 | 85,394 | 1,295,556 | 0.3%
Art and culture | 18,094 | 3,370,208 | 39,702,197 | 10.6%
Art history | 122 | 37,886 | 572,173 | 0.2%
History | 5,236 | 1,792,984 | 27,041,126 | 7.2%
Crime | 10,700 | 376,771 | 3,899,899 | 1.0%
Culturology | 355 | 128,026 | 2,054,761 | 0.5%
Light industry, food industry | 329 | 24,575 | 372,466 | 0.1%
Forestry | 94 | 9,848 | 146,354 | 0.0%
Logic | 1 | 3,464 | 51,840 | 0.0%
Mathematics | 218 | 43,041 | 610,062 | 0.2%
Machinery | 25 | 2,026 | 30,965 | 0.0%
Metallurgy | 21 | 2,098 | 32,409 | 0.0%
Science and technology | 11,190 | 2,278,086 | 36,021,595 | 9.6%
Education | 4,126 | 671,779 | 7,440,823 | 2.0%
Politics and society | 34,380 | 4,056,351 | 53,209,214 | 14.2%
Political science | 18 | 7,301 | 117,753 | 0.0%
Law | 3,701 | 311,517 | 4,689,965 | 1.2%
Nature | 4,582 | 490,472 | 5,609,959 | 1.5%
Industry | 5,093 | 354,633 | 4,366,983 | 1.2%
Accidents | 230 | 9,838 | 93,509 | 0.0%
Psychology | 706 | 170,791 | 2,635,726 | 0.7%
Travel | 2,330 | 967,866 | 12,788,062 | 3.4%
Religion | 6,972 | 1,114,064 | 14,750,077 | 3.9%
Agriculture | 2,150 | 196,473 | 2,310,054 | 0.6%
Sociology | 485 | 125,290 | 1,976,834 | 0.5%
Sport | 4,200 | 292,914 | 3,564,947 | 0.9%
Statistics | 368 | 15,999 | 230,886 | 0.1%
Construction, architecture | 2,237 | 175,700 | 2,078,298 | 0.6%
Technology | 8,254 | 598,351 | 7,527,780 | 2.0%
Transport | 4,984 | 218,267 | 2,348,824 | 0.6%
Physics | 1,338 | 125,083 | 1,871,882 | 0.5%
Philology | 976 | 361,234 | 5,710,387 | 1.5%
Philosophy | 880 | 541,937 | 9,239,636 | 2.5%
Chemical industry | 108 | 8,070 | 115,041 | 0.0%
Chemistry | 1,162 | 126,142 | 1,703,288 | 0.5%
Private life | 21,142 | 4,582,749 | 48,619,037 | 12.9%
Electronics | 748 | 47,504 | 671,773 | 0.2%
Energy industry | 177 | 18,409 | 278,095 | 0.1%
Total | 218,424 | 29,305,127 | 375,624,029 | 100%

Dates

Texts within the main corpus by date of creation

Date | Number of texts | Number of sentences | Number of tokens | Percentage of tokens
1701 - 1750 | 379 | 63,020 | 1,209,204 | 0.3%
1751 - 1800 | 2,256 | 340,784 | 6,302,572 | 1.5%
1801 - 1850 | 3,409 | 1,416,154 | 21,234,884 | 4.9%
1851 - 1900 | 5,202 | 5,341,224 | 71,332,917 | 16.4%
1901 - 1950 | 58,156 | 10,301,226 | 116,617,408 | 26.9%
1951 - 2000 | 23,004 | 11,615,619 | 125,667,914 | 28.9%
2001 - 2022 | 42,991 | 7,603,444 | 91,884,926 | 21.2%
Total | 135,397 | 36,681,471 | 434,249,825 | 100%

Parts of speech

Tokens by part of speech (disambiguated corpus only)

Part of speech | Number of tokens | Percentage of tokens
Noun | 1,722,425 | 28.7%
Adjective | 511,009 | 8.5%
Numeral | 102,793 | 1.7%
of these, written as words | 43,001 | 0.7%
of these, written as digits | 59,792 | 1.0%
Numeral adjective | 24,628 | 0.4%
Verb | 1,014,087 | 16.9%
Adverb | 254,085 | 4.2%
Predicative | 42,806 | 0.7%
Parenthesis | 26,766 | 0.4%
Pronoun | 471,979 | 7.9%
Adjectival pronoun | 280,989 | 4.7%
Adverbial pronoun | 130,447 | 2.2%
Predicative pronoun (некого, нечего) | 678 | 0.0%
Preposition | 627,529 | 10.5%
Conjunction | 476,107 | 7.9%
Particle | 266,854 | 4.4%
Interjection | 8,665 | 0.1%
Initial | 10,128 | 0.2%
Other (foreign words, onomatopoeia) | 31,409 | 0.5%
Total | 6,003,384 | 100%
