Corpora statistics

Number of texts

Texts by subcorpora

| Corpora | Number of texts | Number of sentences | Number of tokens | Percentage of tokens |
|---|---:|---:|---:|---:|
| Main | 131,530 | 31,722,242 | 375,018,672 | 25.0% |
| including manually disambiguated | 2,170 | 520,509 | 6,003,390 | 0.4% |
| Media | 2,660,038 | 52,482,878 | 765,546,444 | 51.1% |
| National media | 60,196 | 1,917,995 | 24,512,376 | 1.6% |
| SynTagRus | 768 | 95,820 | 1,350,812 | 0.1% |
| Dialect | 1,452 | 89,454 | 485,400 | 0.0% |
| Educational | 229 | 65,664 | 664,804 | 0.0% |
| Parallel | 5,650 | 11,409,490 | 151,926,332 | 10.1% |
| Poetry | 94,932 | 1,260,163 | 13,000,390 | 0.9% |
| Spoken | 4,210 | 1,877,357 | 13,399,937 | 0.9% |
| Accentological | 1,333,807 | 13,296,847 | 133,303,350 | 8.9% |
| Multimedia | 1,227 | 984,862 | 5,449,075 | 0.4% |
| MultiPARC (Russian) | 21 | 36,566 | 229,207 | 0.0% |
| MultiPARC (English-Russian) | 30 | 43,105 | 229,331 | 0.0% |
| Old East Slavic | 201 | 39,379 | 652,698 | 0.0% |
| Birchbark letters | 1,203 | 4,812 | 58,886 | 0.0% |
| Middle Russian | 7,493 | 368,719 | 8,561,261 | 0.6% |
| Church Slavonic | 1,160 | 375,181 | 4,476,006 | 0.3% |
| Total | 4,304,147 | 116,070,534 | 1,498,864,981 | 100% |
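
The "Percentage of tokens" column appears to be each subcorpus's token count divided by the total token count, rounded to one decimal place. The following is a minimal sketch of that arithmetic under that assumption; the names `token_counts`, `TOTAL_TOKENS` and `token_share` are hypothetical and used only for illustration.

```python
# Token counts copied from the subcorpus table (a subset of rows).
TOTAL_TOKENS = 1_498_864_981

token_counts = {
    "Main": 375_018_672,
    "Media": 765_546_444,
    "Parallel": 151_926_332,
    "Accentological": 133_303_350,
}


def token_share(tokens: int, total: int = TOTAL_TOKENS) -> float:
    """Return tokens as a percentage of total, rounded to one decimal place."""
    return round(100 * tokens / total, 1)


# Print the share of each listed subcorpus (assumed formula, see above).
for name, tokens in token_counts.items():
    print(f"{name}: {token_share(tokens)}%")
```

Rows shown as 0.0% are not empty: their share is simply below 0.05% of the total.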

Text types

Texts within the main corpus by type and other meta features

| Text type | Number of texts | Number of sentences | Number of tokens | Percentage of tokens |
|---|---:|---:|---:|---:|
| Non-fiction | 120,563 | 16,724,139 | 223,140,501 | 59.5% |
| Fiction | 10,967 | 14,998,103 | 151,878,171 | 40.5% |
| Total | 131,530 | 31,722,242 | 375,018,672 | 100% |

Fiction

| Genre | Number of texts | Number of sentences | Number of tokens | Percentage of tokens |
|---|---:|---:|---:|---:|
| Crime | 139 | 860,979 | 7,722,512 | 5.0% |
| Children's literature | 848 | 764,249 | 6,635,985 | 4.3% |
| Documentary | 462 | 1,083,362 | 12,453,680 | 8.1% |
| Drama | 307 | 617,011 | 3,155,297 | 2.0% |
| Historical prose | 282 | 1,319,374 | 14,437,611 | 9.4% |
| Love story | 55 | 169,273 | 1,542,336 | 1.0% |
| No genre | 6,094 | 8,191,709 | 86,466,150 | 56.1% |
| Translation | 16 | 13,415 | 185,172 | 0.1% |
| Adventure | 280 | 570,595 | 5,828,821 | 3.8% |
| Miscellaneous | 80 | 27,709 | 351,910 | 0.2% |
| Sentimental fiction | 30 | 10,867 | 167,334 | 0.1% |
| Sci-fi | 733 | 999,272 | 9,724,283 | 6.3% |
| Humour and satire | 1,569 | 604,686 | 5,585,078 | 3.6% |
| Total | 10,895 | 15,232,501 | 154,256,169 | 100% |

Non-fiction

| Domain | Number of texts | Number of sentences | Number of tokens | Percentage of tokens |
|---|---:|---:|---:|---:|
| Day-to-day life | 6,308 | 3,270,147 | 33,360,180 | 14.7% |
| Official and business | 3,534 | 332,593 | 5,102,771 | 2.3% |
| Technical | 1,210 | 120,232 | 1,621,124 | 0.7% |
| Journalism | 97,896 | 9,983,031 | 136,953,541 | 60.5% |
| Advertising | 2,153 | 84,875 | 853,096 | 0.4% |
| Academic | 7,759 | 2,442,550 | 39,858,190 | 17.6% |
| Theological | 1,219 | 373,689 | 5,298,688 | 2.3% |
| Electronic communication | 888 | 352,547 | 3,474,946 | 1.5% |
| Total | 120,967 | 16,959,664 | 226,522,536 | 100% |

| Text topic | Number of texts | Number of sentences | Number of tokens | Percentage of tokens |
|---|---:|---:|---:|---:|
| Administration and management | 17,352 | 1,432,831 | 17,313,749 | 4.6% |
| Army and armed conflict | 12,702 | 1,271,972 | 15,623,927 | 4.2% |
| Archaeology | 21 | 2,034 | 29,368 | 0.0% |
| Astrology, parapsychology, esoterica | 432 | 101,154 | 1,035,185 | 0.3% |
| Astronomy | 449 | 41,821 | 649,728 | 0.2% |
| Business, commerce, economics, finance | 12,339 | 772,790 | 10,336,228 | 2.8% |
| Biology | 1,202 | 225,567 | 3,512,071 | 0.9% |
| Military affairs | 12 | 12,459 | 235,429 | 0.1% |
| Geography | 461 | 219,107 | 3,661,940 | 1.0% |
| Geodesy | 1 | 613 | 15,250 | 0.0% |
| Geology | 631 | 132,920 | 1,876,853 | 0.5% |
| Mining industry | 393 | 27,414 | 422,038 | 0.1% |
| Home and home economy | 1,325 | 92,597 | 1,122,868 | 0.3% |
| Leisure and entertainment | 5,878 | 479,007 | 4,835,165 | 1.3% |
| Natural science | 685 | 203,688 | 2,357,321 | 0.6% |
| Natural history | 30 | 13,663 | 209,619 | 0.1% |
| Health and medicine | 6,114 | 532,349 | 6,607,054 | 1.8% |
| IT | 665 | 85,394 | 1,295,556 | 0.3% |
| Art and culture | 18,094 | 3,370,208 | 39,702,197 | 10.6% |
| Art history | 122 | 37,886 | 572,173 | 0.2% |
| History | 5,236 | 1,792,984 | 27,041,126 | 7.2% |
| Crime | 10,700 | 376,771 | 3,899,899 | 1.0% |
| Culturology | 355 | 128,026 | 2,054,761 | 0.5% |
| Light industry, food industry | 329 | 24,575 | 372,466 | 0.1% |
| Forestry | 94 | 9,848 | 146,354 | 0.0% |
| Logic | 1 | 3,464 | 51,840 | 0.0% |
| Mathematics | 218 | 43,041 | 610,062 | 0.2% |
| Machinery | 25 | 2,026 | 30,965 | 0.0% |
| Metallurgy | 21 | 2,098 | 32,409 | 0.0% |
| Science and technology | 11,190 | 2,278,086 | 36,021,595 | 9.6% |
| Education | 4,126 | 671,779 | 7,440,823 | 2.0% |
| Politics and society | 34,380 | 4,056,351 | 53,209,214 | 14.2% |
| Political science | 18 | 7,301 | 117,753 | 0.0% |
| Law | 3,701 | 311,517 | 4,689,965 | 1.2% |
| Nature | 4,582 | 490,472 | 5,609,959 | 1.5% |
| Industry | 5,093 | 354,633 | 4,366,983 | 1.2% |
| Accidents | 230 | 9,838 | 93,509 | 0.0% |
| Psychology | 706 | 170,791 | 2,635,726 | 0.7% |
| Travel | 2,330 | 967,866 | 12,788,062 | 3.4% |
| Religion | 6,972 | 1,114,064 | 14,750,077 | 3.9% |
| Agriculture | 2,150 | 196,473 | 2,310,054 | 0.6% |
| Sociology | 485 | 125,290 | 1,976,834 | 0.5% |
| Sport | 4,200 | 292,914 | 3,564,947 | 0.9% |
| Statistics | 368 | 15,999 | 230,886 | 0.1% |
| Construction, architecture | 2,237 | 175,700 | 2,078,298 | 0.6% |
| Technology | 8,254 | 598,351 | 7,527,780 | 2.0% |
| Transport | 4,984 | 218,267 | 2,348,824 | 0.6% |
| Physics | 1,338 | 125,083 | 1,871,882 | 0.5% |
| Philology | 976 | 361,234 | 5,710,387 | 1.5% |
| Philosophy | 880 | 541,937 | 9,239,636 | 2.5% |
| Chemical industry | 108 | 8,070 | 115,041 | 0.0% |
| Chemistry | 1,162 | 126,142 | 1,703,288 | 0.5% |
| Private life | 21,142 | 4,582,749 | 48,619,037 | 12.9% |
| Electronics | 748 | 47,504 | 671,773 | 0.2% |
| Energy industry | 177 | 18,409 | 278,095 | 0.1% |
| Total | 218,424 | 29,305,127 | 375,624,029 | 100% |

Dates

Texts within the main corpus by date of creation

| Date | Number of texts | Number of sentences | Number of tokens | Percentage of tokens |
|---|---:|---:|---:|---:|
| 1701 - 1750 | 379 | 63,020 | 1,209,204 | 0.3% |
| 1751 - 1800 | 2,256 | 340,784 | 6,302,572 | 1.5% |
| 1801 - 1850 | 3,409 | 1,416,154 | 21,234,884 | 4.9% |
| 1851 - 1900 | 5,202 | 5,341,224 | 71,332,917 | 16.4% |
| 1901 - 1950 | 58,156 | 10,301,226 | 116,617,408 | 26.9% |
| 1951 - 2000 | 23,004 | 11,615,619 | 125,667,914 | 28.9% |
| 2001 - 2022 | 42,991 | 7,603,444 | 91,884,926 | 21.2% |
| Total | 135,397 | 36,681,471 | 434,249,825 | 100% |
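
One way to sanity-check a table like this is to confirm that each column of the "Total" row equals the sum of the period rows. The sketch below does this for the dates table; the names `rows`, `total_row` and `column_sums` are hypothetical, and the figures are copied from the table as given.

```python
# (texts, sentences, tokens) per period, copied from the dates table.
rows = {
    "1701 - 1750": (379, 63_020, 1_209_204),
    "1751 - 1800": (2_256, 340_784, 6_302_572),
    "1801 - 1850": (3_409, 1_416_154, 21_234_884),
    "1851 - 1900": (5_202, 5_341_224, 71_332_917),
    "1901 - 1950": (58_156, 10_301_226, 116_617_408),
    "1951 - 2000": (23_004, 11_615_619, 125_667_914),
    "2001 - 2022": (42_991, 7_603_444, 91_884_926),
}

# The "Total" row as published.
total_row = (135_397, 36_681_471, 434_249_825)


def column_sums(table: dict) -> tuple:
    """Sum each column (texts, sentences, tokens) across all rows."""
    return tuple(sum(column) for column in zip(*table.values()))


sums = column_sums(rows)
print("computed:", sums)
print("matches published total:", sums == total_row)
```

For this table the computed sums agree with the published totals in all three columns.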

Parts of speech

Tokens by part of speech (disambiguated corpus only)

| Part of speech | Number of tokens | Percentage of tokens |
|---|---:|---:|
| Noun | 1,722,425 | 28.7% |
| Adjective | 511,009 | 8.5% |
| Numeral | 102,793 | 1.7% |
| of which, written as words | 43,001 | 0.7% |
| of which, written as digits | 59,792 | 1.0% |
| Numeral adjective | 24,628 | 0.4% |
| Verb | 1,014,087 | 16.9% |
| Adverb | 254,085 | 4.2% |
| Predicative | 42,806 | 0.7% |
| Parenthesis | 26,766 | 0.4% |
| Pronoun | 471,979 | 7.9% |
| Adjectival pronoun | 280,989 | 4.7% |
| Adverbial pronoun | 130,447 | 2.2% |
| Predicative pronoun (некого, нечего) | 678 | 0.0% |
| Preposition | 627,529 | 10.5% |
| Conjunction | 476,107 | 7.9% |
| Particle | 266,854 | 4.4% |
| Interjection | 8,665 | 0.1% |
| Initial | 10,128 | 0.2% |
| Other (foreign words, onomatopoeia) | 31,409 | 0.5% |
| Total | 6,003,384 | 100% |
