Corpora statistics

This page presents general statistical information about the corpora included in the Russian National Corpus (RNC).

For some RNC corpora, extended statistics are available: in addition to the number of texts and words, they include charts showing the distribution of metadata attributes, geographic maps, and graphs of volume distribution by country and region (for corpora with regional annotation). To access a corpus's statistics, click the icon in the corpus header or click the corpus name on this page.

In corpora with extended statistics, you can also compare a user-defined subcorpus with the entire corpus. To view the comparison, click the icon in the subcorpus header.

Number of texts

Texts by subcorpora

Corpus | Number of texts | Number of sentences | Number of tokens | Percentage of tokens
Main | 131,488 | 30,499,014 | 374,449,975 | 17.0%
including manually disambiguated | 2,170 | 519,726 | 6,106,069 | 0.3%
Media | 2,838,953 | 57,862,754 | 850,630,557 | 38.5%
National media | 2,728,688 | 55,215,073 | 815,141,029 | 36.9%
Regional & international | 110,265 | 2,647,681 | 35,489,528 | 1.6%
SynTagRus | 1,304 | 109,886 | 1,568,027 | 0.1%
Social networks | 1,768,134 | 14,443,641 | 161,432,452 | 7.3%
Spoken | 4,598 | 2,052,206 | 14,854,033 | 0.7%
Accentological | 1,340,746 | 13,534,835 | 135,550,981 | 6.1%
Multimedia | 1,383 | 1,019,768 | 5,763,881 | 0.3%
MultiPARC | 56 | 91,341 | 528,851 | 0.0%
Russian | 26 | 48,236 | 299,520 | 0.0%
English-Russian | 30 | 43,105 | 229,331 | 0.0%
Parallel | 13,780 | 16,591,904 | 213,608,814 | 9.7%
English | 1,556 | 3,476,864 | 51,982,439 | 2.4%
Armenian | 28 | 126,636 | 1,570,735 | 0.1%
Bashkir | 124 | 124,270 | 550,387 | 0.0%
Belarusian | 312 | 1,162,868 | 10,916,697 | 0.5%
Bulgarian | 59 | 418,986 | 5,159,901 | 0.2%
Buryat | 7 | 30,750 | 401,516 | 0.0%
Veps | 989 | 40,780 | 343,133 | 0.0%
Spanish | 157 | 465,324 | 6,922,282 | 0.3%
Italian | 126 | 302,264 | 4,930,970 | 0.2%
Karelian | 2,355 | 125,702 | 1,223,760 | 0.1%
Chinese | 1,075 | 253,500 | 4,422,747 | 0.2%
Korean | 185 | 12,300 | 73,752 | 0.0%
Latvian | 245 | 410,438 | 4,398,564 | 0.2%
Lithuanian | 65 | 72,244 | 702,471 | 0.0%
German | 299 | 2,234,024 | 32,276,755 | 1.5%
Polish | 54 | 501,800 | 6,355,629 | 0.3%
Portuguese | 38 | 88,572 | 1,602,412 | 0.1%
Romanian | 31 | 60,140 | 903,375 | 0.0%
Serbian | 37 | 144,027 | 1,903,176 | 0.1%
Slovene | 53 | 173,172 | 1,989,641 | 0.1%
Ukrainian | 865 | 919,426 | 9,383,774 | 0.4%
Finnish | 320 | 299,184 | 3,741,431 | 0.2%
French | 67 | 498,180 | 7,631,430 | 0.3%
Khakas | 331 | 126,710 | 1,194,971 | 0.1%
Hindi β | 9 | 9,292 | 122,347 | 0.0%
Romani | 20 | 16,240 | 185,142 | 0.0%
Czech | 556 | 334,562 | 4,387,470 | 0.2%
Chuvash | 2,820 | 2,375,948 | 24,168,622 | 1.1%
Swedish | 787 | 1,344,054 | 16,520,152 | 0.7%
Estonian | 95 | 192,493 | 2,154,889 | 0.1%
Japanese | 103 | 31,512 | 453,279 | 0.0%
Multilingual | 12 | 219,642 | 5,034,965 | 0.2%
Dialect | 2,014 | 125,156 | 599,258 | 0.0%
Educational | 1,247 | 1,184,926 | 13,761,608 | 0.6%
From 2 to 15 | 75 | 413,781 | 4,408,536 | 0.2%
Poetry | 101,521 | 1,336,822 | 13,879,514 | 0.6%
Russian classics β | 27,289 | 1,544,467 | 18,556,005 | 0.8%
Historical | 11,996 | 833,227 | 15,427,893 | 0.7%
Old East Slavic | 337 | | 881,706 | 0.0%
Inscriptions | 749 | | 6,039 | 0.0%
Birchbark letters | 1,249 | 1,249 | 23,932 | 0.0%
Middle Russian | 8,242 | 399,642 | 9,251,633 | 0.4%
Church Slavonic | 1,419 | 432,336 | 5,264,583 | 0.2%
Panchronic | 141,035 | 30,890,027 | 384,096,728 | 17.4%
Total | 6,385,619 | 172,533,755 | 2,209,117,113 | 100%
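
The "Percentage of tokens" column appears to be each corpus's token count divided by the total token count, rounded to one decimal place. The following minimal Python sketch illustrates that reading with a few rows from the table above; the helper name `token_share` is hypothetical and is not part of the RNC site.

```python
# Token counts copied from the "Texts by subcorpora" table (a few rows only).
TOTAL_TOKENS = 2_209_117_113  # the "Total" row

token_counts = {
    "Main": 374_449_975,
    "Media": 850_630_557,
    "Social networks": 161_432_452,
    "Parallel": 213_608_814,
}


def token_share(tokens: int, total: int = TOTAL_TOKENS) -> str:
    """Return tokens as a percentage of total, formatted with one decimal.

    Assumption: the table's percentages are computed this way.
    """
    return f"{100 * tokens / total:.1f}%"


for name, tokens in token_counts.items():
    print(f"{name}: {token_share(tokens)}")
```

Rows for very small corpora show 0.0% simply because their share is below 0.05% of the total.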

Text types

Texts within the main corpus by type and other metadata attributes

Text type | Number of texts | Number of sentences | Number of tokens | Percentage of tokens
Non-fiction | 120,563 | 16,724,139 | 223,140,501 | 59.5%
Fiction | 10,967 | 14,998,103 | 151,878,171 | 40.5%
Total | 131,530 | 31,722,242 | 375,018,672 | 100%

Fiction

Genre | Number of texts | Number of sentences | Number of tokens | Percentage of tokens
Crime | 139 | 860,979 | 7,722,512 | 5.0%
Children's literature | 848 | 764,249 | 6,635,985 | 4.3%
Nonfiction | 462 | 1,083,362 | 12,453,680 | 8.1%
Drama | 307 | 617,011 | 3,155,297 | 2.0%
Historical prose | 282 | 1,319,374 | 14,437,611 | 9.4%
Love story | 55 | 169,273 | 1,542,336 | 1.0%
No genre | 6,094 | 8,191,709 | 86,466,150 | 56.1%
Transliteration | 16 | 13,415 | 185,172 | 0.1%
Adventure | 280 | 570,595 | 5,828,821 | 3.8%
Miscellaneous | 80 | 27,709 | 351,910 | 0.2%
Sentimental fiction | 30 | 10,867 | 167,334 | 0.1%
Sci-fi | 733 | 999,272 | 9,724,283 | 6.3%
Humour and satire | 1,569 | 604,686 | 5,585,078 | 3.6%
Total | 10,895 | 15,232,501 | 154,256,169 | 100%

Non-fiction

Domain | Number of texts | Number of sentences | Number of tokens | Percentage of tokens
Day-to-day life | 6,308 | 3,270,147 | 33,360,180 | 14.7%
Official and business | 3,534 | 332,593 | 5,102,771 | 2.3%
Technical | 1,210 | 120,232 | 1,621,124 | 0.7%
Journalism | 97,896 | 9,983,031 | 136,953,541 | 60.5%
Advertising | 2,153 | 84,875 | 853,096 | 0.4%
Academic | 7,759 | 2,442,550 | 39,858,190 | 17.6%
Theological | 1,219 | 373,689 | 5,298,688 | 2.3%
Electronic communication | 888 | 352,547 | 3,474,946 | 1.5%
Total | 120,967 | 16,959,664 | 226,522,536 | 100%

Text topic | Number of texts | Number of sentences | Number of tokens | Percentage of tokens
Administration and management | 17,352 | 1,432,831 | 17,313,749 | 4.6%
Army and armed conflict | 12,702 | 1,271,972 | 15,623,927 | 4.2%
Archaeology | 21 | 2,034 | 29,368 | 0.0%
Astrology, parapsychology, esoterica | 432 | 101,154 | 1,035,185 | 0.3%
Astronomy | 449 | 41,821 | 649,728 | 0.2%
Business, commerce, economics, finance | 12,339 | 772,790 | 10,336,228 | 2.8%
Biology | 1,202 | 225,567 | 3,512,071 | 0.9%
Military affairs | 12 | 12,459 | 235,429 | 0.1%
Geography | 461 | 219,107 | 3,661,940 | 1.0%
Geodesy | 1 | 613 | 15,250 | 0.0%
Geology | 631 | 132,920 | 1,876,853 | 0.5%
Mining industry | 393 | 27,414 | 422,038 | 0.1%
Home and home economy | 1,325 | 92,597 | 1,122,868 | 0.3%
Leisure and entertainment | 5,878 | 479,007 | 4,835,165 | 1.3%
Natural science | 685 | 203,688 | 2,357,321 | 0.6%
Natural history | 30 | 13,663 | 209,619 | 0.1%
Health and medicine | 6,114 | 532,349 | 6,607,054 | 1.8%
IT | 665 | 85,394 | 1,295,556 | 0.3%
Art and culture | 18,094 | 3,370,208 | 39,702,197 | 10.6%
Art history | 122 | 37,886 | 572,173 | 0.2%
History | 5,236 | 1,792,984 | 27,041,126 | 7.2%
Crime | 10,700 | 376,771 | 3,899,899 | 1.0%
Culturology | 355 | 128,026 | 2,054,761 | 0.5%
Light industry, food industry | 329 | 24,575 | 372,466 | 0.1%
Forestry | 94 | 9,848 | 146,354 | 0.0%
Logic | 1 | 3,464 | 51,840 | 0.0%
Mathematics | 218 | 43,041 | 610,062 | 0.2%
Machinery | 25 | 2,026 | 30,965 | 0.0%
Metallurgy | 21 | 2,098 | 32,409 | 0.0%
Science and technology | 11,190 | 2,278,086 | 36,021,595 | 9.6%
Education | 4,126 | 671,779 | 7,440,823 | 2.0%
Politics and society | 34,380 | 4,056,351 | 53,209,214 | 14.2%
Political science | 18 | 7,301 | 117,753 | 0.0%
Law | 3,701 | 311,517 | 4,689,965 | 1.2%
Nature | 4,582 | 490,472 | 5,609,959 | 1.5%
Industry | 5,093 | 354,633 | 4,366,983 | 1.2%
Accidents | 230 | 9,838 | 93,509 | 0.0%
Psychology | 706 | 170,791 | 2,635,726 | 0.7%
Travel | 2,330 | 967,866 | 12,788,062 | 3.4%
Religion | 6,972 | 1,114,064 | 14,750,077 | 3.9%
Agriculture | 2,150 | 196,473 | 2,310,054 | 0.6%
Sociology | 485 | 125,290 | 1,976,834 | 0.5%
Sport | 4,200 | 292,914 | 3,564,947 | 0.9%
Statistics | 368 | 15,999 | 230,886 | 0.1%
Construction, architecture | 2,237 | 175,700 | 2,078,298 | 0.6%
Technology | 8,254 | 598,351 | 7,527,780 | 2.0%
Transport | 4,984 | 218,267 | 2,348,824 | 0.6%
Physics | 1,338 | 125,083 | 1,871,882 | 0.5%
Philology | 976 | 361,234 | 5,710,387 | 1.5%
Philosophy | 880 | 541,937 | 9,239,636 | 2.5%
Chemical industry | 108 | 8,070 | 115,041 | 0.0%
Chemistry | 1,162 | 126,142 | 1,703,288 | 0.5%
Private life | 21,142 | 4,582,749 | 48,619,037 | 12.9%
Electronics | 748 | 47,504 | 671,773 | 0.2%
Energy industry | 177 | 18,409 | 278,095 | 0.1%
Total | 218,424 | 29,305,127 | 375,624,029 | 100%

Dates

Texts within the main corpus by date of creation

Date | Number of texts | Number of sentences | Number of tokens | Percentage of tokens
1701–1750 | 379 | 63,020 | 1,209,204 | 0.3%
1751–1800 | 2,256 | 340,784 | 6,302,572 | 1.5%
1801–1850 | 3,409 | 1,416,154 | 21,234,884 | 4.9%
1851–1900 | 5,202 | 5,341,224 | 71,332,917 | 16.4%
1901–1950 | 58,156 | 10,301,226 | 116,617,408 | 26.9%
1951–2000 | 23,004 | 11,615,619 | 125,667,914 | 28.9%
2001–2022 | 42,991 | 7,603,444 | 91,884,926 | 21.2%
Total | 135,397 | 36,681,471 | 434,249,825 | 100%

Parts of speech

Tokens by part of speech (Disambiguated corpus only)

Part of speech | Number of tokens | Percentage of tokens
Noun | 1,722,425 | 28.7%
Adjective | 511,009 | 8.5%
Numeral | 102,793 | 1.7%
of these, written as words | 43,001 | 0.7%
of these, written in digits | 59,792 | 1.0%
Numeral adjective | 24,628 | 0.4%
Verb | 1,014,087 | 16.9%
Adverb | 254,085 | 4.2%
Predicative | 42,806 | 0.7%
Parenthesis | 26,766 | 0.4%
Pronoun | 471,979 | 7.9%
Adjectival pronoun | 280,989 | 4.7%
Adverbial pronoun | 130,447 | 2.2%
Predicative pronoun (некого, нечего) | 678 | 0.0%
Preposition | 627,529 | 10.5%
Conjunction | 476,107 | 7.9%
Particle | 266,854 | 4.4%
Interjection | 8,665 | 0.1%
Initial | 10,128 | 0.2%
Other (foreign words, onomatopoeia) | 31,409 | 0.5%
Total | 6,003,384 | 100%
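
The part-of-speech counts can be cross-checked against the Total row. The sketch below is a minimal Python check, assuming that the two "of these" rows break down the Numeral count and are therefore not added to the total a second time. The dictionary keys are shortened labels chosen for this example.

```python
# Token counts copied from the part-of-speech table above.
pos_tokens = {
    "Noun": 1_722_425, "Adjective": 511_009, "Numeral": 102_793,
    "Numeral adjective": 24_628, "Verb": 1_014_087, "Adverb": 254_085,
    "Predicative": 42_806, "Parenthesis": 26_766, "Pronoun": 471_979,
    "Adjectival pronoun": 280_989, "Adverbial pronoun": 130_447,
    "Predicative pronoun": 678, "Preposition": 627_529,
    "Conjunction": 476_107, "Particle": 266_854, "Interjection": 8_665,
    "Initial": 10_128, "Other": 31_409,
}

# Breakdown of the Numeral row (assumed to be a subset, not extra tokens).
numeral_breakdown = {"words": 43_001, "digits": 59_792}

REPORTED_TOTAL = 6_003_384  # the "Total" row


def sums_to(rows: dict[str, int], expected: int) -> bool:
    """Return True if the counts in rows add up to the expected value."""
    return sum(rows.values()) == expected


print("POS rows match total:", sums_to(pos_tokens, REPORTED_TOTAL))
print("Numeral breakdown matches:", sums_to(numeral_breakdown, pos_tokens["Numeral"]))
```

Both checks hold for the figures shown, which supports reading the two "of these" rows as a subdivision of Numeral.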
