Corpora statistics

This page presents general statistical information about the corpora included in the Russian National Corpus (RNC).

For some RNC corpora, extended statistics are available. In addition to the number of texts and words, they include charts showing the distribution of metadata attributes, geographic maps, and graphs of corpus volume by country and region (for corpora with regional annotation). To access a corpus's statistics, click the icon in the corpus header or click the corpus name on this page.

In corpora with extended statistics, users can also compare their own subcorpora with the entire corpus. To view the comparative data, click the icon in the subcorpus header.

Number of texts

Texts by subcorpora

| Corpus | Number of texts | Number of sentences | Number of tokens | Percentage of tokens |
| --- | --- | --- | --- | --- |
| Main | 133,554 | 31,626,103 | 389,470,868 | 17.5% |
| including manually disambiguated | 2,164 | 514,169 | 5,988,177 | 0.3% |
| Media | 2,838,953 | 57,862,754 | 850,630,557 | 38.2% |
| National media | 2,728,688 | 55,215,073 | 815,141,029 | 36.6% |
| Regional & international | 110,265 | 2,647,681 | 35,489,528 | 1.6% |
| SynTagRus | 1,304 | 109,886 | 1,568,027 | 0.1% |
| Social networks | 1,768,134 | 14,443,641 | 161,432,452 | 7.3% |
| Spoken | 4,598 | 2,052,206 | 14,854,033 | 0.7% |
| Accentological | 1,342,851 | 13,559,353 | 135,768,732 | 6.1% |
| Multimedia | 1,383 | 1,019,768 | 5,763,881 | 0.3% |
| MultiPARC | 56 | 91,341 | 528,851 | 0.0% |
| Russian | 26 | 48,236 | 299,520 | 0.0% |
| English-Russian | 30 | 43,105 | 229,331 | 0.0% |
| Parallel | 13,791 | 16,619,044 | 214,046,070 | 9.6% |
| English | 1,556 | 3,476,864 | 51,982,439 | 2.3% |
| Armenian | 28 | 126,636 | 1,570,735 | 0.1% |
| Bashkir | 124 | 124,270 | 550,387 | 0.0% |
| Belarusian | 312 | 1,162,868 | 10,916,697 | 0.5% |
| Bulgarian | 59 | 418,986 | 5,159,901 | 0.2% |
| Buryat | 7 | 30,750 | 401,516 | 0.0% |
| Veps | 989 | 40,780 | 343,133 | 0.0% |
| Spanish | 168 | 492,464 | 7,359,538 | 0.3% |
| Italian | 126 | 302,264 | 4,930,970 | 0.2% |
| Karelian | 2,355 | 125,702 | 1,223,760 | 0.1% |
| Chinese | 1,075 | 253,500 | 4,422,747 | 0.2% |
| Korean | 185 | 12,300 | 73,752 | 0.0% |
| Latvian | 245 | 410,438 | 4,398,564 | 0.2% |
| Lithuanian | 65 | 72,244 | 702,471 | 0.0% |
| German | 299 | 2,234,024 | 32,276,755 | 1.5% |
| Polish | 54 | 501,800 | 6,355,629 | 0.3% |
| Portuguese | 38 | 88,572 | 1,602,412 | 0.1% |
| Romanian | 31 | 60,140 | 903,375 | 0.0% |
| Serbian | 37 | 144,027 | 1,903,176 | 0.1% |
| Slovene | 53 | 173,172 | 1,989,641 | 0.1% |
| Ukrainian | 865 | 919,426 | 9,383,774 | 0.4% |
| Finnish | 320 | 299,184 | 3,741,431 | 0.2% |
| French | 67 | 498,180 | 7,631,430 | 0.3% |
| Khakas | 331 | 126,710 | 1,194,971 | 0.1% |
| Hindi β | 9 | 9,292 | 122,347 | 0.0% |
| Romani | 20 | 16,240 | 185,142 | 0.0% |
| Czech | 556 | 334,562 | 4,387,470 | 0.2% |
| Chuvash | 2,820 | 2,375,948 | 24,168,622 | 1.1% |
| Swedish | 787 | 1,344,054 | 16,520,152 | 0.7% |
| Estonian | 95 | 192,493 | 2,154,889 | 0.1% |
| Japanese | 103 | 31,512 | 453,279 | 0.0% |
| Multilingual | 12 | 219,642 | 5,034,965 | 0.2% |
| Dialect | 2,014 | 125,156 | 599,258 | 0.0% |
| Educational | 1,247 | 1,184,926 | 13,761,608 | 0.6% |
| From 2 to 15 | 75 | 413,781 | 4,408,536 | 0.2% |
| Poetry | 103,626 | 1,361,340 | 14,097,265 | 0.6% |
| Russian classics β | 27,289 | 1,544,467 | 18,556,005 | 0.8% |
| Historical | 11,996 | 833,227 | 15,427,893 | 0.7% |
| Old East Slavic | 337 | | 881,706 | 0.0% |
| Inscriptions | 749 | | 6,039 | 0.0% |
| Birchbark letters | 1,249 | 1,249 | 23,932 | 0.0% |
| Middle Russian | 8,242 | 399,642 | 9,251,633 | 0.4% |
| Church Slavonic | 1,419 | 432,336 | 5,264,583 | 0.2% |
| Panchronic | 141,035 | 30,890,027 | 384,096,728 | 17.3% |
| Total | 6,391,906 | 173,737,020 | 2,225,010,764 | 100% |
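
The "Percentage of tokens" column appears to be each corpus's token count divided by the token count in the Total row, rounded to one decimal place. The following is a minimal Python sketch under that assumption; the helper name `token_share` is illustrative and not part of any RNC tool. It recomputes the share for three rows using figures copied from the table.

```python
# Token counts copied from the "Texts by subcorpora" table.
# Assumption: percentages are taken relative to the Total row.
TOTAL_TOKENS = 2_225_010_764

token_counts = {
    "Main": 389_470_868,
    "Media": 850_630_557,
    "Parallel": 214_046_070,
}


def token_share(tokens: int, total: int = TOTAL_TOKENS) -> float:
    """Return tokens as a percentage of total, rounded to one decimal place."""
    return round(100 * tokens / total, 1)


# Recompute the share for each selected corpus.
shares = {name: token_share(count) for name, count in token_counts.items()}

for name, share in shares.items():
    print(f"{name}: {share}%")
```

Under this assumption the recomputed shares match the published 17.5%, 38.2%, and 9.6%. Very small corpora round to 0.0% because their share is below 0.05% of the total.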

Text types

Texts within the main corpus by type and other metadata attributes

| Text type | Number of texts | Number of sentences | Number of tokens | Percentage of tokens |
| --- | --- | --- | --- | --- |
| Non-fiction | 122,251 | 16,721,496 | 231,498,004 | 59.4% |
| Fiction | 11,303 | 14,904,607 | 157,972,864 | 40.6% |
| Total | 133,554 | 31,626,103 | 389,470,868 | 100% |
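
As a quick consistency check, the two text-type rows should add up to the Total row in each count column. The sketch below uses figures copied from the table; the helper name `column_sums` is hypothetical and only illustrates the arithmetic.

```python
# Rows copied from the "Text types" table: (texts, sentences, tokens).
rows = {
    "Non-fiction": (122_251, 16_721_496, 231_498_004),
    "Fiction": (11_303, 14_904_607, 157_972_864),
}
published_total = (133_554, 31_626_103, 389_470_868)


def column_sums(table):
    """Sum each count column across all rows of the table."""
    return tuple(sum(column) for column in zip(*table.values()))


computed_total = column_sums(rows)

# Compare the recomputed sums with the published Total row.
print("Totals match:", computed_total == published_total)
```

The same check can be applied to any other table on this page by swapping in its rows and its Total line.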

Fiction

| Genre | Number of texts | Number of sentences | Number of tokens | Percentage of tokens |
| --- | --- | --- | --- | --- |
| Crime | 138 | 817,058 | 7,656,452 | 4.8% |
| Children's literature | 860 | 759,540 | 7,006,755 | 4.4% |
| Nonfiction | 464 | 1,062,528 | 12,620,560 | 7.9% |
| Drama | 306 | 636,194 | 3,440,098 | 2.1% |
| Historical prose | 295 | 1,287,535 | 14,918,140 | 9.3% |
| Love story | 69 | 190,000 | 1,805,976 | 1.1% |
| Medical prose | 3 | 17,773 | 170,643 | 0.1% |
| No genre | 6,366 | 8,222,857 | 90,763,917 | 56.5% |
| Transliteration | 30 | 43,291 | 696,938 | 0.4% |
| Adventure | 273 | 519,559 | 5,506,667 | 3.4% |
| Sentimental fiction | 30 | 10,463 | 167,255 | 0.1% |
| Thriller | 1 | 6,950 | 60,653 | 0.0% |
| Sci-fi | 774 | 1,007,091 | 10,150,031 | 6.3% |
| Folklore | 77 | 8,715 | 180,657 | 0.1% |
| Humour and satire | 1,560 | 569,944 | 5,614,438 | 3.5% |
| Total | 11,246 | 15,159,498 | 160,759,180 | 100% |

Non-fiction

| Domain | Number of texts | Number of sentences | Number of tokens | Percentage of tokens |
| --- | --- | --- | --- | --- |
| Day-to-day life | 6,802 | 3,214,744 | 33,709,904 | 14.3% |
| Official and business | 3,660 | 353,924 | 5,375,463 | 2.3% |
| Technical | 1,211 | 116,853 | 1,639,468 | 0.7% |
| Journalism | 98,350 | 9,936,754 | 140,263,010 | 59.4% |
| Advertising | 2,153 | 76,326 | 844,061 | 0.4% |
| Academic | 8,369 | 2,565,262 | 44,186,194 | 18.7% |
| Fiction | 57 | 124,643 | 1,257,854 | 0.5% |
| Theological | 1,218 | 332,874 | 5,290,038 | 2.2% |
| Electronic communication | 877 | 336,484 | 3,382,171 | 1.4% |
| Total | 122,697 | 17,057,864 | 235,948,163 | 100% |

| Text topic | Number of texts | Number of sentences | Number of tokens | Percentage of tokens |
| --- | --- | --- | --- | --- |
| Administration and management | 17,487 | 1,430,212 | 17,800,870 | 4.5% |
| Anthropology | 10 | 15,284 | 313,667 | 0.1% |
| Army and armed conflict | 12,778 | 1,244,133 | 15,577,899 | 4.0% |
| Archaeology | 21 | 2,021 | 29,228 | 0.0% |
| Astrology, parapsychology, esoterica | 432 | 99,808 | 1,035,462 | 0.3% |
| Astronomy | 449 | 41,100 | 648,036 | 0.2% |
| Business, commerce, economics, finance | 12,348 | 741,269 | 10,337,185 | 2.6% |
| Biology | 1,257 | 297,158 | 4,732,290 | 1.2% |
| Military affairs | 13 | 11,495 | 244,685 | 0.1% |
| Geography | 470 | 216,106 | 3,711,171 | 0.9% |
| Geodesy | 1 | 746 | 15,342 | 0.0% |
| Geology | 631 | 128,256 | 1,872,559 | 0.5% |
| Mining industry | 393 | 25,102 | 419,263 | 0.1% |
| Home and home economy | 1,342 | 130,787 | 1,925,568 | 0.5% |
| Leisure and entertainment | 5,878 | 457,482 | 4,844,725 | 1.2% |
| Natural science | 679 | 192,826 | 2,293,034 | 0.6% |
| Natural history | 30 | 13,084 | 210,379 | 0.1% |
| Health and medicine | 6,098 | 498,185 | 6,683,889 | 1.7% |
| IT | 691 | 82,659 | 1,318,016 | 0.3% |
| Art and culture | 18,724 | 3,473,674 | 41,890,793 | 10.6% |
| Art history | 122 | 36,553 | 570,920 | 0.1% |
| History | 5,373 | 1,758,270 | 27,607,738 | 7.0% |
| Crime | 10,701 | 367,033 | 3,927,182 | 1.0% |
| Culturology | 732 | 193,298 | 3,297,462 | 0.8% |
| Light industry, food industry | 329 | 23,991 | 371,259 | 0.1% |
| Forestry | 94 | 9,430 | 146,011 | 0.0% |
| Logic | 1 | 3,478 | 51,815 | 0.0% |
| Mathematics | 222 | 41,315 | 608,027 | 0.2% |
| Machinery | 25 | 1,987 | 30,883 | 0.0% |
| Metallurgy | 21 | 2,078 | 32,288 | 0.0% |
| Science and technology | 11,860 | 2,427,720 | 40,645,901 | 10.3% |
| Education | 4,146 | 656,607 | 7,610,130 | 1.9% |
| Politics and society | 34,650 | 4,016,535 | 54,470,831 | 13.8% |
| Political science | 18 | 7,009 | 117,321 | 0.0% |
| Law | 3,704 | 301,081 | 4,697,751 | 1.2% |
| Nature | 4,621 | 538,074 | 6,278,412 | 1.6% |
| Industry | 5,093 | 348,168 | 4,415,324 | 1.1% |
| Accidents | 237 | 9,552 | 97,367 | 0.0% |
| Psychology | 712 | 176,527 | 2,811,204 | 0.7% |
| Travel | 2,337 | 935,298 | 12,756,625 | 3.2% |
| Religion | 7,016 | 1,055,934 | 14,963,383 | 3.8% |
| Agriculture | 2,186 | 238,562 | 3,204,086 | 0.8% |
| Sociology | 513 | 148,612 | 2,440,354 | 0.6% |
| Sport | 4,206 | 288,331 | 3,697,728 | 0.9% |
| Statistics | 374 | 18,439 | 286,021 | 0.1% |
| Construction, architecture | 2,258 | 161,675 | 2,097,443 | 0.5% |
| Technology | 8,279 | 585,399 | 7,514,822 | 1.9% |
| Transport | 4,994 | 220,232 | 2,441,052 | 0.6% |
| Physics | 1,359 | 126,230 | 1,930,682 | 0.5% |
| Philology | 1,097 | 385,822 | 6,524,220 | 1.7% |
| Philosophy | 893 | 502,349 | 8,862,164 | 2.3% |
| Chemical industry | 108 | 8,028 | 114,948 | 0.0% |
| Chemistry | 1,168 | 139,780 | 2,067,993 | 0.5% |
| Private life | 21,568 | 4,569,054 | 50,039,138 | 12.7% |
| Electronics | 748 | 45,599 | 670,143 | 0.2% |
| Energy industry | 177 | 18,067 | 277,405 | 0.1% |
| Ethnography | 7 | 8,098 | 164,164 | 0.0% |
| Total | 221,681 | 29,475,602 | 393,744,258 | 100% |

Dates

Texts within the main corpus by date of creation

| Date | Number of texts | Number of sentences | Number of tokens | Percentage of tokens |
| --- | --- | --- | --- | --- |
| 1651 - 1700 | 5 | 16,633 | 332,648 | 0.1% |
| 1701 - 1750 | 382 | 64,591 | 1,255,871 | 0.3% |
| 1751 - 1800 | 1,931 | 338,302 | 6,631,305 | 1.7% |
| 1801 - 1850 | 3,323 | 1,195,349 | 18,670,143 | 4.7% |
| 1851 - 1900 | 4,976 | 4,587,242 | 64,977,239 | 16.2% |
| 1901 - 1950 | 57,915 | 8,883,268 | 103,840,739 | 26.0% |
| 1951 - 2000 | 21,998 | 9,984,063 | 111,278,506 | 27.8% |
| 2001 - 2050 | 43,485 | 7,395,050 | 93,108,359 | 23.3% |
| Total | 134,015 | 32,464,498 | 400,094,810 | 100% |

Parts of speech

Tokens by part of speech (disambiguated corpus only)

| Part of speech | Number of tokens | Percentage of tokens |
| --- | --- | --- |
| Noun | 1,718,410 | 28.7% |
| Adjective | 510,957 | 8.5% |
| Numeral | 96,851 | 1.6% |
| of these, written in digits | 53,817 | 0.9% |
| of these, written in words | 43,034 | 0.7% |
| Numeral adjective | 24,589 | 0.4% |
| Verb | 1,013,248 | 16.9% |
| Adverb | 253,573 | 4.2% |
| Predicative | 42,762 | 0.7% |
| Parenthesis | 26,721 | 0.4% |
| Pronoun | 471,700 | 7.9% |
| Adjectival pronoun | 280,716 | 4.7% |
| Adverbial pronoun | 130,434 | 2.2% |
| Predicative pronoun (некого, нечего) | 678 | 0.0% |
| Preposition | 626,906 | 10.5% |
| Conjunction | 475,769 | 7.9% |
| Particle | 266,675 | 4.5% |
| Interjection | 8,628 | 0.1% |
| Initial | 10,002 | 0.2% |
| Other (foreign words, onomatopoeia) | 29,536 | 0.5% |
| Total | 5,988,155 | 100% |
