Statistics in subcorpus portrait

The Subcorpus statistics section provides tables, graphs, and charts that show the actual and diachronic statistics of the user's custom subcorpus in comparison with the corpus:

  1. A table and a graph with corpus and subcorpus size (number of texts and words).
  2. Geographical maps and diagrams showing the size distribution of the subcorpus and the corpus by country and region (only for corpora with regional mark-up).
  3. Diagrams showing the distribution of meta-attribute values in the corpus and subcorpus.

You can access the corpus and subcorpus comparison statistics through the corpus portrait by clicking the (i) button in the corpus header.

At the moment, actual statistics for the subcorpus are available for the Main, Educational, and Media corpora, some historical corpora, and the "Russian classics" and "From 2 to 15" corpora. They will be added to more corpora later.

Diachronic statistics for the subcorpus are available only in the Main corpus and the Regional and International media corpus.

Every chart and graph has a standard tooltip (?) that explains how to interpret the visualization. Each chart and graph can also be downloaded as an Excel/CSV file or as a screenshot.

Actual statistics

Corpus and subcorpus size

Corpus and subcorpus sizes are shown in texts and in words:

 

Maps

For comparison, two geographical maps show the regional distribution of the subcorpus and corpus sizes in the selected unit of measurement (texts or words). When you switch the unit of measurement, the maps are redrawn.

The size of the corpus in a particular region is shown by the color scale. When you hover the mouse over the shaded area, you can see the name of the region and the corresponding number of texts or words in the corpus.

You can download an Excel/CSV file with the original data used to build the map.

 

Distribution of texts

To compare the texts of the subcorpus and the corpus, two diagrams are shown. You can select the meta-attribute for which the diagrams are to be plotted from the list of the most representative attributes of the corpus, as well as the unit of size measurement: texts or words. When you switch the meta-attribute and/or the unit of measurement, the diagrams are redrawn.

For ease of comparison, the distribution is calculated for the top ten values of the selected meta-attribute in the subcorpus and for the corresponding values in the corpus. The remaining values are grouped under the Other category.
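The grouping described above can be illustrated with a minimal sketch. It assumes that the top values are ranked by their size in the subcorpus; the function name, the input format (a mapping from value to count), and the sample numbers are hypothetical and are not taken from the service itself.

```python
from collections import Counter


def top_values_with_other(sub_counts, corpus_counts, n=10):
    """Keep the n largest subcorpus values and merge the rest into 'Other'.

    Assumption: ranking is done by the count in the subcorpus, and the
    same set of values is then looked up in the corpus.
    """
    top = [value for value, _ in Counter(sub_counts).most_common(n)]

    def collapse(counts):
        # Counts for the selected top values (0 if a value is absent).
        grouped = {value: counts.get(value, 0) for value in top}
        # Everything outside the top values goes into a single bucket.
        other = sum(c for v, c in counts.items() if v not in top)
        if other:
            grouped["Other"] = other
        return grouped

    return collapse(sub_counts), collapse(corpus_counts)


# Invented sample data: number of texts per value of some meta-attribute.
subcorpus = {"fiction": 50, "news": 30, "letters": 15, "diaries": 5}
corpus = {"fiction": 400, "news": 900, "letters": 100,
          "diaries": 50, "science": 550}

sub_top, corpus_top = top_values_with_other(subcorpus, corpus, n=2)
# sub_top    == {"fiction": 50, "news": 30, "Other": 20}
# corpus_top == {"fiction": 400, "news": 900, "Other": 700}
```

With n=10 the same function would reproduce the "top ten plus Other" layout; n=2 is used here only to keep the sample small.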

The bidirectional bar chart shows the differences between the subcorpus and the corpus. The right side, highlighted in green, shows how much larger the share of a meta-attribute value is in the subcorpus. The left side, highlighted in red, shows how much larger the share of a meta-attribute value is in the corpus.
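As a rough illustration of what such a chart compares, the sketch below computes the difference between the share of each value in the subcorpus and its share in the corpus, in percentage points. This is an assumption about how the differences could be computed, not a description of the service's implementation; the function name and the sample counts are hypothetical.

```python
def share_differences(sub_counts, corpus_counts):
    """Return subcorpus share minus corpus share, in percentage points.

    A positive result means the value is over-represented in the
    subcorpus (green, right side); a negative result means it is
    over-represented in the corpus (red, left side).
    """
    sub_total = sum(sub_counts.values())
    corpus_total = sum(corpus_counts.values())
    if sub_total == 0 or corpus_total == 0:
        raise ValueError("both the subcorpus and the corpus must be non-empty")

    differences = {}
    for value in sorted(set(sub_counts) | set(corpus_counts)):
        sub_share = sub_counts.get(value, 0) / sub_total
        corpus_share = corpus_counts.get(value, 0) / corpus_total
        differences[value] = (sub_share - corpus_share) * 100
    return differences


# Invented sample data: number of texts per value.
subcorpus = {"fiction": 60, "news": 40}
corpus = {"fiction": 300, "news": 700}

for value, diff in share_differences(subcorpus, corpus).items():
    side = "subcorpus (green)" if diff > 0 else "corpus (red)"
    print(f"{value}: {abs(diff):.1f} p.p. larger in the {side}")
```

In this invented sample, fiction makes up 60% of the subcorpus but 30% of the corpus, so its bar would extend to the right; news would extend to the left by the same amount.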

On the bar chart, the shares of each meta-attribute value in the size of the subcorpus and corpus are shown side by side. When you hover the mouse over a column of the chart, you can see the name of the value and the proportion and number of texts or words corresponding to it in the subcorpus and corpus.

You can download an Excel/CSV file with the source data used to build the charts, as well as the charts as image files.

Diachronic statistics

In the Diachronic statistics section, you can set the distribution, dates, and smoothing of frequencies. The specified parameters are applied to all graphs on the page.

Diachronic distribution of the subcorpus size

Using the graph, you can compare how the subcorpus texts and the corpus texts are distributed over time. When you hover the mouse over the graph, you can see the number of texts or words in the corpus and the subcorpus with and without smoothing.
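This page does not say which smoothing method is applied. Purely as an illustration of what a smoothed value next to a raw value may mean, the sketch below uses a centered moving average over neighboring years; the method, the function name, the radius parameter, and the sample counts are all assumptions.

```python
def moving_average(values, radius=1):
    """Centered moving average over a list of yearly counts.

    Assumption: each point is replaced by the mean of itself and up to
    `radius` neighbors on each side; the window shrinks at the edges.
    """
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - radius): i + radius + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


# Invented yearly text counts for three consecutive years.
raw = [3, 6, 9]
print(moving_average(raw, radius=1))  # [4.5, 6.0, 7.5]
print(moving_average(raw, radius=0))  # [3.0, 6.0, 9.0], i.e. no smoothing
```

A larger radius produces a flatter curve, which makes long-term trends easier to see at the cost of hiding year-to-year variation.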

Beneath the graph you will find warming stripes, illustrating the number of texts in the corpus and the given subcorpus.

With the help of the windows for displaying dates and frequencies on the graphs, you can zoom in or out on certain sections of the graph, as well as navigate through the values on the axes.

 

Distribution of texts

To compare the texts in the subcorpus and the corpus, two diagrams are shown. You can select the meta-attribute to build the diagrams for from the list of the most significant attributes of the corpus, as well as the preferred unit of measurement (texts or words). When you switch the meta-attribute and/or unit of measurement, the diagrams are redrawn.

To make comparison easier, the distribution is calculated for the top 10 values of the selected meta-attribute in the subcorpus and for their corresponding values in the corpus. The remaining values are merged into the Other category.

When you hover the mouse over the shaded area of the diagram, you can see the name of the value and the corresponding proportion and number of texts or words in the subcorpus and corpus. Using the "window" to display dates, which is common to both charts, you can also adjust the time period for a more in-depth analysis.

 

Regions

The graph shows the distribution of the texts in the corpus and the given subcorpus by region, in the selected unit of measurement (texts or words). When you switch the meta-attribute and/or unit of measurement, the graph is redrawn.

To make comparison easier, the distribution is calculated for the top 10 values of the selected meta-attribute in the subcorpus and for their corresponding values in the corpus. The remaining values are merged into the Other category.

When you hover the mouse over the shaded area, you can see the name of the region and the corresponding number and share of texts or words in the corpus and the subcorpus. Using the "window" to display dates, which is common to both charts, you can also adjust the time period for a more in-depth analysis.

Updated on 01.07.2024