Dialect Corpus
  • 3,067 texts
  • 790,310 words
multimedia, spoken, disambiguated

The Dialect Corpus has been in development since 2005. It includes recordings of dialect speech from various regions of Russia and from neighbouring countries such as Belarus and Azerbaijan, representing both early-settlement dialects (North, Center, West, South) and later-settlement dialects (Volga region, Caucasus, Urals, Siberia, Far East). The corpus features spontaneous speech and personal narratives, as well as folklore texts in both prose and verse. About one third of the texts are accompanied by audio and video recordings that cover the entire text, not just the excerpt shown in the search results.

The Dialect Corpus is expanded mainly from two sources: previously published regional dialect readers, which were typically issued in small print runs as dialectology teaching aids for students at local universities, and fieldwork materials from dialectological expeditions submitted to the Corpus. Transcriptions of field materials are accepted in phonetic transcription, in a phonologized (orthography-based) format, and even in near-standard orthography, preferably with stress marks and with dialect-specific grammatical features preserved. Submissions accompanied by audio recordings are encouraged.

Annotation and tools

The morphological, syntactic, and lexical features of the texts are fully preserved. Some texts are transcribed in phonologized form with stress marks; others are presented in near-standard orthography. In either case, each word form is also annotated with its normalized version. The corpus includes special annotations for features of dialectal morphology, including phenomena absent from the standard language, such as non-standard gender usage.

Dialect-specific vocabulary is accompanied by definitions. For many lexemes, related words are indicated, linked either by inflectional relationships or by synonymy.

The corpus includes extensive metadata, covering:

  • Phonetic features found in each text (vowel and consonant systems)
  • Date and genre
  • Time and place of the events described
  • Sociological data about the informant
  • Administrative location of the place where the recording was made
  • Dialectologist who provided the text
  • Previous publications of the material

Users can filter subcorpora by many of these parameters, including recording availability and orthographic type.
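
As a rough illustration of filtering by such parameters, the sketch below models a few metadata fields as a Python dataclass and selects a subcorpus from them. The field names (`region`, `genre`, `has_audio`, `orthography`), the `filter_subcorpus` helper, and the sample records are all hypothetical; they do not reflect the corpus's actual schema or interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextMeta:
    """Hypothetical metadata record; field names are illustrative only."""
    text_id: str
    region: str
    genre: str
    has_audio: bool
    orthography: str  # "phonologized" or "near-standard"


def filter_subcorpus(texts, **criteria):
    """Return the records whose fields equal every given criterion."""
    return [
        t for t in texts
        if all(getattr(t, field) == value for field, value in criteria.items())
    ]


# Invented sample records, not real corpus data.
SAMPLE = [
    TextMeta("t1", "North", "narrative", True, "phonologized"),
    TextMeta("t2", "Siberia", "folklore", False, "near-standard"),
    TextMeta("t3", "North", "folklore", True, "near-standard"),
]

# Select texts from one region that come with a recording.
with_audio_north = filter_subcorpus(SAMPLE, region="North", has_audio=True)
print([t.text_id for t in with_audio_north])
```

Passing criteria as keyword arguments keeps the sketch short: any combination of fields can be used without writing a separate function per parameter.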

For some texts prepared before 2008, the metatextual annotation is less detailed at all levels and, in particular, does not include phonetic data.

The corpus provides a full set of search and visualization tools, including:

  • Regular expression search (for both lemmas and word forms)
  • Charts
  • Frequency data
  • Statistics based on key metadata (including division into okanye and akanye dialects)
  • N-grams
  • Frequency dictionaries
  • Paradigms for nouns (available in the Word at a glance tool)
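
To make a few of these tools concrete, here is a minimal sketch of regular expression search over word forms and lemmas, a lemma frequency count, and n-gram counting. It assumes tokens are available as (word form, lemma) pairs; the `regex_search` and `ngram_counts` helpers and the English toy data are hypothetical and are not the corpus's own implementation or data.

```python
import re
from collections import Counter

# Toy (word_form, lemma) pairs in English; not real corpus data.
TOKENS = [
    ("the", "the"), ("cats", "cat"), ("ran", "run"),
    ("the", "the"), ("cat", "cat"), ("runs", "run"),
]

FIELD_INDEX = {"form": 0, "lemma": 1}


def regex_search(tokens, pattern, field="form"):
    """Return tokens whose word form or lemma fully matches the pattern."""
    index = FIELD_INDEX[field]
    compiled = re.compile(pattern)
    return [tok for tok in tokens if compiled.fullmatch(tok[index])]


def ngram_counts(tokens, n=2):
    """Count n-grams over word forms."""
    forms = [form for form, _ in tokens]
    return Counter(tuple(forms[i:i + n]) for i in range(len(forms) - n + 1))


# Frequency dictionary keyed by lemma.
lemma_freq = Counter(lemma for _, lemma in TOKENS)

print(regex_search(TOKENS, r"cats?"))          # search by word form
print(regex_search(TOKENS, r"r.n", "lemma"))   # search by lemma
print(lemma_freq.most_common())
print(ngram_counts(TOKENS, n=2).most_common())
```

Searching by lemma retrieves every inflected form of a word at once, while searching by word form targets one specific spelling.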

Work is underway to integrate the corpus’s geographic database of recorded locations with the Digital Dialectological Atlas of the Russian Language.

Publications

A list of academic publications on the Dialect Corpus is available at https://ruscorpora.ru/s/e0pB3. In the Publications section, use filters to find other types of publications about the corpus.

Updated on 23.02.2026