Corpora: Multimedia
Multimedia corpus (MURCO)

The Multimedia Russian Corpus (MURCO) is designed for the study of oral speech across various genres. Originally conceived as a corpus of movie dialogues, MURCO was launched in a pilot version in 2009-2010. The corpus has since expanded to include samples of oral speech from diverse genres, and it now contains nearly 5.5 million words. MURCO includes the following sections (subcorpora):

1. Movie dialogues: Soviet and Russian films spanning the years 1930 to the 2000s. 

2. Public oral speech: academic discourse (talks and discussions at conferences, educational and popular lectures, TV and radio broadcasts), political speech (interviews, press conferences, speeches at rallies, meetings and congresses, TV and radio talk shows), journalism (interviews and conversations on various topics, documentaries, etc.), advertising (commercial videos).

3. Oral non-public speech: everyday communication, dialogues and micro-dialogues, conversations with friends and family, telephone conversations, and more.

4. Theatrical speech: audio and video recordings of theatrical performances on stage and on the radio.

5. Writer's reading and artistic reading: written-to-be-spoken speech, that is, recordings of prose performed by the author or by an elocutionist. These recordings are of interest for their phonetic features, accents, and text interpretation.

The Multimedia corpus texts are presented as audio and video files segmented into small clips lasting between 10 and 30 seconds, each corresponding to a fragment of the text transcript. Each clip-text pair, known as a "clixt" (a term coined by Elena Grishina), typically represents a relatively complete communicative fragment.
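The clip-text pairing described above can be pictured as a simple record. The following is only a minimal sketch, not the corpus's actual storage format: the class name `Clixt`, its field names, and the sample file name are all hypothetical, and only the 10 to 30 second range comes from the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clixt:
    """Hypothetical model of one clip-text pair; all field names are illustrative."""
    media_file: str    # identifier of the audio or video clip (assumed)
    start_sec: float   # clip start within the source recording (assumed)
    end_sec: float     # clip end within the source recording (assumed)
    transcript: str    # transcript fragment aligned with the clip

    @property
    def duration(self) -> float:
        # Length of the clip in seconds.
        return self.end_sec - self.start_sec

    def has_typical_length(self, lo: float = 10.0, hi: float = 30.0) -> bool:
        # The 10 to 30 second range is the one stated in the corpus description.
        return lo <= self.duration <= hi


# Invented sample data, not taken from the corpus.
example = Clixt("film_001.mp4", 125.0, 143.5, "Привет! Как дела?")
```

Keeping the transcript fragment and the clip boundaries in one immutable record mirrors the idea that a clixt is a single, relatively complete communicative unit.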

Every text fragment is annotated in accordance with MURCO standards. These annotations cover metatextual, morphological, semantic, accentological and sociological aspects, all of which are searchable online. In addition, it is possible to search for words with specific phonetic or syllable structures.

MURCO also includes a deeply annotated section, where various types of speech actions and gestures are marked. The annotation was made by Elena Grishina. This part currently includes six films. The speech actions annotation allows selecting utterances with specific semantics (questions, imperatives, modal statements, etiquette statements), types of speech underlining (parceling, chanting), interjections, vocal gestures, and repetitions. Gesture marking enables the selection of gestures based on their subjective characteristics (type and meaning) and objective attributes (active or passive organ, orientation in space, direction of movement, etc.). By specifying the desired characteristics, you can see the clips where the corresponding speech actions and gestures occur.
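The kind of selection described above, choosing clips by speech action and by gesture attributes, can be illustrated with a small filter. This is a sketch under stated assumptions and does not reflect the corpus's real search interface or tag set: `AnnotatedClip`, `select_clips`, and the attribute names and values (`active_organ`, `direction`, `"question"`, and so on) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AnnotatedClip:
    """Hypothetical clip with speech-action and gesture annotations."""
    clip_id: str
    speech_actions: set = field(default_factory=set)   # e.g. {"question"}
    gestures: list = field(default_factory=list)       # list of attribute dicts


def select_clips(clips, speech_action=None, **gesture_attrs):
    """Return clips that contain the given speech action (if any) and at
    least one gesture matching every requested attribute (if any)."""
    selected = []
    for clip in clips:
        if speech_action is not None and speech_action not in clip.speech_actions:
            continue
        if gesture_attrs and not any(
            all(g.get(key) == value for key, value in gesture_attrs.items())
            for g in clip.gestures
        ):
            continue
        selected.append(clip)
    return selected


# Invented sample annotations, not taken from the corpus.
clips = [
    AnnotatedClip("c1", {"question"}, [{"active_organ": "hand", "direction": "up"}]),
    AnnotatedClip("c2", {"imperative"}, [{"active_organ": "head", "direction": "down"}]),
    AnnotatedClip("c3", {"question"}, [{"active_organ": "head", "direction": "up"}]),
]

questions = select_clips(clips, speech_action="question")
head_up = select_clips(clips, active_organ="head", direction="up")
```

Here a gesture matches only when all requested attributes agree on the same gesture, which corresponds to narrowing a search by specifying several characteristics at once.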

Publications

Check out the list of scientific publications on the Multimedia corpus via the link: https://ruscorpora.ru/s/dBp2W. To find other types of publications related to the corpus, use the filters in the "Publications" section.

Updated on 22.07.2024