Corpus portrait

Corpora

31 texts
252,871 words

parallel, multimedia, spoken

Corpora: MultiPARC | English-Russian

MultiPARC

The Multimedia Parallel Corpus (MultiPARC) combines the features of a multimedia corpus and a parallel corpus and is designed for comparative studies. It consists of two independent zones that differ both in the nature of the material and in the way it is organized.

The Russian MultiPARC makes it possible to compare film, television, radio, and theater productions of the same play. Currently, it includes Revizor (The Government Inspector) by Nikolai Gogol in nine productions, and three plays by Anton Chekhov: Vishnevyj sad (The Cherry Orchard) in four productions, Djadja Vanja (Uncle Vanya) in five productions, and Tri sestry (Three Sisters) in four productions. This allows a comparative study of the same line delivered by different speakers in the same circumstances. Such studies help establish the limits of variation in spoken language and its gestural accompaniment, depending on factors such as the performer's personality, the time and style of the production, and the director's intentions.

Preparing the corpus is a complex process that resembles the preparation of a multilingual parallel corpus of written translations of a single text. The published text of the play serves as the “anchor” text, against which all performances are compared. The play text is divided into fragments, and the audio or video recording of each production is cut into matching fragments. Each audio or video fragment is then aligned with its written transcript. Search results are presented as clusters: each cluster contains the context from the printed text of the play that includes the requested element, together with the corresponding fragments from all productions and their videos.
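The cluster layout described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the corpus's actual software: the class names, field names, the `build_clusters` function, the simple substring match, and all sample data are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AnchorFragment:
    """One fragment of the published play text (the "anchor")."""
    fragment_id: int
    text: str


@dataclass(frozen=True)
class ProductionFragment:
    """A transcribed audio/video fragment aligned to an anchor fragment."""
    fragment_id: int   # id of the anchor fragment it corresponds to
    production: str    # hypothetical production label
    transcript: str
    media_ref: str     # hypothetical path to the audio/video fragment


def build_clusters(anchor, performed, query):
    """Return one cluster per anchor fragment whose text contains `query`.

    A cluster is (anchor fragment, list of aligned production fragments).
    """
    by_fragment = defaultdict(list)
    for item in performed:
        by_fragment[item.fragment_id].append(item)
    needle = query.lower()
    return [
        (fragment, by_fragment.get(fragment.fragment_id, []))
        for fragment in anchor
        if needle in fragment.text.lower()
    ]


# Invented placeholder data, not taken from the corpus.
anchor = [
    AnchorFragment(1, "An inspector is coming to visit us."),
    AnchorFragment(2, "What do you mean, an inspector?"),
    AnchorFragment(3, "Here is the letter I received."),
]
performed = [
    ProductionFragment(1, "production_A", "An inspector is coming to us.", "a/001.mp4"),
    ProductionFragment(1, "production_B", "An inspector is coming to visit us!", "b/001.mp4"),
    ProductionFragment(3, "production_A", "Here is the letter.", "a/003.mp4"),
]

clusters = build_clusters(anchor, performed, "inspector")
for fragment, versions in clusters:
    print(fragment.fragment_id, [v.production for v in versions])
```

In this sketch the search runs over the anchor text only, so a production that omits or rewords a line still appears (or is visibly absent) in the cluster for that line. Whether the real search also matches against the production transcripts is not stated in the description above.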

The English-Russian MultiPARC includes fragments of TV series and films in English with Russian voice-over translation or dubbing, as well as various productions of plays in both Russian and English. This makes it possible to compare and study the speech behavior of people from different cultures who speak different languages but find themselves in similar situations.

Each film, both the original and the translation, is cut into small fragments (clips). The English and Russian transcripts of these fragments are also divided into corresponding fragments. Subsequently, two clips (English and Russian) and two transcripts (English and Russian) are aligned with each other. The numbering of clips and text fragments is consistent between the English and Russian versions.
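Because the numbering is shared between the two language versions, pairing the English and Russian material reduces to a lookup by number. The following Python sketch illustrates that idea only; the `Segment` class, the `pair_by_number` function, the file paths, and the sample lines are all hypothetical and do not reflect the corpus's real data format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """A numbered clip together with its transcript fragment."""
    number: int
    clip: str        # hypothetical path to the video clip
    transcript: str


def pair_by_number(english, russian):
    """Pair English and Russian segments that share the same number.

    Returns (pairs, unmatched), where `unmatched` lists the numbers of
    English segments that have no Russian counterpart.
    """
    russian_by_number = {seg.number: seg for seg in russian}
    pairs, unmatched = [], []
    for seg in sorted(english, key=lambda s: s.number):
        partner = russian_by_number.get(seg.number)
        if partner is None:
            unmatched.append(seg.number)
        else:
            pairs.append((seg, partner))
    return pairs, unmatched


# Invented placeholder data, not taken from the corpus.
english = [
    Segment(2, "en/002.mp4", "Where are you going?"),
    Segment(1, "en/001.mp4", "Good morning."),
    Segment(3, "en/003.mp4", "See you later."),
]
russian = [
    Segment(1, "ru/001.mp4", "Dobroe utro."),
    Segment(2, "ru/002.mp4", "Kuda ty idjosh?"),
]

pairs, unmatched = pair_by_number(english, russian)
for en, ru in pairs:
    print(en.number, en.clip, ru.clip)
print("unmatched:", unmatched)
```

Reporting unmatched numbers separately, rather than silently dropping them, is a design choice of this sketch; it makes gaps in the alignment easy to spot during preparation.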

Each text fragment is annotated in accordance with the standards of MURCO and the parallel corpus of the RNC and carries several levels of annotation: metatextual, morphological (in both the original and the translation), semantic (in the Russian translation), accentological (in the Russian translation), and sociological (information about the original performer and the dubbing performer). In response to a user query, two pairs of clixts are returned (in English and Russian), in which the video and text series are aligned with each other. This presentation of the material supports comparative studies of intonation and phonetics, vocabulary and semantics, phraseology, and syntax, as well as the analysis of gesticulation in English-language discourse and comparative gestural studies that set the results against the MURCO data. The corpus also provides examples of a special type of speech activity in Russian: the translation of audiovisual texts, which is regarded as an independent type of translation activity.

Publications

A list of scientific publications on the Multimedia Parallel Corpus is available at https://ruscorpora.ru/s/dPl92. To find other types of publications related to the corpus, use the filters in the "Publications" section.

Corpus creation

The Multimedia Parallel Corpus was created under the leadership of Elena Grishina, with support from the Russian Foundation for Basic Research (Grant No. 14-06-00245) and from the Corpus Linguistics and Language and Information Technologies programs of the Russian Academy of Sciences. Subsequent development of the corpus was funded by the program of fundamental scientific research of the Presidium of the Russian Academy of Sciences titled “Monuments of Material and Spiritual Culture in the Modern Information Environment” (2018-2020).

Elena Grishina played a pivotal role in conceiving the corpus, determining the principles of data selection, designing the annotation system, and developing the technology for database preparation. The work was coordinated and the annotation edited by Elena Grishina and Svetlana Savchuk (since 2016). Elena Grishina, Anna Kursakova, Alexandra Makhova, Svetlana Savchuk, and Anna Sosedova participated in the preparation of text and multimedia materials. At different stages of the project, Lev Alekseevsky, Dmitry Vylegzhanin, Alexey Zobnin, Viktor Sizov, and Igor Shalyminov contributed to the creation and enhancement of the corpus software, including the search system and various types of annotation.

Updated on 02.05.2024