Syntactic Corpus

This fragment of the Russian National Corpus, also known as SynTagRus (Syntactically Tagged Russian Text corpus), was developed by the Laboratory of Computational Linguistics, A. A. Kharkevich Institute of Information Transmission Problems, Russian Academy of Sciences.

The Syntactic Corpus is composed of texts of two major types:

  • Popular science, sociopolitical and information papers published in magazines and on the Internet (from 1980 to the present);
  • Russian fiction of the second half of the 20th century and the early 21st century.

The corpus is annotated semi-automatically. First, each text is processed by the morphological analyzer and the syntactic parser of ETAP, the multipurpose linguistic processor created by the Laboratory of Computational Linguistics. As a result, every sentence of the text receives a morphological and a syntactic structure. These structures are then checked and, where necessary, corrected by expert linguists.

The Syntactic corpus contains texts with full morphosyntactic annotation. Specifically, in addition to the morphological information assigned to every word of the text, every sentence is supplied with a syntactic structure in the form of a dependency tree. The nodes of the tree are the words of the sentence, and its edges are labeled with the names of syntactic relations.
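The dependency-tree representation described above can be sketched as a small data structure. This is only an illustration, not the corpus's storage format: the `Node` class, the helper functions, the feature strings, and the relation labels `predicative` and `completive` are assumptions made for the example and are not taken from the SynTagRus documentation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """One word of a sentence (hypothetical layout, for illustration only)."""
    index: int               # 1-based position of the word in the sentence
    form: str                # word form as it appears in the text
    lemma: str               # dictionary form
    features: str            # morphological information (illustrative tags)
    head: Optional[int]      # index of the governing word, None for the root
    relation: Optional[str]  # label of the edge coming from the head


def is_tree(nodes):
    """Return True if there is exactly one root and no cycles."""
    by_index = {n.index: n for n in nodes}
    if sum(1 for n in nodes if n.head is None) != 1:
        return False
    for n in nodes:
        seen = set()
        cur = n
        # Walk up towards the root; revisiting a node means a cycle.
        while cur.head is not None:
            if cur.index in seen or cur.head not in by_index:
                return False
            seen.add(cur.index)
            cur = by_index[cur.head]
    return True


def edges(nodes):
    """List the labeled edges as (head form, relation, dependent form)."""
    by_index = {n.index: n for n in nodes}
    return [(by_index[n.head].form, n.relation, n.form)
            for n in nodes if n.head is not None]


# "Мама мыла раму" ('Mom washed the frame'); labels are placeholders.
sentence = [
    Node(1, "Мама", "мама", "S NOM SG FEM", head=2, relation="predicative"),
    Node(2, "мыла", "мыть", "V PAST SG FEM", head=None, relation=None),
    Node(3, "раму", "рама", "S ACC SG FEM", head=2, relation="completive"),
]

for head, relation, dependent in edges(sentence):
    print(f"{head} --{relation}--> {dependent}")
```

Storing the head index and the relation label on each dependent keeps one record per word, which matches the description of words as nodes and labeled relations as edges.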

This way of representing syntactic structure originates from the “Meaning ⇔ Text” linguistic model of Igor A. Mel’čuk and Alexander K. Zholkovsky. However, the complete repertory of syntactic relations, as well as other specific linguistic decisions on how to represent the syntax of Russian sentences, was developed by the Laboratory of Computational Linguistics, which created SynTagRus.

The list of morphological features used in the Syntactic corpus differs somewhat from the main morphological standard of the RNC and is explained in the final part of the section on the morphological marking of the RNC (in Russian).

The repertory of syntactic relations of SynTagRus, supplied with short comments, is given in the section called "Syntactic Annotation" (in Russian).

Unlike the majority of RNC subcorpora, SynTagRus contains only fully disambiguated annotations. This means that every word of a corpus sentence is supplied with a unique morphological structure, and every sentence is matched by a unique syntactic structure.
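As a rough illustration of what full disambiguation means at the word level, the sketch below contrasts a word that still carries several candidate analyses with the same word after one analysis has been selected. The data layout, the function names, and the tag strings are hypothetical and do not reproduce the corpus's actual tagset.

```python
def ambiguous_words(sentence):
    """Return the words that do not have exactly one analysis."""
    return [word for word, analyses in sentence if len(analyses) != 1]


def is_fully_disambiguated(sentence):
    """True if every word carries exactly one morphological analysis."""
    return not ambiguous_words(sentence)


# "Стекло разбилось" ('The glass broke'). Out of context, "стекло" can be
# a noun ('glass') or a past-tense verb form ('flowed down').
before = [
    ("Стекло", ["S NOM SG NEUT", "V PAST SG NEUT"]),
    ("разбилось", ["V PAST SG NEUT"]),
]
after = [
    ("Стекло", ["S NOM SG NEUT"]),
    ("разбилось", ["V PAST SG NEUT"]),
]

print(ambiguous_words(before))
print(is_fully_disambiguated(after))
```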

In addition to morphosyntactic annotation, the Syntactic corpus offers several further types of annotation.

Lexical Functional Annotation

The corpus provides data on the lexical functions that occur in its texts. The apparatus of lexical functions was also proposed by the authors of the “Meaning ⇔ Text” model. The Syntactic corpus covers lexical functions of the collocate type, which represent idiomatic and semi-idiomatic expressions whose parts are connected by semantic links of various kinds. The Syntactic corpus records over 100 lexical functions, which are used in more than 20,000 phrases. An outline of the lexical functions reflected in the Corpus and a short description of each function can be found in the "Lexical Functional Annotation" section (in Russian).

Lexical Semantic Annotation

Lexical Semantic Annotation indicates, for every polysemous word in the Corpus, the individual lexical meaning of this word as presented in the dictionary of the multipurpose linguistic processor ETAP. When viewing the search results, the interpretation of a word's lexical meaning is given in the word's record card.

Elliptical Annotation

Elliptical Annotation restores, for simple types of linguistic ellipsis, the omitted words of a sentence and incorporates them into the syntactic structure of this sentence. If omitted words are queried during the search, they are displayed in their dictionary form. For instance, the sentence «Яду мне, яду» from the novel "The Master and Margarita" by Mikhail Bulgakov is displayed as follows:

«Яду [давать] мне, яду!»

The outline of the elliptical annotation reflected in the Corpus can be found under the "Syntactic Ellipsis Representation" subsection of the syntactic annotation description (in Russian).
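The display convention from the example above, in which a restored word appears in square brackets in its dictionary form, can be sketched as follows. The `Token` class and the `render` function are hypothetical names introduced for this illustration, and the lemmas are supplied by hand; the sketch says nothing about how the corpus interface is actually implemented.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """A word of the sentence (hypothetical layout, for illustration only)."""
    form: str               # surface form; empty for a restored word
    lemma: str              # dictionary form
    restored: bool = False  # True if the word was omitted in the text
    trailing: str = ""      # punctuation that follows the word


def render(tokens):
    """Show restored words as [lemma] and all other words as they appear."""
    parts = []
    for tok in tokens:
        text = f"[{tok.lemma}]" if tok.restored else tok.form
        parts.append(text + tok.trailing)
    return " ".join(parts)


tokens = [
    Token("Яду", "яд"),
    Token("", "давать", restored=True),
    Token("мне", "я", trailing=","),
    Token("яду", "яд", trailing="!"),
]

print(render(tokens))  # Яду [давать] мне, яду!
```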

Microsyntactic Annotation

Microsyntactic Annotation identifies multiword idiomatic and semi-idiomatic expressions of various kinds that behave as semantic and/or syntactic units. Examples of such units are всё равно ‘all the same’, потому что ‘because, the reason being’, в соответствии с ‘in accordance with’, как раз ‘just now’, что толку ‘what’s the point’, etc. SynTagRus identifies over 3,200 microsyntactic units, which occur in the corpus more than 47,000 times. An outline of the microsyntactic annotation reflected in the Corpus can be found in the "Microsyntactic Annotation" section (in Russian).

Coreference Annotation

Coreference Annotation identifies the words of the text that are anaphorically and/or coreferentially linked. The outline of that type of annotation reflected in the Corpus can be found under the "Coreference Annotation" section (in Russian).

At present, coreference annotation is available for search only within a single sentence.

Temporal Annotation

Temporal Annotation identifies words and phrases with temporal semantics such as одновременно 'simultaneously', вечером 'in the evening', 23 мая 'on May 23', в полночь 'at midnight', с детства 'since childhood', and reflects their contribution to the formation of the sentence's meaning. The outline of the temporal annotation reflected in the Corpus can be found under the "Temporal Annotation" section (in Russian).

At present, temporal annotation is available for search only within a single sentence.

Publications

Check out the list of scientific publications on the Syntactic corpus via the link: https://ruscorpora.ru/s/e5xrA. To find other types of publications related to the corpus, use the filters in the "Publications" section.

Publications describing individual annotation types are mentioned in the corresponding sections.

Updated on 15.08.2024