Corpus portrait

1,392 texts
111,213 sentences
1,595,659 words

syntactically tagged, disambiguated

Syntactic Corpus

This fragment of the Russian National Corpus, also known as SynTagRus (Syntactically Tagged Russian Text corpus), was developed by the Laboratory of Computational Linguistics, A. A. Kharkevich Institute of Information Transmission Problems, Russian Academy of Sciences.

The Syntactic Corpus is composed of texts of two major types:

Popular science, sociopolitical, and informational articles published in magazines and on the Internet (from 1980 to the present);
Russian fiction of the second half of the 20th century and the early 21st century.

The corpus is annotated semi-automatically. First, each text is processed by the morphological analyzer and the syntactic parser of ETAP, the multipurpose linguistic processor created by the Laboratory of Computational Linguistics. As a result, every sentence of the text receives a morphological and a syntactic structure. Expert linguists then check these structures and correct them where necessary.

The Syntactic Corpus contains texts with full morphosyntactic annotation. Specifically, in addition to the morphological information assigned to every word of the text, every sentence is supplied with a syntactic structure represented as a dependency tree. The nodes of such a tree are the words of the sentence, and its edges are labeled with the names of syntactic relations.
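The dependency-tree representation described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Node` class, the helper functions, the feature strings, and the relation labels `predicative` and `1-completive` are placeholders, not the corpus's actual storage format or tag inventory.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Node:
    """One word of a sentence, viewed as a node of a dependency tree."""
    index: int                # 1-based position of the word in the sentence
    form: str                 # word as it appears in the text
    lemma: str                # dictionary form
    features: str             # morphological information (illustrative tags)
    head: Optional[int]       # index of the governing word; None for the root
    relation: Optional[str]   # label of the edge from the head to this word


def root(tree: List[Node]) -> Node:
    """Return the single node that has no head."""
    roots = [node for node in tree if node.head is None]
    if len(roots) != 1:
        raise ValueError("a dependency tree must have exactly one root")
    return roots[0]


def children(tree: List[Node], head_index: int) -> List[Node]:
    """Return the words directly governed by the word at head_index."""
    return [node for node in tree if node.head == head_index]


# Hypothetical annotation of the sentence "Мама мыла раму".
sentence = [
    Node(1, "Мама", "мама", "NOUN sg nom", 2, "predicative"),
    Node(2, "мыла", "мыть", "VERB sg past", None, None),
    Node(3, "раму", "рама", "NOUN sg acc", 2, "1-completive"),
]

for dependent in children(sentence, root(sentence).index):
    print(f"{root(sentence).form} --{dependent.relation}--> {dependent.form}")
```

Storing the head index and the relation label on the dependent word keeps each word's morphological and syntactic information in one record, which is a common way to serialize dependency trees.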

This way of representing syntactic structure originates from the “Meaning ⇔ Text” linguistic model by Igor A. Mel’čuk and Alexander K. Zholkovsky. However, the complete repertory of syntactic relations, as well as other specific linguistic decisions on how to represent the syntax of Russian sentences, was developed by the Laboratory of Computational Linguistics, which created SynTagRus.

The list of morphological features used in the Syntactic Corpus differs somewhat from the main morphological standard of the RNC and is explained in the final part of the section on the morphological annotation of the RNC (in Russian).

The repertory of syntactic relations of SynTagRus, supplied with short comments, is given in the section called "Syntactic Annotation" (in Russian).

Unlike the majority of RNC subcorpora, SynTagRus contains only fully disambiguated annotations. This means that every word of a corpus sentence is assigned exactly one morphological structure, and every sentence is assigned exactly one syntactic structure.
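The difference between ambiguous and fully disambiguated annotation can be sketched as a simple check. This is a hypothetical illustration only: the function name `ambiguous_positions`, the data layout, and the analysis strings are assumptions, not part of the corpus tools.

```python
from typing import List, Tuple

# A sentence as a list of (word form, candidate morphological analyses).
AnnotatedSentence = List[Tuple[str, List[str]]]


def ambiguous_positions(sentence: AnnotatedSentence) -> List[int]:
    """Return 1-based positions of words that lack exactly one analysis."""
    return [
        position
        for position, (_, analyses) in enumerate(sentence, start=1)
        if len(analyses) != 1
    ]


# Before disambiguation: "стекло" may be a noun or a past-tense verb form.
raw = [
    ("стекло", ["NOUN: стекло", "VERB: стечь"]),
    ("разбилось", ["VERB: разбиться"]),
]

# After disambiguation: each word keeps a single analysis.
disambiguated = [
    ("стекло", ["NOUN: стекло"]),
    ("разбилось", ["VERB: разбиться"]),
]

print(ambiguous_positions(raw))            # [1]
print(ambiguous_positions(disambiguated))  # []
```

Under this sketch, a sentence counts as fully disambiguated when the returned list is empty.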

Beyond morphosyntactic annotation, the Syntactic Corpus offers several further types of annotation.

Updated on 26.11.2024