
Corpus of 1968 Slovenian literature Maj68 3.0

Source: hdl.handle.net (indexed 2025-01-09)
Download link:
http://hdl.handle.net/11356/1970
Resource description:
The Maj68 corpus contains 1,521 texts (about a million words) by 198 known authors, published between 1964 and 1972 in the periodicals "Tribuna", "Problemi" and "Problemi. Literatura." The texts carry complete bibliographical data and are classified according to text type, language type, and the degree of presence of non-standard Slovenian, foreign languages, modernism, and visual elements. Author metadata include gender and year of birth. The presence of visual elements is marked in the corpus; note that 48 texts consist only of visual elements, i.e. they contain no text. The corpus is available as facsimiles (PDFs), in TEI encoding, as plain text files accompanied by metadata files, as a linguistically annotated TEI corpus, and as derived vertical files with a registry file, for mounting on CWB-type concordancers. The TEI encoding follows the CLARIN.SI TEI customisation (https://github.com/clarinsi/TEI-schema). The automatic linguistic annotation includes lemmas, MULTEXT-East morphosyntactic descriptions, and Universal Dependencies morphological features and syntactic annotation, and was performed by the CLASSLA-Stanza pipeline (https://github.com/clarinsi/classla), using the models for standard Slovenian. Unlike the previous version, this corpus also includes manually assigned linguistic categories, in particular for names and for spans of text not written in standard Slovenian. Both of these categories are further subdivided, and the typology is given in the PDF file accompanying this entry.
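As a rough illustration of what working with the derived vertical files might look like, the sketch below reads CWB-style vertical text, where each token sits on its own line with tab-separated attributes and structural markup occupies separate lines. The column order (word, lemma, MSD), the element names, the `id` value, and the sample sentence are all assumptions made for illustration; the actual attribute layout is defined by the registry file shipped with the corpus and should be checked there.

```python
from typing import Dict, Iterable, Iterator, Sequence


def read_vertical(
    lines: Iterable[str],
    columns: Sequence[str] = ("word", "lemma", "msd"),  # assumed column order
) -> Iterator[Dict[str, str]]:
    """Yield one dict per token from CWB-style vertical text."""
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # ignore empty lines
        # Structural markup (<text ...>, <s>, </s>, ...) takes a whole line.
        if line.startswith("<") and line.endswith(">"):
            continue
        fields = line.split("\t")
        # Pair each field with its (assumed) attribute name.
        yield dict(zip(columns, fields))


# Hypothetical sample, not taken from the corpus.
sample = [
    '<text id="hypothetical.1">\n',
    "<s>\n",
    "Pesem\tpesem\tNcfsn\n",
    "je\tbiti\tVa-r3s-n\n",
    "kratka\tkratek\tAgpfsn\n",
    ".\t.\tZ\n",
    "</s>\n",
    "</text>\n",
]

tokens = list(read_vertical(sample))
lemmas = [t["lemma"] for t in tokens]
print(lemmas)
```

For a real file, the same function could be fed an open file handle instead of the `sample` list, with `columns` adjusted to match the positional attributes declared in the registry file.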

Provider:
hdl.handle.net