Corpus of Slovene linguistic scientific writing JezKor
Source: SSH Open Marketplace. Updated: 2023-10-17. Indexed: 2024-08-03.
Download link:
https://marketplace.sshopencloud.eu/dataset/qwtPOR
Resource description:
This corpus contains a collection of linguistic scientific writing in the Slovenian language. It consists of 43 monographs published between 2009 and 2022 by the Fran Ramovš Institute of the Slovenian Language and Založba ZRC, 267 papers published in the journal "Jezikoslovni zapiski", and 28 papers published in the journal "Slovenski jezik". Note that the texts were extracted directly from PDFs, so they contain various types of noise.
The corpus is linguistically annotated with the CLASSLA pipeline (https://github.com/clarinsi/classla) at the levels of lemmatisation, MULTEXT-East Version 6 morphosyntactic descriptions, Universal Dependencies part-of-speech tags and morphological features, and named entities. It is distributed in CoNLL-U and vertical file formats, one file per text. Text metadata consists of the author(s), title, and year of publication.
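As a rough illustration of how one of the CoNLL-U files might be read, here is a minimal Python sketch that relies only on the standard ten-column CoNLL-U layout (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). The function name `read_conllu` and the sample sentence are hypothetical and not taken from the corpus. The sketch also assumes that the MULTEXT-East descriptions sit in the XPOS column, which should be checked against the actual files. Lines with an unexpected number of columns are skipped, since the description warns about noise from PDF extraction.

```python
from typing import Dict, Iterable, Iterator, List

# Standard CoNLL-U column names, in file order.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def read_conllu(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield each sentence as a list of token dicts keyed by column name."""
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Comment lines (sentence ids, raw text) are ignored here.
            continue
        cols = line.split("\t")
        if len(cols) != len(CONLLU_FIELDS):
            # Skip malformed rows rather than failing on noisy input.
            continue
        sentence.append(dict(zip(CONLLU_FIELDS, cols)))
    if sentence:
        yield sentence


# Invented sample sentence; the tags are illustrative, not corpus output.
SAMPLE_ROWS = [
    ["1", "Korpus", "korpus", "NOUN", "Ncmsn", "_", "2", "nsubj", "_", "_"],
    ["2", "vsebuje", "vsebovati", "VERB", "Vmpr3s", "_", "0", "root", "_", "_"],
    ["3", "besedila", "besedilo", "NOUN", "Ncnpa", "_", "2", "obj", "_", "_"],
    ["4", ".", ".", "PUNCT", "Z", "_", "2", "punct", "_", "_"],
]
SAMPLE = "# text = Korpus vsebuje besedila.\n" + "\n".join(
    "\t".join(row) for row in SAMPLE_ROWS
) + "\n\n"

sentences = list(read_conllu(SAMPLE.splitlines()))
lemmas = [token["lemma"] for token in sentences[0]]
print(lemmas)
```

For a real file, the same function could be called on an open file handle, for example `read_conllu(open(path, encoding="utf-8"))`, where `path` points to one of the per-text CoNLL-U files.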
The corpus is available for download from the CLARIN.SI repository as well as for online browsing through the noSketch Engine and KonText concordancers.
Created:
2023-10-17



