
Corpus of longer narrative Slovenian prose KDSP 1.0

Source: SSH Open MarketPlace | Updated: 2023-10-17 | Indexed: 2024-08-03
Download link:
https://marketplace.sshopencloud.eu/dataset/wjoSAV
Description:
This corpus contains 262 texts of older, longer Slovenian narrative prose. The texts were published between 1836 and 1918 and are at least 20,000 words long. Each text has bibliographical metadata (author name, title, year of publication, length) and is classified by decade of publication, length, text type, text subtype, theme, and level of canonicity. Texts by authors who were included in school textbooks after 1980 and/or in the Collected writings of Slovenian poets and writers are marked with a high degree of canonicity. The author metadata cover gender, occupation, and years of birth and death. The corpus texts come from three digital sources, and each text is marked for its source: [Wikisource](https://sl.wikisource.org/wiki/) (145 texts), the [ELTeC corpus](https://github.com/COST-ELTeC/ELTeC-slv) (96 texts), and the [dLib digital library](https://www.dlib.si/) (21 texts). The corpus is provided in two variants, one containing running text and the other with added linguistic annotation. The annotation comprises tokens, sentences, lemmas, MULTEXT-East morphosyntactic descriptions, and Universal Dependencies morphological features, and was produced with the [CLASSLA program](https://github.com/clarinsi/classla). The source format of the corpus is TEI/XML, with two derived formats also available: plain text and vertical files, as used by concordancers such as the CWB. The corpus is available for download from CLARIN.SI and can be queried through the noSketchEngine and KonText concordancers.
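For readers who have not worked with vertical files, the following is a minimal Python sketch of how such a file is typically read: one token per line with tab-separated attributes, and XML-like tags on their own lines marking texts and sentences. The column order (word, lemma, MSD tag), the `<s>` sentence tag, the sample tokens and tags, and the helper name `read_vertical` are all assumptions made for illustration; the actual layout of the KDSP vertical files should be checked against the corpus documentation.

```python
from collections import Counter

# Hypothetical sample in vertical format. The column order (word, lemma,
# MSD tag) and the tag values are illustrative assumptions, not the
# documented layout of the KDSP 1.0 files.
SAMPLE = """<text id="demo">
<s>
Bil\tbiti\tVa-p-sm
je\tbiti\tVa-r3s-n
mlad\tmlad\tAgpmsnn
.\t.\tZ
</s>
<s>
Bil\tbiti\tVa-p-sm
je\tbiti\tVa-r3s-n
star\tstar\tAgpmsnn
.\t.\tZ
</s>
</text>
"""


def read_vertical(lines):
    """Yield each sentence as a list of (word, lemma, tag) tuples."""
    sentence = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        # Structural markup sits on its own line and carries no token.
        if line.startswith("<") and line.endswith(">"):
            if line == "</s>" and sentence:
                yield sentence
                sentence = []
            continue
        word, lemma, tag = line.split("\t")[:3]
        sentence.append((word, lemma, tag))
    # Emit a trailing sentence if the input lacked a closing </s>.
    if sentence:
        yield sentence


sentences = list(read_vertical(SAMPLE.splitlines()))
lemma_counts = Counter(lemma for s in sentences for _, lemma, _ in s)
print(len(sentences), lemma_counts.most_common(1))
```

With a real file, `read_vertical(open(path, encoding="utf-8"))` would stream sentences without loading the whole corpus into memory, which matters for texts of 20,000 words or more.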
Created:
2023-10-17