Croatian linguistic training corpus hr500k 2.0

Source: SSH Open MarketPlace. Updated 2025-07-04; indexed 2025-07-05.

Download link:

https://marketplace.sshopencloud.eu/dataset/mBDaux

Resource description:

This training corpus contains about 500,000 tokens manually annotated at the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation and named entities. About half of the corpus is also manually annotated with syntactic dependencies. A subset of the syntactically annotated part is also annotated for multi-word expressions. Furthermore, about a fifth of the corpus is annotated with semantic role labels. The annotation formalisms followed in the hr500k corpus are (1) the [MULTEXT-East V6](https://nl.ijs.si/ME/V6/msd/) morphosyntactic specifications for the Serbo-Croatian macro-language, (2) the [UDv2 Guidelines](http://universaldependencies.org/guidelines.html), (3) the [Janes annotation guidelines for named entities](https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf), (4) the [PARSEME guidelines for annotating multi-word expressions](https://parsemefr.lis-lab.fr/parseme-st-guidelines/1.3/) and (5) the [semantic role labelling annotation protocol for Slovenian and Croatian](https://www.sdjt.si/wp/wp-content/uploads/2018/09/JTDH-2018_Gantar-et-al_Towards-Semantic-Role-Labeling-in-Slovene-and-Croatian.pdf). The corpus is available for download from the CLARIN.SI repository.
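To make the token-level layers above more concrete, here is a minimal sketch of reading one annotated sentence. It assumes the files use the ten-column, tab-separated CoNLL-U layout defined by the UD guidelines; the description does not state the distribution format, so check the CLARIN.SI download before relying on this. The function name `parse_conllu` and the sample sentence are illustrative only: the sentence is made up and not taken from hr500k, and the XPOS, FEATS and MISC columns are left as `_` rather than guessing MULTEXT-East tags or named-entity labels.

```python
# Standard CoNLL-U column names, per the UD format specification.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def parse_conllu(text):
    """Split CoNLL-U style text into sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Sentence-level comments (e.g. "# text = ...") are skipped here.
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns, got {len(cols)}")
        current.append(dict(zip(FIELDS, cols)))
    if current:
        sentences.append(current)
    return sentences


# Made-up sample sentence (NOT from the corpus); unknown columns are "_".
SAMPLE_ROWS = [
    ("1", "Ivan", "Ivan", "PROPN", "_", "_", "2", "nsubj", "_", "_"),
    ("2", "živi", "živjeti", "VERB", "_", "_", "0", "root", "_", "_"),
    ("3", "u", "u", "ADP", "_", "_", "4", "case", "_", "_"),
    ("4", "Zagrebu", "Zagreb", "PROPN", "_", "_", "2", "obl", "_", "_"),
    (".".replace(".", "5"), ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"),
]
SAMPLE = "# text = Ivan živi u Zagrebu.\n" + "\n".join(
    "\t".join(row) for row in SAMPLE_ROWS
) + "\n"

sentences = parse_conllu(SAMPLE)
for tok in sentences[0]:
    print(tok["id"], tok["form"], tok["lemma"], tok["upos"], tok["head"], tok["deprel"])
```

Each token dict carries the lemma, part of speech and dependency head side by side, which mirrors how the lemmatisation, morphosyntactic and syntactic layers described above attach to the same token.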

Created:

2025-07-04
