Croatian linguistic training corpus hr500k 2.0

SSH Open MarketPlace | Updated 2025-07-04 | Indexed 2025-07-05
Download link:
https://marketplace.sshopencloud.eu/dataset/mBDaux
Description:
This training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation and named entities. About half of the corpus is also manually annotated with syntactic dependencies. A subset of the syntactically annotated corpus is also annotated for multi-word expressions. Furthermore, about a fifth of the corpus is annotated with semantic role labels. The annotation formalisms followed in the hr500k corpus are (1) the [MULTEXT-East V6](https://nl.ijs.si/ME/V6/msd/) morphosyntactic specifications for the Serbo-Croatian macro-language, (2) the [UDv2 Guidelines](http://universaldependencies.org/guidelines.html), (3) the [Janes annotation guidelines for named entities](https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf), (4) the [PARSEME guidelines for annotating multi-word expressions](https://parsemefr.lis-lab.fr/parseme-st-guidelines/1.3/) and (5) the [semantic role labelling annotation protocol for Slovenian and Croatian](https://www.sdjt.si/wp/wp-content/uploads/2018/09/JTDH-2018_Gantar-et-al_Towards-Semantic-Role-Labeling-in-Slovene-and-Croatian.pdf). The corpus is available for download from the CLARIN.SI repository.
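As a rough illustration of how the token-level layers described above (tokenisation, lemma, morphosyntactic tag, dependency head and relation) could be read, here is a minimal sketch. It assumes the data is laid out in the ten-column, tab-separated CoNLL-U format used by Universal Dependencies; this listing does not state the actual distribution format, so check the files in the CLARIN.SI repository first. The sample sentence and its tags are hand-written for illustration and are not taken from hr500k, and `read_conllu` is a hypothetical helper name. The named-entity, multi-word-expression and semantic-role layers are not covered.

```python
from collections import Counter

# Hand-written illustrative sentence (NOT from hr500k), in CoNLL-U layout.
SAMPLE = """\
# sent_id = example-1
# text = Ovo je rečenica.
1\tOvo\tovaj\tDET\tPd-nsn\t_\t3\tnsubj\t_\t_
2\tje\tbiti\tAUX\tVar3s\t_\t3\tcop\t_\t_
3\trečenica\trečenica\tNOUN\tNcfsn\t_\t0\troot\t_\tSpaceAfter=No
4\t.\t.\tPUNCT\tZ\t_\t3\tpunct\t_\t_

"""

# The ten CoNLL-U columns, in order.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def read_conllu(text):
    """Return a list of sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment (sent_id, text, ...)
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        if not cols[0].isdigit():
            continue  # skip multiword-token ranges (1-2) and empty nodes (1.1)
        current.append(dict(zip(FIELDS, cols)))
    if current:
        sentences.append(current)
    return sentences


sentences = read_conllu(SAMPLE)
upos_counts = Counter(tok["upos"] for sent in sentences for tok in sent)
```

Here the `xpos` column is where a MULTEXT-East morphosyntactic tag would sit and `upos`, `head` and `deprel` carry the UD layer, under the same format assumption.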
Created:
2025-07-04