Serbian linguistic training corpus SETimes.SR 2.0

Source: SSH Open Marketplace. Updated 2023-10-17; indexed 2024-08-03.

Download link:

https://marketplace.sshopencloud.eu/dataset/mnjY2S

Resource description:

This training corpus contains around 100,000 tokens manually annotated at the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation, syntactic dependencies, and named entities. The annotation formalisms followed in the SETimes.SR corpus are (1) the [MULTEXT-East V6 morphosyntactic specifications](http://nl.ijs.si/ME/V6/msd/), (2) the [UDv2 Guidelines](http://universaldependencies.org/guidelines.html), and (3) the [Janes annotation guidelines for named entities](http://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf). This version differs from the previous one in two ways: (1) the corpus has been extended with 502 sentences from various news sources, and (2) its annotations have been improved. The corpus is available for download from the CLARIN.SI repository.

Created:

2023-10-17
