Training corpus SETimes.SR 1.0

Name: Training corpus SETimes.SR 1.0
Creator: hdl.handle.net
License: No description available

Source: hdl.handle.net, indexed 2025-01-21

Download link:

http://hdl.handle.net/11356/1200

Description:

The SETimes.SR training corpus contains 86 726 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation, syntactic dependencies, and named entities. The annotations (and other aspects) of the corpus are documented in the teiHeader and back element of the TEI encoded corpus. In short, they follow (1) the MULTEXT-East V5 morphosyntactic specifications, http://nl.ijs.si/ME/V5/msd/, (2) the UDv2 Guidelines, http://universaldependencies.org/guidelines.html, and (3) the Janes annotation guidelines for named entities, http://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf.
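
As a rough illustration of how a TEI-encoded corpus of this kind might be inspected, the sketch below counts sentences and tokens and extracts form/lemma pairs. It assumes the common TEI conventions of `<s>` for sentences, `<w>` for words, `<pc>` for punctuation, and a `lemma` attribute on words; the inline sample is hypothetical and is not taken from the corpus, whose actual layout is documented in its teiHeader and may differ.

```python
import xml.etree.ElementTree as ET

# Standard TEI namespace; the element names used below are assumed conventions.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical sample, NOT taken from SETimes.SR; the real encoding may differ.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><p>
    <s>
      <w lemma="ovo">Ovo</w>
      <w lemma="biti">je</w>
      <w lemma="primer">primer</w>
      <pc>.</pc>
    </s>
  </p></body></text>
</TEI>"""


def corpus_stats(root):
    """Count sentences and tokens (words plus punctuation) under root."""
    sentences = root.findall(".//tei:s", TEI_NS)
    words = root.findall(".//tei:w", TEI_NS)
    puncts = root.findall(".//tei:pc", TEI_NS)
    return {"sentences": len(sentences), "tokens": len(words) + len(puncts)}


def form_lemma_pairs(root):
    """Return (surface form, lemma) for every word element, in document order."""
    return [(w.text, w.get("lemma")) for w in root.findall(".//tei:w", TEI_NS)]


root = ET.fromstring(SAMPLE)
stats = corpus_stats(root)
pairs = form_lemma_pairs(root)
print(stats)
print(pairs)
```

To run this against the downloaded corpus, `ET.fromstring(SAMPLE)` could be swapped for `ET.parse(path).getroot()`, after checking the element and attribute names against the corpus header.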

Provided by:

hdl.handle.net
