Training corpus SETimes.SR 1.0
Source: hdl.handle.net (indexed 2025-01-21)
Download link:
http://hdl.handle.net/11356/1200
Description:
The SETimes.SR training corpus contains 86 726 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation, syntactic dependencies, and named entities.
The annotations (and other aspects) of the corpus are documented in the teiHeader and back element of the TEI encoded corpus. In short, they follow (1) the MULTEXT-East V5 morphosyntactic specifications, http://nl.ijs.si/ME/V5/msd/, (2) the UDv2 Guidelines, http://universaldependencies.org/guidelines.html, and (3) the Janes annotation guidelines for named entities, http://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf.
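Since the corpus is TEI encoded, a first look at it can be taken with a standard XML parser. The following is a minimal sketch, not a description of the actual file layout: it assumes that sentences are `s` elements, that words and punctuation are `w` and `pc` elements, and that lemmas and morphosyntactic tags sit in `lemma` and `ana` attributes. The inline sample and the `summarize` helper are hypothetical; check the teiHeader of the real corpus before relying on any of these names.

```python
import xml.etree.ElementTree as ET

# TEI P5 namespace; element and attribute names below are assumptions
# about the corpus encoding, not taken from its documentation.
TEI = "http://www.tei-c.org/ns/1.0"
NS = {"tei": TEI}

# Hypothetical miniature sample standing in for the real corpus file.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc><titleStmt><title>Sample</title></titleStmt></fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>
        <s>
          <w lemma="ovo" ana="mte:Pd-nsn">Ovo</w>
          <w lemma="biti" ana="mte:Var3s">je</w>
          <w lemma="primer" ana="mte:Ncmsn">primer</w>
          <pc ana="mte:Z">.</pc>
        </s>
      </p>
    </body>
  </text>
</TEI>"""


def summarize(tei_xml):
    """Return the title, sentence count, token count, and lemmas of a TEI string."""
    root = ET.fromstring(tei_xml)

    # Read the title from the header, if one is present.
    header = root.find("tei:teiHeader", NS)
    title_el = header.find(".//tei:title", NS) if header is not None else None
    title = title_el.text if title_el is not None else None

    # Count words and punctuation marks inside sentences as tokens.
    token_tags = {f"{{{TEI}}}w", f"{{{TEI}}}pc"}
    sentences = root.findall(".//tei:s", NS)
    tokens = [el for s in sentences for el in s.iter() if el.tag in token_tags]

    # Only elements carrying a lemma attribute contribute a lemma.
    lemmas = [el.get("lemma") for el in tokens if el.get("lemma") is not None]

    return {
        "title": title,
        "sentences": len(sentences),
        "tokens": len(tokens),
        "lemmas": lemmas,
    }


if __name__ == "__main__":
    print(summarize(SAMPLE))
```

To apply the same idea to the downloaded corpus, the file contents would be read from disk and passed to `summarize`; if the token count differs from the documented 86 726, the assumed element names are the first thing to check.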
Provider:
hdl.handle.net



