TalbankenSBX

DataCite Commons; updated 2025-12-12; indexed 2025-04-16
Download link:
https://spraakbanken.gu.se/resurser/talbanken
Resource description:
Talbanken is a widely used Swedish treebank; read more about its history and different versions here. This version originated as a copy of TalbankenSTB, but unlike the STB version, it is open to changes and corrections. It is also the version indexed by our search engine Korp. The changes we have made are listed in changelog.txt.

Annotation

The following annotation layers were added (or corrected) manually and can be considered gold data: tokenization, sentence segmentation, POS, MSD, and dependency syntax (deprel and dephead). Tokenization, sentence segmentation, POS and MSD follow the SUC format. Syntactic annotation follows the Mamba-Dep format, a conversion to dependency grammar of the MAMBA format used in the original Talbanken76. Read more about these annotation layers in the documentation for TalbankenSTB or at Joakim Nivre's page: tokenization and sentence segmentation, POS and MSD, dependency syntax.

Formats and splits

TalbankenSBX is provided in our standard XML format and in a (pseudo-)CONLLU format, where UPOS is POS in the SUC format, XPOS is POS+MSD, Feats are MSD converted to the UD/CONLLU standard, and Deprel is a Mamba-Dep relation. There are currently no text or SpaceAfter attributes. You may convert our XML to this format yourself using the script in this repository.

We provide two splits of TalbankenSBX.

MorphSplit is used for POS tagging: the treebank is divided into two parts with the same number of sentences (the split is completely random, no blocks are used). One part is the development set and the other is the test set (SUC3 is the training set). You may resplit Talbanken yourself using the script in this repository.

SyntSplit is used for dependency parsing: the treebank is divided into training, development and test sets. The training set is the same as the one in TalbankenSTB, whereas dev and test approximate dev and test in the UD version as closely as possible. The SyntSplit is provided only in the CONLLU format.
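As a rough illustration of the column layout and the MorphSplit idea described above, the following Python sketch reads tab-separated ten-column CONLLU text and splits the sentences into two random halves. It is not the script from the repository: the function names `parse_conllu` and `random_halves`, the seed, and the sample sentences are all assumptions made for this example. The tag and relation values in `SAMPLE` are invented placeholders rather than lines from the treebank, and Feats is left as `_` instead of guessing the converted values.

```python
import random
from typing import Dict, List, Tuple

# The ten standard CoNLL-U column names. In the pseudo-CONLLU described above,
# "upos" holds the SUC POS tag, "xpos" holds POS+MSD and "deprel" holds a
# Mamba-Dep relation.
COLUMNS = ["id", "form", "lemma", "upos", "xpos",
           "feats", "head", "deprel", "deps", "misc"]

Token = Dict[str, str]
Sentence = List[Token]

# Invented two-sentence sample (hypothetical values, NOT taken from the treebank).
SAMPLE = (
    "1\tHon\thon\tPN\tPN.UTR.SIN.DEF.SUB\t_\t2\tSS\t_\t_\n"
    "2\tsover\tsova\tVB\tVB.PRS.AKT\t_\t0\tROOT\t_\t_\n"
    "\n"
    "1\tDe\tde\tPN\tPN.UTR+NEU.PLU.DEF.SUB\t_\t2\tSS\t_\t_\n"
    "2\tläser\tläsa\tVB\tVB.PRS.AKT\t_\t0\tROOT\t_\t_\n"
)


def parse_conllu(text: str) -> List[Sentence]:
    """Group token lines into sentences; blank lines separate sentences."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the sentence collected so far.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Skip comment lines such as sentence-level metadata.
            continue
        fields = line.split("\t")
        if len(fields) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} columns, got {len(fields)}")
        current.append(dict(zip(COLUMNS, fields)))
    if current:
        sentences.append(current)
    return sentences


def random_halves(sentences: List[Sentence],
                  seed: int = 0) -> Tuple[List[Sentence], List[Sentence]]:
    """Shuffle whole sentences (no blocks) and cut the list in the middle."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


if __name__ == "__main__":
    sents = parse_conllu(SAMPLE)
    dev, test = random_halves(sents, seed=0)
    print(len(sents), len(dev), len(test))
```

With an odd number of sentences the two halves in this sketch differ in size by one; how the actual split script handles that case is not stated in the description.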
Provider:
Språkbanken Text
Created:
2024-06-18