Arboretum treebank

Name: Arboretum treebank
Creator: ELRA Catalogue of Language Resources
Published: 2015-11-30 00:00:00
License: No description available

Source: catalogue.elra.info
Updated: 2015-11-30
Indexed: 2025-03-22

Download link:

https://catalogue.elra.info/en-us/repository/browse/ELRA-W0084/

Resource description:

The Arboretum treebank is a morphologically and syntactically annotated repository of Danish sentences taken from Korpus 90 and Korpus 2000. Both corpora were compiled by the Society for Danish Language and Literature (http://ordnet.dk/korpusdk/fakta) and contain samples of written Danish from the 1990s and from around the year 2000, respectively. The treebank consists of about 425,000 tokens. Approximately 22,260 sentences/utterances contain 3 or more tokens.

In a first pass, all material was tokenized and tagged with the DanGram parser, using hand-written Constraint Grammar rules. In the next stage, the parser's dependency grammar and constituent conversion were applied to produce full syntactic tree structures. The automatic annotation was then revised at both the morphosyntactic and the structural levels, with iterative improvements made to the parser at the same time.

Arboretum provides named entity categories for all proper nouns. It also contains subclass categorisation for the pronoun and adverb word classes, facilitating conversion to different descriptive traditions. In addition, the dependency version contains structural markers concerning coordination and clause boundaries, as well as some morphological information concerning compounding.

The final treebank comprises two independent versions, constituent trees and dependency trees, and is distributed in the following formats:

1. Native dependency format (Constraint Grammar format)
2. Dependency annotation converted to MALT xml format
3. Native constituent tree format (Cross-language VISL standard)
4. Constituent format converted to TIGER xml
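As a rough illustration of working with the MALT xml dependency version, the sketch below reads a tiny hand-made sample and counts the sentences with 3 or more tokens, the same threshold used in the description above. The element and attribute names (`treebank`, `sentence`, `word`, `form`, `postag`, `head`, `deprel`), the tag values, and the sample sentences are assumptions for illustration only; the actual files distributed with Arboretum may be structured differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical MaltXML-style sample. Element names, attribute names and
# tag values are assumed, not taken from the Arboretum distribution.
SAMPLE = """<treebank>
  <sentence id="1">
    <word id="1" form="Hun" postag="PERS" head="2" deprel="SUBJ"/>
    <word id="2" form="læser" postag="V" head="0" deprel="ROOT"/>
    <word id="3" form="bogen" postag="N" head="2" deprel="ACC"/>
  </sentence>
  <sentence id="2">
    <word id="1" form="Ja" postag="IN" head="0" deprel="ROOT"/>
  </sentence>
</treebank>"""


def read_sentences(xml_text):
    """Return one list per sentence of (id, form, head, deprel) tuples."""
    root = ET.fromstring(xml_text)
    sentences = []
    for sent in root.iter("sentence"):
        tokens = [
            (int(w.get("id")), w.get("form"), int(w.get("head")), w.get("deprel"))
            for w in sent.iter("word")
        ]
        sentences.append(tokens)
    return sentences


def count_min_tokens(sentences, minimum=3):
    """Count sentences that contain at least `minimum` tokens."""
    return sum(1 for tokens in sentences if len(tokens) >= minimum)


sentences = read_sentences(SAMPLE)
print(len(sentences), count_min_tokens(sentences))
```

For a real file, `ET.parse(path).getroot()` could replace `ET.fromstring`, after checking the actual element and attribute names against the distributed data.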

Provider:

ELRA Catalogue of Language Resources
