Akindelevictoria/yoruba-constituency-treebank

Name: Akindelevictoria/yoruba-constituency-treebank
Creator: Akindelevictoria
Published: 2025-12-18 05:02:41
License: Creative Commons Attribution 4.0 International (CC BY 4.0)

Source: Hugging Face (updated 2025-12-18, indexed 2025-12-20)

Download link:

https://hf-mirror.com/datasets/Akindelevictoria/yoruba-constituency-treebank

Description:

This dataset contains a manually annotated Yoruba constituency treebank consisting of 1,000 sentences. The treebank was developed as part of an undergraduate linguistics research project focused on Yoruba syntax and computational parsing for under-resourced languages. The annotations follow a phrase-structure (constituency) framework, including labels such as NP, VP, IP, and CP. The dataset includes Yoruba sentences, English translations, sentence types, POS tags, and manually annotated constituency trees. The sentences were selected from a range of sources to ensure syntactic diversity, including Yoruba grammar textbooks, BBC Yoruba and other media sources, the Yoruba Bible (Bibeli Mimọ), literary texts, and spoken and conversational Yoruba. The dataset can be used for linguistic analysis of Yoruba syntax, constituency parsing experiments, NLP research on under-resourced languages, parser evaluation and fine-tuning, and educational and documentation purposes. The dataset is released under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
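For the constituency parsing and linguistic analysis uses mentioned above, the following is a minimal sketch of how a single annotated tree might be read and inspected. It assumes the trees are stored as Penn-style bracketed strings, which this listing does not confirm, so check the actual files first. The example sentence and its bracketing are illustrative only and are not taken from the dataset; the function names are hypothetical helpers.

```python
from collections import Counter


def parse_bracketed(text):
    """Parse a bracketed tree string into nested (label, children) tuples.

    Assumption: trees look like "(IP (NP (N word)) ...)". The real
    annotation format of the dataset may differ.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while pos < len(tokens) and tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())  # nested constituent
            else:
                children.append(tokens[pos])  # leaf word
                pos += 1
        if pos >= len(tokens):
            raise ValueError("unbalanced brackets")
        pos += 1  # consume ')'
        return (label, children)

    tree = read_node()
    if pos != len(tokens):
        raise ValueError("unexpected tokens after the root constituent")
    return tree


def leaves(tree):
    """Return the words of the sentence in left-to-right order."""
    _, children = tree
    words = []
    for child in children:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words


def label_counts(tree):
    """Count how often each label (NP, VP, IP, CP, POS tags) occurs."""
    label, children = tree
    counts = Counter([label])
    for child in children:
        if not isinstance(child, str):
            counts.update(label_counts(child))
    return counts


# Illustrative sentence only ("Olu bought a book"); not from the dataset.
example = "(IP (NP (N Olú)) (VP (V ra) (NP (N ìwé))))"
tree = parse_bracketed(example)
print(leaves(tree))
print(label_counts(tree))
```

Counting labels this way across all 1,000 sentences would give a quick profile of how often NP, VP, IP, and CP constituents occur, which can help when checking a parser's output against the manual annotation.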

Provided by:

Akindelevictoria
