Head-word model for the CLE-UTB treebank.

Source: NIAID Data Ecosystem (indexed 2026-05-10)

Download link:

https://figshare.com/articles/dataset/Head-word_model_for_the_CLE-UTB_treebank_/30213250

Description:

We address the challenge of syntactic parsing for Urdu, a morphologically rich language, and present state-of-the-art results for both constituency and dependency parsing. This paper offers four major contributions: 1) the conversion of the CLE-UTB phrase structure treebank into a dependency treebank through language-specific head-word rules and phrase-to-dependency label mapping rules; 2) a novel sequence labeling scheme that casts both parsing tasks in a unified representation; 3) the training of contextualized word representations on a large, 220-million-token Urdu corpus collected from the web; and 4) the development of a parsing framework under two learning paradigms, single-task and multi-task learning. Several post-processing rules are applied to improve the quality of the automatically converted dependency treebank. The proposed sequence labeling scheme enables a shared architecture that learns from both syntactic representations (constituency and dependency) simultaneously and hence improves generalization. Experiments show that the multi-task learning setup significantly enhances parsing performance, achieving an F1 score of 91.39 for constituency parsing (an improvement of 3.29 points) and a labeled attachment score of 85.69 for dependency parsing (an improvement of 1.49 points). These results demonstrate that learning cross-task representations provides measurable benefits and advances the state of syntactic parsing for Urdu.
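To make the first contribution concrete, the following is a minimal sketch of how head-word rules can turn a phrase structure tree into unlabeled dependencies. It is not the released head-word model: the `HEAD_RULES` table, the phrase and POS labels, the `Node` class, and the romanized example sentence are all hypothetical and chosen only for illustration. The actual CLE-UTB rules, the dependency label mapping, and the post-processing steps are defined in the dataset and paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    """A constituency tree node; leaves carry a word and a 1-based token index."""
    label: str
    children: List["Node"] = field(default_factory=list)
    word: str = ""
    index: int = -1


# Hypothetical head rules (NOT the CLE-UTB rules):
# phrase label -> (scan direction, child labels in priority order).
HEAD_RULES: Dict[str, Tuple[str, List[str]]] = {
    "S": ("right", ["VP"]),
    "VP": ("right", ["VB", "VP"]),
    "NP": ("right", ["NN", "NNP", "NP"]),
    "PP": ("right", ["PSP"]),
}


def head_child(node: Node) -> Node:
    """Pick the head child of a phrase using the rule table, with a fallback."""
    direction, priorities = HEAD_RULES.get(node.label, ("right", []))
    ordered = node.children[::-1] if direction == "right" else node.children
    for label in priorities:
        for child in ordered:
            if child.label == label:
                return child
    return ordered[0]  # fallback: first child in scan order


def head_token(node: Node) -> int:
    """Follow head children down to a leaf and return its token index."""
    while node.children:
        node = head_child(node)
    return node.index


def to_dependencies(root: Node) -> Dict[int, int]:
    """Map each token index to its head's index (0 marks the root)."""
    heads = {head_token(root): 0}
    stack = [root]
    while stack:
        node = stack.pop()
        if not node.children:
            continue
        head = head_child(node)
        head_index = head_token(head)
        for child in node.children:
            if child is not head:
                # Each non-head child attaches to the phrase's head token.
                heads[head_token(child)] = head_index
            stack.append(child)
    return heads


# Illustrative tree: (S (PP (NP Ali) (PSP ne)) (VP (NP kitab) (VB parhi)))
tree = Node("S", [
    Node("PP", [
        Node("NP", [Node("NNP", word="Ali", index=1)]),
        Node("PSP", word="ne", index=2),
    ]),
    Node("VP", [
        Node("NP", [Node("NN", word="kitab", index=3)]),
        Node("VB", word="parhi", index=4),
    ]),
])
heads = to_dependencies(tree)
print(sorted(heads.items()))  # [(1, 2), (2, 4), (3, 4), (4, 0)]
```

Under these assumed rules the verb is the root, the postposition heads its phrase, and the object noun attaches to the verb. A full conversion would also assign a dependency label to each arc from the phrase labels involved, which this sketch omits.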

Created:

2025-09-25
