
Head-word model for the CLE-UTB treebank.

Source: NIAID Data Ecosystem (indexed 2026-05-10)
Download link:
https://figshare.com/articles/dataset/Head-word_model_for_the_CLE-UTB_treebank_/30213250
Resource description:
We address the challenge of syntactic parsing for Urdu, a morphologically rich language, and present state-of-the-art results for both constituency and dependency parsing. This paper makes four major contributions: 1) the conversion of the CLE-UTB phrase structure treebank into a dependency treebank, using language-specific head-word rules and phrase-to-dependency label mapping rules; 2) a novel sequence labeling scheme that casts both parsing tasks in a unified representation; 3) the training of contextualized word representations on a 220-million-token Urdu corpus collected from the web; and 4) the development of a parsing framework under two learning paradigms, single-task and multi-task learning. Several post-processing rules are applied to improve the quality of the automatically converted dependency treebank. The proposed sequence labeling scheme enables a shared architecture that learns from constituency and dependency structures simultaneously, which improves generalization. Experiments show that the multi-task learning setup significantly enhances parsing performance, achieving an F1 score of 91.39 for constituency parsing (an improvement of 3.29 points) and a labeled attachment score of 85.69 for dependency parsing (an improvement of 1.49 points). These results demonstrate that learning cross-task representations provides measurable benefits and advances the state of syntactic parsing for Urdu.
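As a rough illustration of what a head-word model does in a phrase-structure-to-dependency conversion, the sketch below percolates heads through a toy tree. The rule table `HEAD_RULES`, the labels, the tree encoding, and the placeholder tokens `w1`..`w3` are all assumptions made for this example; they are not the CLE-UTB rules or tag set, and dependency labels are left out.

```python
from typing import Dict, List, Tuple

# HYPOTHETICAL head rules, for illustration only (not the CLE-UTB rules).
# phrase label -> (search direction, child labels in priority order)
HEAD_RULES: Dict[str, Tuple[str, List[str]]] = {
    "S": ("right", ["VP", "NP"]),
    "VP": ("right", ["VB", "VP"]),
    "NP": ("right", ["NN", "PRP", "NP"]),
}


def is_leaf(node) -> bool:
    # A leaf is (tag, word); a phrase is (label, [children]).
    return isinstance(node[1], str)


def head_child(node) -> int:
    """Return the index of the child chosen as head of this phrase."""
    children = node[1]
    direction, priorities = HEAD_RULES.get(node[0], ("right", []))
    order = list(range(len(children)))
    if direction == "right":
        order.reverse()
    for wanted in priorities:
        for i in order:
            if children[i][0] == wanted:
                return i
    # Fallback: first child met in the search direction.
    return order[0]


def to_dependencies(tree) -> Tuple[List[str], List[int]]:
    """Return (words, heads); heads[i] is the 1-based head of word i+1, 0 = root."""
    words: List[str] = []
    heads: Dict[int, int] = {}

    def visit(node) -> int:
        if is_leaf(node):
            words.append(node[1])
            return len(words)  # 1-based position of this word
        child_heads = [visit(child) for child in node[1]]
        head = child_heads[head_child(node)]
        # Every non-head child attaches its head word to the phrase head.
        for ch in child_heads:
            if ch != head:
                heads[ch] = head
        return head

    root = visit(tree)
    heads[root] = 0
    return words, [heads[i] for i in range(1, len(words) + 1)]


# Toy verb-final tree with placeholder tokens.
example = ("S", [
    ("NP", [("PRP", "w1")]),
    ("VP", [("NP", [("NN", "w2")]), ("VB", "w3")]),
])
words, heads = to_dependencies(example)
```

Here the verb `w3` becomes the root and both `w1` and `w2` attach to it. A full conversion would also assign a dependency label to each arc through a phrase-to-dependency label mapping, which this sketch omits.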

Created:
2025-09-25